<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Cloudy Things: Simple Cloud Stuff]]></title><description><![CDATA[I'm a Cloud Software Architect and author of the Simple AWS newsletter (www.simpleaws.dev). This blog is just my semi-structured ramblings about pretty basic cloud stuff.]]></description><link>https://blog.guilleojeda.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1685386304250/hdNIA3oLJ.png</url><title>Cloudy Things: Simple Cloud Stuff</title><link>https://blog.guilleojeda.com</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 22:30:03 GMT</lastBuildDate><atom:link href="https://blog.guilleojeda.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building Intelligent Agentic Applications with Amazon Bedrock and Nova]]></title><description><![CDATA[What Are Agentic AI Architectures?
I won't waste your time with a long, fluffy introduction about how AI is changing the world. Let's get straight to the point: agentic AI architectures are fundamentally different from the prompt-response pattern you...]]></description><link>https://blog.guilleojeda.com/building-intelligent-agentic-applications-with-amazon-bedrock-and-nova</link><guid isPermaLink="true">https://blog.guilleojeda.com/building-intelligent-agentic-applications-with-amazon-bedrock-and-nova</guid><category><![CDATA[AWS]]></category><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Fri, 07 Mar 2025 20:44:58 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-what-are-agentic-ai-architectures">What Are Agentic AI Architectures?</h2>
<p>I won't waste your time with a long, fluffy introduction about how AI is changing the world. Let's get straight to the point: agentic AI architectures are fundamentally different from the prompt-response pattern you're probably used to with language models.</p>
<p>In an agentic architecture, the AI doesn't just spit out a response to your input. Instead, it functions as an autonomous agent that breaks down complex tasks into steps, executes those steps by calling the right tools, and uses the results to inform subsequent actions. Think of it as the difference between asking someone a question and hiring them to do a job - the agent actually does work on your behalf rather than just answering.</p>
<p>Amazon Bedrock and the Nova model family are AWS's offering in this space. Bedrock provides the managed infrastructure and orchestration, while Nova models serve as the intelligence. In this article we'll dig into how these technologies work together, the architectural patterns for implementing agentic systems, and the practical considerations for building them at scale.</p>
<h2 id="heading-understanding-amazon-bedrock-and-the-nova-model-family">Understanding Amazon Bedrock and the Nova Model Family</h2>
<p>Amazon Bedrock is AWS's fully managed service for building generative AI applications. It provides a unified API for accessing foundation models, but it's not just a model gateway: it's a comprehensive platform for building, deploying, and running AI applications without managing infrastructure.</p>
<p>The Amazon Nova family is AWS's proprietary set of foundation models, with several variants optimized for different use cases:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<th>Model</th><th>Type</th><th>Context Window</th><th>Multimodal?</th><th>Best For</th><th>Pricing</th></tr>
</thead>
<tbody>
<tr>
<td>Nova Micro</td><td>Text-only</td><td>32K tokens</td><td>No</td><td>Simple tasks, classification, high volume</td><td>$0.000035/1K input tokens, $0.00014/1K output tokens</td></tr>
<tr>
<td>Nova Lite</td><td>Multimodal</td><td>128K tokens</td><td>Yes (text, image, video)</td><td>Balanced performance, routine agent tasks</td><td>$0.00006/1K input tokens, $0.00024/1K output tokens</td></tr>
<tr>
<td>Nova Pro</td><td>Multimodal</td><td>Up to 300K tokens</td><td>Yes (text, image, video)</td><td>Complex reasoning, sophisticated agents</td><td>$0.0008/1K input tokens, $0.0032/1K output tokens</td></tr>
</tbody>
</table>
</div><p>What makes these models particularly suited for agentic applications? First, they're optimized for function calling: the ability to output structured JSON requests for external tools. Second, those large context windows allow agents to maintain extensive conversation history and detailed instructions. Third, the multimodal capabilities (in Lite and Pro) let agents process images and videos alongside text.</p>
<p>Under the hood, Bedrock scales compute resources automatically based on demand. When your agent suddenly gets hit with a traffic spike, AWS provisions additional resources to maintain performance. There's no infrastructure for you to manage, just APIs to call.</p>
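<p>To make the function-calling point concrete, here's a sketch of how a tool is described to a Nova model through Bedrock's Converse API. The tool name, schema, model ID, and region below are illustrative (some regions require an inference profile ID instead of the bare model ID), and actually sending the request requires boto3 and AWS credentials:</p>
<pre><code class="lang-python"># Sketch: describing a tool to a Nova model via Bedrock's Converse API.
# The tool name and schema are illustrative, not a real flight API.
request = {
    "modelId": "amazon.nova-lite-v1:0",
    "messages": [
        {"role": "user", "content": [{"text": "Find flights from JFK to LAX on 2025-04-01"}]}
    ],
    "toolConfig": {
        "tools": [
            {
                "toolSpec": {
                    "name": "search_flights",
                    "description": "Search for flights between two airports on a date",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {
                                "origin": {"type": "string"},
                                "destination": {"type": "string"},
                                "departDate": {"type": "string"},
                            },
                            "required": ["origin", "destination", "departDate"],
                        }
                    },
                }
            }
        ]
    },
}

# With boto3 and credentials configured, the call would be:
# client = boto3.client("bedrock-runtime", region_name="us-east-1")
# response = client.converse(**request)
# If the model decides to use the tool, the response message content
# contains a toolUse block with the structured JSON arguments.
</code></pre>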
<h2 id="heading-agentic-architectures-beyond-simple-prompt-response-systems">Agentic Architectures: Beyond Simple Prompt-Response Systems</h2>
<p>So what exactly makes agentic architectures different from regular LLM applications? Let me break it down with a practical analogy.</p>
<p>A traditional LLM application is like asking someone a question at an information desk: you expect them to answer based on what they know, but they won't leave their desk to do anything for you. An agentic architecture is more like having a personal assistant: they'll not only answer your questions, but also make phone calls, look up information, and take actions on your behalf.</p>
<p>The foundation of this approach is what we call the Reason-Act-Observe loop:</p>
<ol>
<li><p>Reason: The agent analyzes the current state and decides what to do next</p>
</li>
<li><p>Act: It executes an action by calling an external tool/API</p>
</li>
<li><p>Observe: It processes the result from that action</p>
</li>
<li><p>Loop: Based on what it observed, it reasons again about the next step</p>
</li>
</ol>
<p>This cycle continues until the agent determines it has completed the task. It's similar to how you might approach a complex task: you don't solve problems in one leap, but through a series of steps, evaluating after each one.</p>
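<p>The loop above can be sketched in a few lines of Python. This is a toy simulation with a hard-coded "reasoner" and a mock tool standing in for the model and real APIs; it is not Bedrock's actual orchestration logic:</p>
<pre><code class="lang-python"># Toy simulation of the Reason-Act-Observe loop (not Bedrock's real runtime).
def run_agent(task, tools, reason, max_steps=5):
    observations = []
    for _ in range(max_steps):
        # Reason: decide the next action from the task and what we've seen so far
        decision = reason(task, observations)
        if decision["action"] == "finish":
            return decision["answer"]
        # Act: call the chosen tool with the chosen arguments
        result = tools[decision["action"]](**decision["args"])
        # Observe: record the result so the next reasoning step can use it
        observations.append((decision["action"], result))
    return "gave up after max_steps"

# Mock tool and a hard-coded "reasoner" standing in for the model
def get_weather(city):
    return {"city": city, "forecast": "sunny"}

def fake_reason(task, observations):
    if not observations:
        return {"action": "get_weather", "args": {"city": "New York"}}
    return {"action": "finish", "answer": f"Forecast: {observations[-1][1]['forecast']}"}

print(run_agent("weather in NY?", {"get_weather": get_weather}, fake_reason))
# prints: Forecast: sunny
</code></pre>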
<p>Here's how this translates to AWS implementations. When you build an agent on Bedrock, you're essentially defining what tools (AWS calls these "action groups") the agent can use, what data sources (knowledge bases) it can reference, and what instructions guide its behavior. The actual orchestration (deciding which tool to use when, and chaining the steps together) is handled by Bedrock's agent runtime.</p>
<p>This approach has clear advantages. An agent can handle requests like "Find me flights to New York next weekend, check the weather forecast, and suggest some hotels near Central Park", a request that would be impossible to fulfill in one shot. By breaking it into steps (search flights, check weather, find hotels), and calling APIs for each piece of data, the agent can assemble a comprehensive response.</p>
<p>But this approach isn't without trade-offs. Agentic systems are more complex to configure, potentially slower (since multiple steps and API calls take time), and generally more expensive in terms of both token usage and compute costs. You're paying for the additional reasoning steps and API calls that happen behind the scenes.</p>
<h2 id="heading-bedrock-agents-building-blocks-and-architecture">Bedrock Agents: Building Blocks and Architecture</h2>
<p>A Bedrock Agent consists of several key components:</p>
<p>The <strong>foundation model</strong> is the brain of your agent. For complex agents, Amazon Nova Pro is typically the best choice with its 300K token context window and multimodal capabilities. For simpler tasks or cost-sensitive applications, Nova Lite (128K tokens) or even Nova Micro (32K tokens) might be sufficient.</p>
<p>The <strong>instructions</strong> define what your agent does. This is effectively a system prompt that guides the agent's behavior. For example:</p>
<pre><code class="lang-plaintext">You are a travel planning assistant. Your job is to help users find flights, accommodations, and plan itineraries. You have access to flight search APIs, hotel databases, and weather forecasts. Always confirm dates and locations before making any bookings. If the user's request is ambiguous, ask clarifying questions.
</code></pre>
<p><strong>Action Groups</strong> (what other frameworks might call "tools") define what your agent can do in the world. Each action group contains:</p>
<ul>
<li><p>A schema (OpenAPI or function schema) describing available actions</p>
</li>
<li><p>A Lambda function implementing those actions</p>
</li>
</ul>
<p>For example, a flight search action might be defined with this schema:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">openapi:</span> <span class="hljs-string">"3.0.0"</span>
<span class="hljs-attr">info:</span>
  <span class="hljs-attr">title:</span> <span class="hljs-string">FlightSearchAPI</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"1.0"</span>
<span class="hljs-attr">paths:</span>
  <span class="hljs-string">/flights/search:</span>
    <span class="hljs-attr">get:</span>
      <span class="hljs-attr">summary:</span> <span class="hljs-string">Search</span> <span class="hljs-string">for</span> <span class="hljs-string">flights</span>
      <span class="hljs-attr">description:</span> <span class="hljs-string">Finds</span> <span class="hljs-string">available</span> <span class="hljs-string">flights</span> <span class="hljs-string">between</span> <span class="hljs-string">origin</span> <span class="hljs-string">and</span> <span class="hljs-string">destination</span> <span class="hljs-string">on</span> <span class="hljs-string">specified</span> <span class="hljs-string">dates.</span>
      <span class="hljs-attr">parameters:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">origin</span>
          <span class="hljs-attr">in:</span> <span class="hljs-string">query</span>
          <span class="hljs-attr">required:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">schema:</span>
            <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">Origin</span> <span class="hljs-string">airport</span> <span class="hljs-string">code</span> <span class="hljs-string">(e.g.,</span> <span class="hljs-string">"JFK"</span><span class="hljs-string">)</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">destination</span>
          <span class="hljs-attr">in:</span> <span class="hljs-string">query</span>
          <span class="hljs-attr">required:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">schema:</span>
            <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">Destination</span> <span class="hljs-string">airport</span> <span class="hljs-string">code</span> <span class="hljs-string">(e.g.,</span> <span class="hljs-string">"LAX"</span><span class="hljs-string">)</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">departDate</span>
          <span class="hljs-attr">in:</span> <span class="hljs-string">query</span>
          <span class="hljs-attr">required:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">schema:</span>
            <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">Departure</span> <span class="hljs-string">date</span> <span class="hljs-string">(YYYY-MM-DD)</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">returnDate</span>
          <span class="hljs-attr">in:</span> <span class="hljs-string">query</span>
          <span class="hljs-attr">required:</span> <span class="hljs-literal">false</span>
          <span class="hljs-attr">schema:</span>
            <span class="hljs-attr">type:</span> <span class="hljs-string">string</span>
          <span class="hljs-attr">description:</span> <span class="hljs-string">Return</span> <span class="hljs-string">date</span> <span class="hljs-string">for</span> <span class="hljs-string">round</span> <span class="hljs-string">trip</span> <span class="hljs-string">(YYYY-MM-DD)</span>
</code></pre>
<p>And a Lambda function to implement it:</p>
<pre><code class="lang-python">import json

def lambda_handler(event, context):
    # Bedrock Agents pass parameters as a list of {name, type, value} objects
    params = {p['name']: p['value'] for p in event.get('parameters', [])}
    origin = params.get('origin')
    destination = params.get('destination')
    depart_date = params.get('departDate')
    return_date = params.get('returnDate')

    # In a real implementation, you'd call your flight API here
    # For this example, we'll return mock data
    flights = [
        {
            "airline": "Oceanic Airlines",
            "flightNumber": "OA815",
            "departureTime": "08:15",
            "arrivalTime": "11:30",
            "price": 299.99,
            "currency": "USD"
        },
        {
            "airline": "United Airlines",
            "flightNumber": "UA456",
            "departureTime": "13:45",
            "arrivalTime": "17:00",
            "price": 349.99,
            "currency": "USD"
        }
    ]

    result = {
        "flights": flights,
        "origin": origin,
        "destination": destination,
        "departDate": depart_date,
        "returnDate": return_date
    }

    # Bedrock Agents expect this response envelope from action group Lambdas
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {
                "application/json": {
                    "body": json.dumps(result)
                }
            }
        }
    }
</code></pre>
<p>Optional <strong>Knowledge Bases</strong> connect your agent to external data. These use vector embeddings (typically generated with Amazon Titan Embeddings) to find relevant information in your data sources. For instance, if you have a knowledge base of travel guides and a user asks about "things to do in Barcelona," the agent can automatically retrieve and reference the Barcelona guide.</p>
<p><strong>Prompt Templates</strong> control how the agent processes information at different stages. There are four main templates:</p>
<ul>
<li><p>Pre-processing (validating user input)</p>
</li>
<li><p>Orchestration (driving the decision-making)</p>
</li>
<li><p>Knowledge Base (handling retrievals)</p>
</li>
<li><p>Post-processing (refining the final answer)</p>
</li>
</ul>
<p>The power of Bedrock Agents lies in how these components work together. When a user sends a request, the agent:</p>
<ol>
<li><p>Processes the user input</p>
</li>
<li><p>Enters an orchestration loop where it repeatedly:</p>
<ul>
<li><p>Decides what to do next (answer directly or use a tool)</p>
</li>
<li><p>If using a tool, calls the corresponding Lambda</p>
</li>
<li><p>Processes the result and decides on next steps</p>
</li>
</ul>
</li>
<li><p>Delivers the final response once the task is complete</p>
</li>
</ol>
<p>All of this happens automatically: your code just calls <code>invoke_agent</code>, and Bedrock handles the complex orchestration behind the scenes.</p>
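<p>Here's roughly what that call looks like with boto3. The agent and alias IDs are placeholders, and actually invoking the agent requires AWS credentials; the response comes back as a stream of chunk events that you concatenate into the final answer:</p>
<pre><code class="lang-python">def join_completion(events):
    # The invoke_agent response streams the answer as chunk events;
    # concatenate the chunk bytes into the final text.
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)

def ask_agent(prompt, session_id, agent_id="AGENT_ID", alias_id="ALIAS_ID"):
    # agent_id and alias_id are placeholders for your deployed agent
    import boto3  # imported here so the module loads without boto3 installed
    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=session_id,  # reuse the same ID to keep conversation context
        inputText=prompt,
    )
    return join_completion(response["completion"])
</code></pre>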
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-knowledge-bases-and-retrieval-augmented-generation">Knowledge Bases and Retrieval-Augmented Generation</h2>
<p>One of the most powerful features of Bedrock Agents is their ability to tap into your data through knowledge bases. This integration enables retrieval-augmented generation (RAG), where the agent grounds its responses in specific documents or data sources.</p>
<p>Setting up a knowledge base involves three steps:</p>
<ol>
<li><p>Prepare your data source. This could be documents in S3, a database, or another repository. Bedrock supports multiple file formats including PDFs, Word docs, text files, HTML, and more.</p>
</li>
<li><p>Create the knowledge base configuration, specifying:</p>
<ul>
<li><p>The data source (e.g., an S3 bucket)</p>
</li>
<li><p>An embedding model (e.g., Amazon Titan Embeddings)</p>
</li>
<li><p>Chunk size and overlap for document splitting</p>
</li>
<li><p>Metadata options for filtering</p>
</li>
</ul>
</li>
<li><p>Associate the knowledge base with your agent.</p>
</li>
</ol>
<p>When a user asks a question, the agent might determine it needs external information. It then:</p>
<ol>
<li><p>Formulates a search query based on the user's question</p>
</li>
<li><p>Sends this query to the knowledge base</p>
</li>
<li><p>Receives relevant document chunks</p>
</li>
<li><p>Incorporates these chunks into its reasoning</p>
</li>
<li><p>Generates a response grounded in this information</p>
</li>
</ol>
<p>There's a trade-off to consider with knowledge bases: adding retrieved content to prompts increases token count and therefore cost. A prompt that might normally be 500 tokens could easily grow to 2,000+ tokens with retrieved content. However, the improvement in answer quality is often worth it.</p>
<p>The chunking strategy significantly impacts retrieval quality. If chunks are too large, they'll contain irrelevant information and waste tokens. If they're too small, they might lose important context. A good starting point is 300-500 token chunks with about 10% overlap, but you'll need to experiment based on your specific content.</p>
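<p>As a starting point, that chunking strategy maps to a fixed-size chunking configuration when you create the data source. The shape below follows the bedrock-agent <code>create_data_source</code> API as I understand it (verify against the current API reference), with illustrative values in the 300-500 token range:</p>
<pre><code class="lang-python"># Sketch: fixed-size chunking settings for a knowledge base data source,
# roughly 400-token chunks with 10% overlap (tune for your content).
vector_ingestion_configuration = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 400,         # chunk size in tokens
            "overlapPercentage": 10,  # overlap between adjacent chunks
        },
    }
}
# Passed as vectorIngestionConfiguration to the bedrock-agent
# create_data_source call when setting up the knowledge base.
</code></pre>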
<h2 id="heading-performance-and-cost-optimization">Performance and Cost Optimization</h2>
<p>Let's talk numbers: how much will this actually cost you, and how do you keep it reasonable?</p>
<p>The cost of running agentic applications on Bedrock comes down to several factors:</p>
<ol>
<li><p><strong>Model Invocation Costs</strong>: This is the primary expense. Each time the agent "thinks," it invokes the foundation model, which charges per token. For Nova models, input tokens (what you send to the model) are 4 times cheaper than output tokens (what it generates). You can view the prices on the <a target="_blank" href="https://aws.amazon.com/bedrock/pricing/">official Bedrock pricing page</a>.</p>
</li>
<li><p><strong>Tool Execution Costs</strong>: Every tool the agent calls typically invokes a Lambda function and possibly other AWS services, each with their own costs.</p>
</li>
<li><p><strong>Knowledge Base Costs</strong>: These include the initial vectorization of your data, storage of embeddings, and retrieval operations.</p>
</li>
</ol>
<p>Here are some strategies to optimize costs:</p>
<p><strong>Use the right model for the job</strong>. Nova Micro is vastly cheaper than Nova Pro, so consider using it for simpler tasks. You could even implement a cascading approach: try with Micro first, and only escalate to Pro for complex queries.</p>
<p><strong>Optimize prompt sizes</strong>. Keep your instructions concise, trim conversation history when possible, and only include relevant information. Every token costs money.</p>
<p><strong>Take advantage of</strong> <a target="_blank" href="https://newsletter.simpleaws.dev/p/amazon-bedrock-prompt-caching"><strong>prompt caching</strong></a>. Bedrock caches repeated portions of prompts (like instructions or tool definitions) and offers up to 90% discount on those cached tokens. This can significantly reduce costs for agents that have consistent patterns.</p>
<p><strong>For high volume, use provisioned throughput</strong>. If you're consistently running many agent invocations, Provisioned Throughput offers lower per-token rates in exchange for a capacity commitment.</p>
<p><strong>Monitor token usage</strong>. Set up CloudWatch alarms to alert you if usage spikes unexpectedly, which could indicate an issue with your agent's logic or a potential abuse.</p>
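<p>For example, an alarm on hourly output-token usage might be defined like this. The metric and dimension names assume the AWS/Bedrock CloudWatch namespace (check the current metric names), and the threshold, alarm name, and SNS topic ARN are illustrative:</p>
<pre><code class="lang-python"># Sketch: alarm if output token usage for a model spikes.
# Threshold, alarm name, and SNS topic are illustrative placeholders.
alarm_params = {
    "AlarmName": "bedrock-output-tokens-spike",
    "Namespace": "AWS/Bedrock",
    "MetricName": "OutputTokenCount",
    "Dimensions": [{"Name": "ModelId", "Value": "amazon.nova-pro-v1:0"}],
    "Statistic": "Sum",
    "Period": 3600,            # evaluate hourly totals
    "EvaluationPeriods": 1,
    "Threshold": 5_000_000,    # tokens per hour before alerting
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
</code></pre>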
<p>As for performance, agent orchestration adds latency because of the multiple steps involved. A simple query might take 2-3 seconds, while a complex one requiring multiple tool calls could take 10+ seconds. Be upfront with users about this latency, and consider implementing a streaming interface to show intermediate progress.</p>
<h2 id="heading-advanced-implementation-patterns">Advanced Implementation Patterns</h2>
<p>Beyond the basics, there are several advanced patterns that can enhance your agents' capabilities and efficiency.</p>
<p><strong>Custom Prompt Templates</strong>: The default Bedrock templates work well, but customizing them gives you more control. For example, you might modify the orchestration template to include specific reasoning steps or decision criteria:</p>
<pre><code class="lang-plaintext">Given the user's request and available tools, determine the best course of action by:
1. Identifying the specific information or task the user is requesting
2. Checking if you already have all necessary information in the context
3. If not, selecting the appropriate tool or asking a clarifying question
4. Once you have all information, providing a concise answer

Remember:
- Only use tools when necessary, not for information already provided
- Always verify flight details before proceeding with any booking
- If multiple actions are needed, handle them one at a time
</code></pre>
<p><strong>Model Cascading</strong>: You can implement a multi-tier approach where simple queries get handled by lightweight models and only complex ones escalate to more powerful models. This isn't built into Bedrock directly, but you can create a router function that analyzes incoming queries and dispatches them to different agents powered by different models.</p>
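<p>A router like that doesn't have to be sophisticated; even a crude heuristic can cut costs. Here's an illustrative sketch, where the length threshold and keyword list are arbitrary starting points you'd tune for your traffic:</p>
<pre><code class="lang-python"># Crude illustrative router: send short, simple queries to Nova Micro,
# everything else to Nova Pro. Thresholds and hints are arbitrary starting points.
COMPLEX_HINTS = ("compare", "plan", "analyze", "step by step", "why")

def pick_model(query):
    text = query.lower()
    # Long queries or ones with "reasoning" keywords go to the larger model
    if len(text) > 200 or any(h in text for h in COMPLEX_HINTS):
        return "amazon.nova-pro-v1:0"
    return "amazon.nova-micro-v1:0"

print(pick_model("What's the capital of France?"))
# prints: amazon.nova-micro-v1:0
print(pick_model("Plan a 5-day trip and compare hotel costs step by step"))
# prints: amazon.nova-pro-v1:0
</code></pre>
<p>In production you'd more likely use a cheap model invocation (or a confidence score from the first attempt) as the routing signal rather than keyword matching.</p>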
<p><strong>Chain of Agents</strong>: For complex workflows, you might create multiple specialized agents that work together. For example, a travel planning system might have separate agents for flight search, hotel recommendations, and itinerary creation. A controller coordinates between these agents, passing information between them as needed.</p>
<p><strong>Hybrid RAG Approaches</strong>: While basic RAG works well, advanced implementations might combine multiple retrieval strategies. For instance, you could implement a system that first attempts semantic search, then falls back to keyword search if the results aren't satisfactory. This can be implemented by customizing your Lambda functions that process knowledge base results.</p>
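<p>The fallback logic in such a Lambda might look like this sketch, where the two retriever functions are stand-ins for your actual semantic and keyword search calls, and the thresholds are assumptions to tune:</p>
<pre><code class="lang-python"># Illustrative fallback: try semantic search first, fall back to keyword
# search when there are too few confident hits. Retrievers are stand-ins.
def hybrid_search(query, semantic_search, keyword_search,
                  min_results=3, min_score=0.5):
    hits = semantic_search(query)
    good = [h for h in hits if h["score"] >= min_score]
    if len(good) >= min_results:
        return good
    # Not enough confident semantic hits: merge in keyword results,
    # skipping documents we already have
    seen = {h["id"] for h in good}
    for h in keyword_search(query):
        if h["id"] not in seen:
            good.append(h)
    return good
</code></pre>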
<p><strong>Integration with Human Workflows</strong>: For high-stakes scenarios, consider integrating human review into the agent's workflow. The agent can handle routine cases autonomously but elevate complex or risky cases to human reviewers. This requires additional orchestration logic, typically implemented through Step Functions or a similar workflow service.</p>
<h2 id="heading-security-and-access-control">Security and Access Control</h2>
<p>Security is particularly important for agentic applications because they actively invoke services and access data. Getting this wrong means your agent could potentially do things you never intended.</p>
<p>The cornerstone of Bedrock Agent security is IAM. Each agent operates with an IAM execution role that defines what AWS resources it can access. Follow the principle of least privilege rigidly: grant only the specific permissions needed for the agent's functions and nothing more.</p>
<p>Here's an example IAM policy for an agent that only needs to call two specific Lambda functions:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"lambda:InvokeFunction"</span>,
            <span class="hljs-attr">"Resource"</span>: [
                <span class="hljs-string">"arn:aws:lambda:us-east-1:123456789012:function:FlightSearchFunction"</span>,
                <span class="hljs-string">"arn:aws:lambda:us-east-1:123456789012:function:HotelSearchFunction"</span>
            ]
        }
    ]
}
</code></pre>
<p>Additionally, apply resource-based policies on your Lambda functions to ensure they can only be invoked by your Bedrock Agent:</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Principal"</span>: {
                <span class="hljs-attr">"Service"</span>: <span class="hljs-string">"bedrock.amazonaws.com"</span>
            },
            <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"lambda:InvokeFunction"</span>,
            <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"arn:aws:lambda:us-east-1:123456789012:function:FlightSearchFunction"</span>,
            <span class="hljs-attr">"Condition"</span>: {
                <span class="hljs-attr">"StringEquals"</span>: {
                    <span class="hljs-attr">"AWS:SourceAccount"</span>: <span class="hljs-string">"123456789012"</span>
                }
            }
        }
    ]
}
</code></pre>
<p>For Lambda functions that access sensitive data or services, implement additional validation. Don't assume that because your agent is well-behaved, the data it passes to your functions will be well-formed or safe. Validate everything.</p>
<p>If your agent processes personal or sensitive information, consider:</p>
<ul>
<li><p>Using Bedrock Guardrails to filter inappropriate content</p>
</li>
<li><p>Implementing PII detection and masking in your Lambda functions</p>
</li>
<li><p>Encrypting sensitive data at rest and in transit</p>
</li>
<li><p>Setting up comprehensive logging and auditing</p>
</li>
</ul>
<p>If your agent acts on behalf of specific users, ensure user identity and permissions are properly propagated. One approach is to pass user tokens through the agent's session attributes and have your Lambda functions validate these tokens before accessing user-specific resources.</p>
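<p>A minimal sketch of that pattern: the caller passes the token through the agent's session state, and the action group Lambda refuses to act without it. The attribute name <code>userToken</code> is a hypothetical choice, and the token value below is a placeholder:</p>
<pre><code class="lang-python"># Caller side: pass the user's token through the agent's session state.
# (Hypothetical attribute name "userToken"; invoking requires boto3/credentials.)
invoke_kwargs = {
    "agentId": "AGENT_ID",
    "agentAliasId": "ALIAS_ID",
    "sessionId": "session-123",
    "inputText": "Book my usual flight",
    "sessionState": {"sessionAttributes": {"userToken": "PLACEHOLDER_JWT"}},
}
# client.invoke_agent(**invoke_kwargs)

# Lambda side: Bedrock forwards session attributes in the event.
def get_user_token(event):
    token = (event.get("sessionAttributes") or {}).get("userToken")
    if not token:
        raise PermissionError("missing user token; refusing to act")
    return token  # validate it (signature, expiry) before touching user data
</code></pre>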
<h2 id="heading-conclusion-the-future-of-agentic-applications-on-aws">Conclusion: The Future of Agentic Applications on AWS</h2>
<p>Agentic applications represent a significant step forward in what's possible with AI. By combining the reasoning capabilities of foundation models with the ability to take actions in the real world, these systems can handle complex tasks that would be impossible for traditional applications.</p>
<p>Amazon Bedrock and the Nova model family provide a robust platform for building these applications. You get the benefit of managed infrastructure and powerful foundation models, while retaining the flexibility to integrate with your existing AWS services and data.</p>
<p>The patterns we've explored in this article, from action groups and knowledge bases to security controls and cost optimizations, aren't just theoretical. They're being applied today in customer service, enterprise productivity, data analysis, and many other domains.</p>
<p>As you start exploring this space, remember that building effective agents requires balancing several factors: technical capability, user experience, security, and cost. The most successful implementations are those that get this balance right for their specific use case.</p>
<p>While the technology is powerful, it's not magic. Agents have limitations: they may sometimes misunderstand requests, take longer than expected to complete tasks, or struggle with highly complex workflows. Set realistic expectations with your users, and design your applications to gracefully handle these edge cases.</p>
<p>Despite these challenges, the potential is enormous. As foundation models continue to improve and AWS enhances the Bedrock platform, the possibilities for intelligent, autonomous applications will only expand. The agents you build today are just the beginning of a new approach to software that's more capable, more contextual, and more helpful than ever before.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Building Reliable Messaging Patterns in AWS with SQS and SNS]]></title><description><![CDATA[Building distributed systems requires putting a lot of attention on communication between components. These components often need to exchange information asynchronously, and that's where message queues and pub/sub systems are the go-to solution. AWS ...]]></description><link>https://blog.guilleojeda.com/building-reliable-messaging-patterns-in-aws-with-sqs-and-sns</link><guid isPermaLink="true">https://blog.guilleojeda.com/building-reliable-messaging-patterns-in-aws-with-sqs-and-sns</guid><category><![CDATA[AWS]]></category><category><![CDATA[architecture]]></category><category><![CDATA[distributed system]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Fri, 20 Dec 2024 15:20:38 GMT</pubDate><content:encoded><![CDATA[<p>Building distributed systems requires putting a lot of attention on communication between components. These components often need to exchange information asynchronously, and that's where message queues and pub/sub systems are the go-to solution. AWS provides two core services for this purpose: <strong>Simple Queue Service (SQS)</strong> and <strong>Simple Notification Service (SNS)</strong>. While these managed services handle the fundamental mechanics of message delivery, you need to understand how to configure them to build reliable distributed systems.</p>
<p>This article explores those configuration details, as well as practical patterns for implementing reliable messaging using SQS and SNS. We'll examine how these services work together, talk about error handling strategies, and learn how to scale messaging infrastructure effectively.</p>
<p>The examples in this article use Node.js, but the patterns apply to any programming language with an AWS SDK.</p>
<h2 id="heading-understanding-aws-messaging-services">Understanding AWS Messaging Services</h2>
<p>AWS messaging services solve different aspects of the distributed communication problem. <strong>SQS</strong> provides managed message queues that enable point-to-point communication between components. When a producer sends a message to an SQS queue, that message is delivered to <strong>a single consumer at a time</strong> (though, as we'll see below, standard queues may occasionally deliver it more than once). This behavior makes SQS ideal for workload distribution and task processing.</p>
<p>Here's how to create a basic SQS queue:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> standardQueueConfig = {
    <span class="hljs-attr">QueueName</span>: <span class="hljs-string">'order-processing-queue'</span>,
    <span class="hljs-attr">Attributes</span>: {
        <span class="hljs-attr">MessageRetentionPeriod</span>: <span class="hljs-string">'1209600'</span>,
        <span class="hljs-attr">ReceiveMessageWaitTimeSeconds</span>: <span class="hljs-string">'20'</span>,
        <span class="hljs-attr">VisibilityTimeout</span>: <span class="hljs-string">'30'</span>
    }
};
</code></pre>
<p>The configuration above defines how your queue is going to behave. Message retention period determines how long messages remain available if not processed, in this case 14 days. The receive message wait time enables long polling, reducing empty responses and unnecessary API calls. Visibility timeout specifies how long a message remains hidden during processing, preventing multiple consumers from processing the same message simultaneously.</p>
<p>SQS offers two queue types: <strong>Standard and FIFO (First-In-First-Out)</strong>. Standard queues provide "at-least-once" delivery and support nearly unlimited throughput, but messages may occasionally be delivered out of order or more than once. FIFO queues, on the other hand, guarantee exactly-once processing and strict message ordering, but with limited throughput - 3,000 messages per second with batching, or 300 without.</p>
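<p>The at-least-once guarantee has a practical consequence: consumers of standard queues should be idempotent, so that processing the same message twice doesn't corrupt state. Here's a minimal sketch of my own (in production you'd track processed IDs in a durable store such as DynamoDB, not in memory):</p>

```javascript
// Wrap a message handler so duplicate deliveries of the same MessageId
// are detected and skipped instead of being processed twice.
const makeIdempotentProcessor = (handler) => {
    const processed = new Set();
    return async (message) => {
        if (processed.has(message.MessageId)) {
            return { skipped: true };  // duplicate delivery, already handled
        }
        const result = await handler(message);
        processed.add(message.MessageId);
        return { skipped: false, result };
    };
};
```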
<p>FIFO queues require additional configuration:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> fifoQueueConfig = {
    <span class="hljs-attr">QueueName</span>: <span class="hljs-string">'order-processing-queue.fifo'</span>,
    <span class="hljs-attr">Attributes</span>: {
        <span class="hljs-attr">FifoQueue</span>: <span class="hljs-string">'true'</span>,
        <span class="hljs-attr">ContentBasedDeduplication</span>: <span class="hljs-string">'true'</span>,
        <span class="hljs-attr">MessageRetentionPeriod</span>: <span class="hljs-string">'1209600'</span>,
        <span class="hljs-attr">ReceiveMessageWaitTimeSeconds</span>: <span class="hljs-string">'20'</span>,
        <span class="hljs-attr">VisibilityTimeout</span>: <span class="hljs-string">'30'</span>
    }
};
</code></pre>
<p>The .fifo suffix in the queue name is mandatory for FIFO queues. Content-based deduplication automatically detects and removes duplicate messages based on their content, though you can also provide explicit deduplication IDs if needed.</p>
<p><strong>SNS</strong>, meanwhile, implements the <strong>publish-subscribe</strong> pattern. Messages sent to an SNS topic are delivered to multiple subscribers simultaneously. This makes SNS ideal for broadcasting notifications, implementing event-driven architectures, and decoupling services. When a message arrives at an SNS topic, it fans out to all subscribed endpoints immediately.</p>
<p>Creating an SNS topic involves specifying its name and basic attributes, such as server-side encryption. Note that message filtering is configured on individual subscriptions, not on the topic itself:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> topicConfig = {
    <span class="hljs-attr">Name</span>: <span class="hljs-string">'order-events'</span>,
    <span class="hljs-attr">Attributes</span>: {
        <span class="hljs-attr">KmsMasterKeyId</span>: <span class="hljs-string">'alias/aws/sns'</span>
    }
};
</code></pre>
<p>Message filtering in SNS deserves special attention because it can significantly reduce unnecessary processing. Rather than forcing every subscriber to receive and filter messages themselves, SNS can filter messages at the service level based on message attributes:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// Message filtering configuration</span>
<span class="hljs-keyword">const</span> filterPolicy = {
    <span class="hljs-attr">eventType</span>: [<span class="hljs-string">'order_created'</span>],
    <span class="hljs-attr">priority</span>: [<span class="hljs-string">'HIGH'</span>],
    <span class="hljs-attr">region</span>: [<span class="hljs-string">'us-east-1'</span>, <span class="hljs-string">'us-west-2'</span>]
};

<span class="hljs-keyword">const</span> subscriptionAttributes = {
    <span class="hljs-attr">FilterPolicy</span>: <span class="hljs-built_in">JSON</span>.stringify(filterPolicy)
};
</code></pre>
<p>When applied to a subscription, this filter ensures the subscriber only receives messages matching specific criteria. This filtering happens before message delivery, reducing both processing overhead and potential costs.</p>
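<p>To make the matching semantics concrete, here's an approximation of how SNS evaluates a simple exact-match string policy. This is my own sketch of the logic, which SNS runs server-side; real filter policies also support operators like prefix matching, anything-but, and numeric ranges:</p>

```javascript
// A message matches when, for every key in the filter policy, the message
// carries that attribute and its value is one of the allowed values.
const matchesFilterPolicy = (filterPolicy, messageAttributes) =>
    Object.entries(filterPolicy).every(([key, allowedValues]) => {
        const attribute = messageAttributes[key];
        return attribute !== undefined
            && allowedValues.includes(attribute.StringValue);
    });

const examplePolicy = {
    eventType: ['order_created'],
    priority: ['HIGH']
};

matchesFilterPolicy(examplePolicy, {
    eventType: { DataType: 'String', StringValue: 'order_created' },
    priority: { DataType: 'String', StringValue: 'HIGH' }
}); // matches: delivered to the subscriber

matchesFilterPolicy(examplePolicy, {
    eventType: { DataType: 'String', StringValue: 'order_cancelled' },
    priority: { DataType: 'String', StringValue: 'HIGH' }
}); // no match: filtered out before delivery
```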
<h2 id="heading-implementing-reliable-queue-processing">Implementing Reliable Queue Processing</h2>
<p>Processing messages reliably requires paying special attention to several aspects of the messaging lifecycle.</p>
<p>First, let's look at a basic but reliable message processor:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> processQueue = <span class="hljs-keyword">async</span> (queueUrl) =&gt; {
    <span class="hljs-keyword">const</span> receiveParams = {
        <span class="hljs-attr">QueueUrl</span>: queueUrl,
        <span class="hljs-attr">MaxNumberOfMessages</span>: <span class="hljs-number">10</span>,
        <span class="hljs-attr">WaitTimeSeconds</span>: <span class="hljs-number">20</span>,
        <span class="hljs-attr">MessageAttributeNames</span>: [<span class="hljs-string">'All'</span>]
    };

    <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> sqs.receiveMessage(receiveParams).promise();

        <span class="hljs-keyword">if</span> (!data.Messages) {
            <span class="hljs-keyword">return</span>;
        }

        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> message <span class="hljs-keyword">of</span> data.Messages) {
            <span class="hljs-keyword">try</span> {
                <span class="hljs-keyword">const</span> body = <span class="hljs-built_in">JSON</span>.parse(message.Body);
                <span class="hljs-keyword">await</span> processMessageByType(body);
                <span class="hljs-keyword">await</span> deleteMessage(queueUrl, message.ReceiptHandle);

                <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Successfully processed message <span class="hljs-subst">${message.MessageId}</span>`</span>);
            } <span class="hljs-keyword">catch</span> (error) {
                <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error processing message <span class="hljs-subst">${message.MessageId}</span>:`</span>, error);
            }
        }
    } <span class="hljs-keyword">catch</span> (error) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error receiving messages:'</span>, error);
    }
};
</code></pre>
<p>This implementation includes several important reliability features. Long polling reduces unnecessary API calls while ensuring timely message processing. Batch message processing improves throughput and reduces costs. Error handling at both the receive and process levels ensures that failures don't crash the processor. Messages are only deleted after successful processing, ensuring no message is lost due to processing failures.</p>
<p>However, reliable message processing requires more than just careful implementation. We need to handle messages that consistently fail processing, implement proper monitoring, and ensure our system scales appropriately.</p>
<h2 id="heading-handling-failed-messages-with-dead-letter-queues">Handling Failed Messages with Dead Letter Queues</h2>
<p>Messages that can't be processed successfully after multiple attempts need special handling. <strong>Dead Letter Queues (DLQs)</strong> provide a way to isolate these problematic messages for analysis and potential reprocessing. Here's how to implement a good DLQ strategy:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> dlqConfig = {
    <span class="hljs-attr">QueueName</span>: <span class="hljs-string">'order-processing-dlq'</span>,
    <span class="hljs-attr">Attributes</span>: {
        <span class="hljs-attr">MessageRetentionPeriod</span>: <span class="hljs-string">'1209600'</span>
    }
};

<span class="hljs-keyword">const</span> mainQueueConfig = {
    <span class="hljs-attr">QueueName</span>: <span class="hljs-string">'order-processing'</span>,
    <span class="hljs-attr">Attributes</span>: {
        <span class="hljs-attr">RedrivePolicy</span>: <span class="hljs-built_in">JSON</span>.stringify({
            <span class="hljs-attr">deadLetterTargetArn</span>: dlqArn,
            <span class="hljs-attr">maxReceiveCount</span>: <span class="hljs-number">3</span>
        })
    }
};
</code></pre>
<p>The redrive policy automatically moves messages to the DLQ after multiple failed processing attempts. This prevents infinite processing loops while preserving failed messages for analysis. The maxReceiveCount parameter determines how many processing attempts are allowed before a message moves to the DLQ.</p>
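<p>The redrive policy caps how many times a message is retried, but it doesn't control how quickly those retries happen. One option, a sketch of my own rather than a built-in SQS feature, is to grow the message's visibility timeout with its receive count, so each retry backs off exponentially up to the 12-hour visibility maximum:</p>

```javascript
// Exponential backoff between redelivery attempts: before letting a failed
// message become visible again, raise its visibility timeout based on how
// many times it has already been received. 43200 seconds is the SQS maximum.
const backoffSeconds = (receiveCount, baseSeconds = 30, maxSeconds = 43200) =>
    Math.min(baseSeconds * 2 ** (receiveCount - 1), maxSeconds);

// Parameters for sqs.changeMessageVisibility(); receiveCount would come
// from the message's ApproximateReceiveCount system attribute
const visibilityParams = (queueUrl, receiptHandle, receiveCount) => ({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: backoffSeconds(receiveCount)
});
```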
<p>Processing messages from a DLQ requires a couple of changes:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> processDLQ = <span class="hljs-keyword">async</span> (dlqUrl) =&gt; {
    <span class="hljs-keyword">const</span> params = {
        <span class="hljs-attr">QueueUrl</span>: dlqUrl,
        <span class="hljs-attr">MaxNumberOfMessages</span>: <span class="hljs-number">10</span>,
        <span class="hljs-attr">WaitTimeSeconds</span>: <span class="hljs-number">20</span>,
        <span class="hljs-attr">AttributeNames</span>: [<span class="hljs-string">'All'</span>],
        <span class="hljs-attr">MessageAttributeNames</span>: [<span class="hljs-string">'All'</span>]
    };

    <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> sqs.receiveMessage(params).promise();

        <span class="hljs-keyword">if</span> (!data.Messages) {
            <span class="hljs-keyword">return</span>;
        }

        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> message <span class="hljs-keyword">of</span> data.Messages) {
            <span class="hljs-keyword">try</span> {
                <span class="hljs-keyword">const</span> failureAnalysis = <span class="hljs-keyword">await</span> analyzeFailure(message);

                <span class="hljs-keyword">if</span> (failureAnalysis.isRecoverable) {
                    <span class="hljs-keyword">await</span> returnToMainQueue(message);
                } <span class="hljs-keyword">else</span> {
                    <span class="hljs-keyword">await</span> storeFailedMessage(message);
                }

                <span class="hljs-keyword">await</span> deleteMessage(dlqUrl, message.ReceiptHandle);
            } <span class="hljs-keyword">catch</span> (error) {
                <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error processing DLQ message:'</span>, error);
            }
        }
    } <span class="hljs-keyword">catch</span> (error) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error receiving DLQ messages:'</span>, error);
    }
};

<span class="hljs-keyword">const</span> analyzeFailure = <span class="hljs-keyword">async</span> (message) =&gt; {
    <span class="hljs-comment">// SentTimestamp and ApproximateReceiveCount are system attributes,</span>
    <span class="hljs-comment">// exposed under message.Attributes (not message.MessageAttributes)</span>
    <span class="hljs-keyword">const</span> attributes = message.Attributes;
    <span class="hljs-keyword">const</span> messageAge = <span class="hljs-built_in">Date</span>.now() - <span class="hljs-built_in">parseInt</span>(attributes.SentTimestamp);
    <span class="hljs-keyword">const</span> failureCount = <span class="hljs-built_in">parseInt</span>(attributes.ApproximateReceiveCount);

    <span class="hljs-keyword">return</span> {
        <span class="hljs-attr">isRecoverable</span>: messageAge &lt; <span class="hljs-number">86400000</span> &amp;&amp; failureCount &lt; <span class="hljs-number">5</span>, <span class="hljs-comment">// under 24 hours old</span>
        <span class="hljs-attr">failureReason</span>: determineFailureReason(message)
    };
};
</code></pre>
<p>This implementation analyzes failed messages to determine if they're recoverable based on their age and failure count. Recoverable messages can be returned to the main queue for reprocessing, while permanently failed messages are stored for further analysis.</p>
<h2 id="heading-monitoring-and-observability">Monitoring and Observability</h2>
<p>A reliable messaging system requires good monitoring to detect and respond to issues before they impact your applications. <strong>Amazon CloudWatch</strong> provides basic metrics for both SQS and SNS, but effective monitoring requires understanding which metrics actually matter and how to interpret them.</p>
<p>For SQS queues, the ApproximateNumberOfMessages metric indicates how many messages are available for retrieval. However, this number alone doesn't tell the whole story. You also need to monitor ApproximateNumberOfMessagesNotVisible, which shows messages currently being processed, and ApproximateAgeOfOldestMessage, which can indicate processing backlogs or stalled consumers.</p>
<p>Here's how to set up basic queue monitoring:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> setupQueueMonitoring = <span class="hljs-keyword">async</span> (queueUrl) =&gt; {
    <span class="hljs-keyword">const</span> alarmConfig = {
        <span class="hljs-attr">AlarmName</span>: <span class="hljs-string">'QueueMessageAge'</span>,
        <span class="hljs-attr">AlarmDescription</span>: <span class="hljs-string">'Alert when messages are getting old'</span>,
        <span class="hljs-attr">MetricName</span>: <span class="hljs-string">'ApproximateAgeOfOldestMessage'</span>,
        <span class="hljs-attr">Namespace</span>: <span class="hljs-string">'AWS/SQS'</span>,
        <span class="hljs-attr">Dimensions</span>: [{
            <span class="hljs-attr">Name</span>: <span class="hljs-string">'QueueName'</span>,
            <span class="hljs-attr">Value</span>: getQueueNameFromUrl(queueUrl)
        }],
        <span class="hljs-attr">Period</span>: <span class="hljs-number">300</span>,
        <span class="hljs-attr">EvaluationPeriods</span>: <span class="hljs-number">2</span>,
        <span class="hljs-attr">Threshold</span>: <span class="hljs-number">3600</span>,
        <span class="hljs-attr">ComparisonOperator</span>: <span class="hljs-string">'GreaterThanThreshold'</span>,
        <span class="hljs-attr">Statistic</span>: <span class="hljs-string">'Maximum'</span>
    };

    <span class="hljs-keyword">await</span> cloudwatch.putMetricAlarm(alarmConfig).promise();
};
</code></pre>
<p>This configuration alerts you when messages remain unprocessed for more than an hour, which might indicate processing issues. However, CloudWatch metrics alone often don't provide enough visibility into message processing. Custom metrics can provide deeper insights into your system's behavior:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> recordCustomMetrics = <span class="hljs-keyword">async</span> (message, processingResult) =&gt; {
    <span class="hljs-keyword">const</span> metrics = [
        {
            <span class="hljs-attr">MetricName</span>: <span class="hljs-string">'MessageProcessingTime'</span>,
            <span class="hljs-attr">Value</span>: processingResult.duration,
            <span class="hljs-attr">Unit</span>: <span class="hljs-string">'Milliseconds'</span>,
            <span class="hljs-attr">Dimensions</span>: [
                {
                    <span class="hljs-attr">Name</span>: <span class="hljs-string">'MessageType'</span>,
                    <span class="hljs-attr">Value</span>: message.attributes.messageType
                },
                {
                    <span class="hljs-attr">Name</span>: <span class="hljs-string">'Environment'</span>,
                    <span class="hljs-attr">Value</span>: process.env.ENVIRONMENT
                }
            ],
            <span class="hljs-attr">Timestamp</span>: <span class="hljs-keyword">new</span> <span class="hljs-built_in">Date</span>()
        }
    ];

    <span class="hljs-keyword">await</span> cloudwatch.putMetricData({
        <span class="hljs-attr">Namespace</span>: <span class="hljs-string">'CustomMessageProcessing'</span>,
        <span class="hljs-attr">MetricData</span>: metrics
    }).promise();
};
</code></pre>
<p>These custom metrics track processing time by message type, helping you identify performance patterns and potential bottlenecks. You might discover that certain message types consistently take longer to process or fail more frequently than others.</p>
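<p>Those same metrics can feed scaling decisions. As a rough heuristic of my own (not an AWS formula), you can estimate how many consumers you need from the visible backlog and your measured processing time per message:</p>

```javascript
// Estimate consumer count: backlog divided by how many messages one
// consumer can clear within the target drain time.
const requiredConsumers = ({ visibleMessages, avgProcessingMs, targetDrainSeconds }) => {
    const perConsumerCapacity = (targetDrainSeconds * 1000) / avgProcessingMs;
    return Math.max(1, Math.ceil(visibleMessages / perConsumerCapacity));
};

requiredConsumers({
    visibleMessages: 12000,     // ApproximateNumberOfMessages
    avgProcessingMs: 250,       // from the custom MessageProcessingTime metric
    targetDrainSeconds: 300     // drain the backlog within 5 minutes
});
// 300s / 250ms = 1200 messages per consumer -> 10 consumers
```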
<h2 id="heading-security-and-access-control">Security and Access Control</h2>
<p>Security in messaging systems isn't just authentication and authorization. It also involves encryption, access control, and secure cross-account communication. Both SQS and SNS support server-side encryption using AWS KMS, which should be enabled for sensitive data (or for any data, really):</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> setupQueueEncryption = <span class="hljs-keyword">async</span> (queueUrl) =&gt; {
    <span class="hljs-keyword">const</span> attributes = {
        <span class="hljs-attr">QueueUrl</span>: queueUrl,
        <span class="hljs-attr">Attributes</span>: {
            <span class="hljs-attr">KmsMasterKeyId</span>: <span class="hljs-string">'alias/aws/sqs'</span>,
            <span class="hljs-attr">Policy</span>: <span class="hljs-built_in">JSON</span>.stringify({
                <span class="hljs-attr">Version</span>: <span class="hljs-string">'2012-10-17'</span>,
                <span class="hljs-attr">Statement</span>: [{
                    <span class="hljs-attr">Effect</span>: <span class="hljs-string">'Deny'</span>,
                    <span class="hljs-attr">Principal</span>: <span class="hljs-string">'*'</span>,
                    <span class="hljs-attr">Action</span>: <span class="hljs-string">'SQS:*'</span>,
                    <span class="hljs-attr">Resource</span>: queueArn,
                    <span class="hljs-attr">Condition</span>: {
                        <span class="hljs-attr">Bool</span>: {
                            <span class="hljs-string">'aws:SecureTransport'</span>: <span class="hljs-literal">false</span>
                        }
                    }
                }]
            })
        }
    };

    <span class="hljs-keyword">await</span> sqs.setQueueAttributes(attributes).promise();
};
</code></pre>
<p>Always remember the principle of least privilege. Producer services should only have permission to send messages, while consumer services should only have permission to receive and delete messages:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> producerPolicy = {
    <span class="hljs-attr">Version</span>: <span class="hljs-string">'2012-10-17'</span>,
    <span class="hljs-attr">Statement</span>: [{
        <span class="hljs-attr">Effect</span>: <span class="hljs-string">'Allow'</span>,
        <span class="hljs-attr">Action</span>: [
            <span class="hljs-string">'sqs:SendMessage'</span>,
            <span class="hljs-string">'sqs:GetQueueUrl'</span>
        ],
        <span class="hljs-attr">Resource</span>: queueArn,
        <span class="hljs-attr">Condition</span>: {
            <span class="hljs-attr">ArnLike</span>: {
                <span class="hljs-string">'aws:SourceArn'</span>: producerServiceArn
            }
        }
    }]
};
</code></pre>
<p>Cross-account messaging adds a bit of complexity. When services in different AWS accounts need to communicate, you must configure both the sender's IAM permissions and the receiving queue's resource policy:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> crossAccountQueuePolicy = {
    <span class="hljs-attr">Version</span>: <span class="hljs-string">'2012-10-17'</span>,
    <span class="hljs-attr">Statement</span>: [{
        <span class="hljs-attr">Effect</span>: <span class="hljs-string">'Allow'</span>,
        <span class="hljs-attr">Principal</span>: {
            <span class="hljs-attr">AWS</span>: sourceAccountArn
        },
        <span class="hljs-attr">Action</span>: <span class="hljs-string">'sqs:SendMessage'</span>,
        <span class="hljs-attr">Resource</span>: queueArn,
        <span class="hljs-attr">Condition</span>: {
            <span class="hljs-attr">StringEquals</span>: {
                <span class="hljs-string">'aws:SourceAccount'</span>: sourceAccountId
            }
        }
    }]
};
</code></pre>
<h2 id="heading-advanced-messaging-patterns">Advanced Messaging Patterns</h2>
<p>There will come a time when what I've shown above isn't enough for your system. Let's explore some advanced patterns that address common distributed system challenges.</p>
<p>Message batching can significantly improve throughput and reduce costs. However, implementing batching requires you to be mindful of how you handle failures and timeouts:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> batchProcessor = <span class="hljs-keyword">async</span> (queueUrl, messages, processor) =&gt; {
    <span class="hljs-keyword">const</span> messageGroups = messages.reduce(<span class="hljs-function">(<span class="hljs-params">groups, message</span>) =&gt;</span> {
        <span class="hljs-keyword">const</span> type = message.MessageAttributes.Type.StringValue;
        groups[type] = groups[type] || [];
        groups[type].push(message);
        <span class="hljs-keyword">return</span> groups;
    }, {});

    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> [type, groupMessages] <span class="hljs-keyword">of</span> <span class="hljs-built_in">Object</span>.entries(messageGroups)) {
        <span class="hljs-keyword">try</span> {
            <span class="hljs-keyword">await</span> processor(groupMessages, type);
            <span class="hljs-keyword">await</span> batchDeleteMessages(queueUrl, groupMessages);
        } <span class="hljs-keyword">catch</span> (error) {
            <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error processing message group <span class="hljs-subst">${type}</span>:`</span>, error);

            <span class="hljs-comment">// Handle partial batch failures by deleting successful messages</span>
            <span class="hljs-keyword">if</span> (error.partialSuccess) {
                <span class="hljs-keyword">await</span> batchDeleteMessages(queueUrl, error.successfulMessages);
            }
        }
    }
};
</code></pre>
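<p>The <code>batchDeleteMessages</code> helper above is left undefined; here's one way to sketch it (assuming an initialized SQS client named <code>sqs</code>, as in the other snippets). <code>DeleteMessageBatch</code> accepts at most 10 entries per call, so larger groups have to be chunked first:</p>

```javascript
// Split an array into consecutive slices of at most `size` elements
const chunk = (items, size) => {
    const chunks = [];
    for (let i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
};

const batchDeleteMessages = async (queueUrl, messages) => {
    for (const group of chunk(messages, 10)) {
        const result = await sqs.deleteMessageBatch({
            QueueUrl: queueUrl,
            Entries: group.map((message, index) => ({
                Id: String(index),  // must be unique within this batch
                ReceiptHandle: message.ReceiptHandle
            }))
        }).promise();
        // DeleteMessageBatch reports per-entry failures instead of throwing
        if (result.Failed && result.Failed.length > 0) {
            console.error('Failed to delete messages:', result.Failed);
        }
    }
};
```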
<p>When messages must be processed in order, such as in event sourcing systems, you need to implement ordering guarantees even with standard queues:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> orderDependentProcessor = <span class="hljs-keyword">async</span> (queueUrl) =&gt; {
    <span class="hljs-keyword">const</span> messageCache = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>();
    <span class="hljs-keyword">const</span> processingOrder = [];

    <span class="hljs-keyword">const</span> processMessageIfReady = <span class="hljs-keyword">async</span> (message) =&gt; {
        <span class="hljs-keyword">const</span> sequenceNumber = <span class="hljs-built_in">parseInt</span>(
            message.MessageAttributes.SequenceNumber.StringValue
        );

        <span class="hljs-keyword">if</span> (sequenceNumber !== processingOrder.length + <span class="hljs-number">1</span>) {
            messageCache.set(sequenceNumber, message);
            <span class="hljs-keyword">return</span>;
        }

        <span class="hljs-keyword">await</span> processMessage(message);
        processingOrder.push(sequenceNumber);

        <span class="hljs-keyword">let</span> nextSequence = sequenceNumber + <span class="hljs-number">1</span>;
        <span class="hljs-keyword">while</span> (messageCache.has(nextSequence)) {
            <span class="hljs-keyword">const</span> nextMessage = messageCache.get(nextSequence);
            messageCache.delete(nextSequence);
            <span class="hljs-keyword">await</span> processMessage(nextMessage);
            processingOrder.push(nextSequence);
            nextSequence++;
        }
    };

    <span class="hljs-keyword">return</span> processMessageIfReady;
};
</code></pre>
<p>Circuit breakers protect downstream services from cascade failures. In messaging systems, a circuit breaker can keep queue processors from overwhelming a struggling dependency, isolating the failure and preventing it from bringing down the entire system:</p>
<pre><code class="lang-javascript"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MessageProcessorCircuitBreaker</span> </span>{
    <span class="hljs-keyword">constructor</span>(failureThreshold = 5, resetTimeout = 60000) {
        <span class="hljs-built_in">this</span>.failureCount = <span class="hljs-number">0</span>;
        <span class="hljs-built_in">this</span>.failureThreshold = failureThreshold;
        <span class="hljs-built_in">this</span>.resetTimeout = resetTimeout;
        <span class="hljs-built_in">this</span>.lastFailureTime = <span class="hljs-literal">null</span>;
        <span class="hljs-built_in">this</span>.state = <span class="hljs-string">'CLOSED'</span>;
    }

    <span class="hljs-keyword">async</span> processMessage(message, processor) {
        <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.state === <span class="hljs-string">'OPEN'</span>) {
            <span class="hljs-keyword">if</span> (<span class="hljs-built_in">Date</span>.now() - <span class="hljs-built_in">this</span>.lastFailureTime &gt;= <span class="hljs-built_in">this</span>.resetTimeout) {
                <span class="hljs-built_in">this</span>.state = <span class="hljs-string">'HALF_OPEN'</span>;
            } <span class="hljs-keyword">else</span> {
                <span class="hljs-keyword">throw</span> <span class="hljs-keyword">new</span> <span class="hljs-built_in">Error</span>(<span class="hljs-string">'Circuit breaker is OPEN'</span>);
            }
        }

        <span class="hljs-keyword">try</span> {
            <span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> processor(message);

            <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.state === <span class="hljs-string">'HALF_OPEN'</span>) {
                <span class="hljs-built_in">this</span>.state = <span class="hljs-string">'CLOSED'</span>;
                <span class="hljs-built_in">this</span>.failureCount = <span class="hljs-number">0</span>;
            }

            <span class="hljs-keyword">return</span> result;
        } <span class="hljs-keyword">catch</span> (error) {
            <span class="hljs-built_in">this</span>.handleFailure();
            <span class="hljs-keyword">throw</span> error;
        }
    }

    handleFailure() {
        <span class="hljs-built_in">this</span>.failureCount++;
        <span class="hljs-built_in">this</span>.lastFailureTime = <span class="hljs-built_in">Date</span>.now();

        <span class="hljs-keyword">if</span> (<span class="hljs-built_in">this</span>.failureCount &gt;= <span class="hljs-built_in">this</span>.failureThreshold) {
            <span class="hljs-built_in">this</span>.state = <span class="hljs-string">'OPEN'</span>;
        }
    }
}
</code></pre>
<h2 id="heading-performance-and-cost-optimization">Performance and Cost Optimization</h2>
<p>Here's where we talk about service limits, efficient processing patterns, and cost management. Standard SQS queues offer virtually unlimited throughput, while FIFO queues have specific limits you need to be mindful of:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> scalingConfig = {
    <span class="hljs-attr">standardQueue</span>: {
        <span class="hljs-attr">batchSize</span>: <span class="hljs-number">10</span>,
        <span class="hljs-attr">concurrentExecutions</span>: <span class="hljs-number">1000</span>,
        <span class="hljs-attr">processingTimeout</span>: <span class="hljs-number">30</span>
    },
    <span class="hljs-attr">fifoQueue</span>: {
        <span class="hljs-attr">maxThroughput</span>: <span class="hljs-number">3000</span>,
        <span class="hljs-attr">batchSize</span>: <span class="hljs-number">10</span>,
        <span class="hljs-attr">messageGroupId</span>: <span class="hljs-string">'orderProcessing'</span>,
        <span class="hljs-attr">deduplicationId</span>: uuid.v4(),
        <span class="hljs-attr">processingTimeout</span>: <span class="hljs-number">30</span>
    }
};
</code></pre>
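<p>If your producers can approach the FIFO throughput quota, it's worth throttling on the client side instead of discovering the limit through throttling errors. Here's a sketch of a simple token bucket for that purpose (the implementation is my own; the 300 messages per second figure is the non-batched FIFO quota discussed above):</p>

```javascript
// Client-side token bucket: tokens refill continuously at `ratePerSecond`,
// capped at `capacity`. Each send consumes one token; when tryRemove()
// returns false, the caller should wait before sending.
class TokenBucket {
    constructor(ratePerSecond, capacity = ratePerSecond) {
        this.rate = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefill = Date.now();
    }

    tryRemove(count = 1) {
        const now = Date.now();
        // Refill proportionally to elapsed time, capped at capacity
        this.tokens = Math.min(
            this.capacity,
            this.tokens + ((now - this.lastRefill) / 1000) * this.rate
        );
        this.lastRefill = now;
        if (this.tokens >= count) {
            this.tokens -= count;
            return true;
        }
        return false;
    }
}

const sendBucket = new TokenBucket(300);  // non-batched FIFO sends
// Check sendBucket.tryRemove() before each sendMessage; back off when false
```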
<p>Cost optimization mostly involves balancing message retention, polling frequency, and batch processing. Long polling reduces API calls and associated costs:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> costOptimizedReceive = <span class="hljs-keyword">async</span> (queueUrl) =&gt; {
    <span class="hljs-keyword">const</span> params = {
        <span class="hljs-attr">QueueUrl</span>: queueUrl,
        <span class="hljs-attr">MaxNumberOfMessages</span>: <span class="hljs-number">10</span>,
        <span class="hljs-attr">WaitTimeSeconds</span>: <span class="hljs-number">20</span>,
        <span class="hljs-attr">AttributeNames</span>: [<span class="hljs-string">'SentTimestamp'</span>],
        <span class="hljs-attr">MessageAttributeNames</span>: [<span class="hljs-string">'MessageType'</span>]
    };

    <span class="hljs-keyword">return</span> <span class="hljs-keyword">await</span> sqs.receiveMessage(params).promise();
};
</code></pre>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Building reliable messaging systems isn't just creating SQS queues and SNS topics and calling it a day. It requires understanding how the services work, how to configure them, and how to use them effectively in distributed systems. Proper error handling, monitoring, and security are just a few of the things you need to be mindful of. The patterns and practices discussed here serve as a foundation for building robust messaging systems, but it's left as an exercise to the reader to adapt them to your specific requirements and constraints.</p>
<p>Remember that reliability in distributed systems isn't about preventing all failures. It's about handling failures gracefully when they occur. Testing your messaging patterns under different failure conditions will help ensure your system remains reliable even when components fail or become overloaded.</p>
<p>As with any system, how your components communicate should evolve with your requirements. Start with simple patterns and add complexity only when required. Monitor your system's behavior, understand your traffic patterns, and adjust your implementation accordingly.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Failover in Amazon RDS Multi-AZ Architectures]]></title><description><![CDATA[Database failures are inevitable. Even with the most reliable hardware and software, something will eventually break. AWS RDS Multi-AZ deployments promise to handle these failures gracefully, automatically failing over to a standby database when prob...]]></description><link>https://blog.guilleojeda.com/failover-in-amazon-rds-multi-az-architectures</link><guid isPermaLink="true">https://blog.guilleojeda.com/failover-in-amazon-rds-multi-az-architectures</guid><category><![CDATA[AWS]]></category><category><![CDATA[Databases]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Amazon Web Services]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Wed, 18 Dec 2024 23:22:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734563852647/ad80f153-23eb-4ff2-a476-0aa432f7e465.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Database failures are inevitable. Even with the most reliable hardware and software, something will eventually break. AWS RDS Multi-AZ deployments promise to handle these failures gracefully, automatically failing over to a standby database when problems occur. But like many things in distributed systems, the reality is more complex than the marketing suggests.</p>
<p>Let's dive deep into how RDS Multi-AZ really works, what happens during failover, and how to design your applications to handle it properly. Understanding these internals will help you build more reliable applications and troubleshoot issues when they occur.</p>
<h2 id="heading-understanding-amazon-rds-architecture">Understanding Amazon RDS Architecture</h2>
<p>Before we can understand Multi-AZ, we need to understand how RDS works under the hood. RDS is a complex distributed system that manages databases. When you create an RDS instance, you're actually getting several pieces working together.</p>
<p>At the core is an EC2 instance running your chosen database engine. This instance has EBS volumes attached to it for storage, and it's connected to your VPC through Elastic Network Interfaces. There's also a control plane running in AWS's infrastructure that manages everything from automated backups to failover decisions.</p>
<p>This separation between the control plane and data plane is crucial. The control plane runs in AWS's infrastructure, independently of your database instances. This means it can continue making decisions and taking actions even when your database instances are having problems. That's particularly important during failover scenarios.</p>
<p>The storage layer is equally important. Your data lives on EBS volumes, which operate independently from the EC2 instance running your database. This separation of compute and storage enables some of RDS's coolest features, including the storage-level replication that makes Multi-AZ work.</p>
<h2 id="heading-availability-zones-in-aws">Availability Zones in AWS</h2>
<p>AWS documentation often describes Availability Zones as "physically separated locations with independent power, networking, and cooling." That's true, but the key point is that they're engineered for complete failure isolation from other AZs.</p>
<p>AWS runs dedicated fiber connections between the AZs in a region, typically maintaining sub-millisecond latency with multiple redundant paths. This high-bandwidth, low-latency connectivity is what makes synchronous replication practical.</p>
<p>The network between AZs isn't part of the public internet. It's a dedicated network owned and operated by AWS, with quality of service controls that prioritize critical traffic like database replication. This matters because replication performance directly impacts how quickly your database can commit transactions in Multi-AZ deployments.</p>
<h2 id="heading-multi-az-approaches-in-amazon-rds">Multi-AZ Approaches in Amazon RDS</h2>
<p>RDS actually offers two different types of Multi-AZ deployments, and the differences matter. Traditional Multi-AZ deployments, which we'll focus on first, use a single primary instance with a standby replica. The newer Multi-AZ DB clusters use a primary instance with two readable standbys. The key difference isn't really the number of standbys, but how replication works.</p>
<p>In traditional Multi-AZ, replication happens at the storage level. When your database writes to disk, that write is synchronously replicated to the standby's EBS volumes before being acknowledged. The standby database instance runs in recovery mode, continuously applying changes it sees in the storage layer.</p>
<p>Multi-AZ DB clusters work differently, using the database engine's native replication. This means the standbys can serve read traffic, and it means replication has different performance characteristics and failure modes. The choice between these approaches depends on your specific needs for read scaling and consistency.</p>
<h2 id="heading-how-rds-multi-az-instance-replication-works">How RDS Multi-AZ Instance Replication Works</h2>
<p>When you write data to a Multi-AZ database, several things happen behind the scenes. First, your write operation arrives at the primary instance. The database engine processes it and writes to its local EBS volume. But before acknowledging the write back to your application, that write must be replicated.</p>
<p>The replication process is handled by EBS, not the database engine. EBS synchronously copies each 16KB block that changes to the standby's EBS volumes. When a write occurs, EBS maintains a replication queue for changed blocks. Each block is checksummed and tracked to ensure consistency between volumes. If the queue starts growing too large, RDS will throttle writes to prevent the standby from falling too far behind.</p>
<p>Behind the scenes, EBS also performs continuous consistency checking between volumes. If it detects inconsistent blocks, it will automatically repair them in the background. This process ensures that the standby's storage is truly a consistent copy of the primary, which is crucial for clean failovers.</p>
<p>Only after both the primary and standby volumes have persisted the changes will the write be acknowledged. This ensures zero data loss if a failover occurs, but it also adds latency to every write operation.</p>
<p>The standby instance runs in recovery mode, continuously monitoring its storage for changes and applying them to its internal state. This means it's ready to take over quickly if needed, but it can't serve queries or accept connections while it's in recovery mode.</p>
<p>The replication process adds latency to every write operation. In typical scenarios, you'll see an additional 0.5-1ms for same-AZ writes and 1-2ms for cross-AZ writes. Large writes can take longer, sometimes adding 2-5ms of latency. This might seem small, but it can add up in write-heavy workloads.</p>
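<p>To get a feel for what that overhead means for a strictly serial writer, where each commit waits for the previous one, a quick back-of-the-envelope calculation helps. The numbers below are illustrative, taken from the ranges above, not benchmarks:</p>
<pre><code class="lang-python">def max_serial_commit_rate(base_write_ms, replication_overhead_ms):
    """Upper bound on commits per second for a single serial writer."""
    total_ms = base_write_ms + replication_overhead_ms
    return 1000.0 / total_ms

# Illustrative: a 2 ms local write plus 1.5 ms of cross-AZ
# replication overhead caps a serial writer at under 300 commits/sec.
rate = max_serial_commit_rate(2.0, 1.5)
</code></pre>
<p>Concurrent connections commit in parallel, so real aggregate throughput is usually much higher; the point is that replication latency bounds each individual transaction, not the system as a whole.</p>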
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-anatomy-of-an-rds-failover">Anatomy of an RDS Failover</h2>
<p>A failover in RDS isn't a single operation, but a complex sequence of events that happens in several phases. When RDS detects a problem with the primary instance, it doesn't immediately fail over. Instead, it goes through a careful validation process to ensure the failover will succeed.</p>
<p>The detection phase involves multiple health checks. RDS monitors EC2 status checks, EBS volume health, network connectivity, and replication status. It uses a complex decision matrix to determine whether a failure has actually occurred and whether failover is the appropriate response. This process typically takes up to 10 seconds.</p>
<p>Once RDS decides to fail over, it enters the validation phase. It verifies that the standby is healthy, that replication is current, and that all network paths are working. This includes checking storage consistency and ensuring the standby database can actually take over. This typically takes another 5-15 seconds.</p>
<p>The actual failover begins with DNS changes. RDS updates the endpoint's CNAME record to point to the standby instance and adjusts the TTL to 5 seconds to speed up propagation. This process, including propagation time, typically takes 30-60 seconds.</p>
<p>Meanwhile, the promotion phase begins. The standby database stops recovery mode, replays any remaining transactions from its storage, and starts accepting connections. This process typically takes 15-30 seconds, running in parallel with DNS propagation.</p>
<p>Finally, RDS begins provisioning a new standby in the background. This doesn't affect database availability, but it's critical for maintaining high availability for future failures.</p>
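<p>You don't have to wait for a real failure to observe this sequence. Rebooting a Multi-AZ instance with failover forces RDS to promote the standby, which makes for a useful game-day drill. A minimal boto3 sketch, with a placeholder instance identifier:</p>
<pre><code class="lang-python">def failover_reboot_params(instance_id):
    """Arguments for rds.reboot_db_instance() that force a failover.

    ForceFailover=True tells RDS to promote the standby instead of
    rebooting the primary in place; it's only valid for Multi-AZ
    instances.
    """
    return {
        'DBInstanceIdentifier': instance_id,
        'ForceFailover': True,
    }

if __name__ == '__main__':
    import boto3  # imported here so the helper stays dependency-free
    rds = boto3.client('rds')
    # 'my-multi-az-db' is a placeholder; use your own identifier.
    rds.reboot_db_instance(**failover_reboot_params('my-multi-az-db'))
</code></pre>
<p>Run this while your application is under load and measure how long connections actually take to recover. That number, not the theoretical one, is your real failover budget.</p>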
<h2 id="heading-building-applications-that-handle-rds-failover">Building Applications That Handle RDS Failover</h2>
<p>Application design for Multi-AZ isn't just about handling database connection failures. You need to think about transaction retry logic, connection pooling, and how your application behaves during the transition period. Here's a Python example that illustrates some key concepts:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pymysql
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> contextlib <span class="hljs-keyword">import</span> contextmanager

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">RDSConnectionManager</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, host, user, password, database</span>):</span>
        self.db_config = {
            <span class="hljs-string">'host'</span>: host,
            <span class="hljs-string">'user'</span>: user,
            <span class="hljs-string">'password'</span>: password,
            <span class="hljs-string">'database'</span>: database
        }

<span class="hljs-meta">    @contextmanager</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_connection</span>(<span class="hljs-params">self</span>):</span>
        conn = <span class="hljs-literal">None</span>
        <span class="hljs-keyword">try</span>:
            conn = self._create_connection()
            <span class="hljs-keyword">yield</span> conn
        <span class="hljs-keyword">except</span> pymysql.Error <span class="hljs-keyword">as</span> e:
            <span class="hljs-keyword">if</span> self._should_retry(e):
                time.sleep(<span class="hljs-number">2</span>)  <span class="hljs-comment"># Basic backoff</span>
                conn = self._create_connection()
                <span class="hljs-keyword">yield</span> conn
            <span class="hljs-keyword">else</span>:
                <span class="hljs-keyword">raise</span>
        <span class="hljs-keyword">finally</span>:
            <span class="hljs-keyword">if</span> conn:
                conn.close()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_create_connection</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-keyword">return</span> pymysql.connect(**self.db_config)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_should_retry</span>(<span class="hljs-params">self, error</span>):</span>
        <span class="hljs-comment"># Add logic to determine if error is retryable</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
</code></pre>
<p>This code demonstrates connection handling, but real applications need more sophisticated retry logic and connection pooling. Your application should handle various types of errors. Network timeouts might occur during the DNS switch. Transactions might be rolled back during the promotion phase. Connections might fail with various errors depending on exactly when and how they fail. Each of these scenarios needs appropriate handling.</p>
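<p>As a sketch of what "more sophisticated" might look like, here's retry logic with exponential backoff and jitter, capped at a delay in the same ballpark as the failover window. The specific numbers and exception types are illustrative:</p>
<pre><code class="lang-python">import random
import time

def run_with_retries(operation, max_attempts=5, base_delay=0.5,
                     max_delay=30.0,
                     retryable=(ConnectionError, TimeoutError)):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Simulated query that fails twice, as it might mid-failover.
calls = {'count': 0}
def flaky_query():
    calls['count'] += 1
    if calls['count'] in (1, 2):
        raise ConnectionError('server has gone away')
    return 'row'

result = run_with_retries(flaky_query, base_delay=0.05)
</code></pre>
<p>The jitter matters: when a failover completes, every client that was waiting reconnects at once, and synchronized retries can overwhelm the freshly promoted primary.</p>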
<h2 id="heading-monitoring-and-troubleshooting">Monitoring and Troubleshooting</h2>
<p>Effective monitoring of Multi-AZ deployments requires watching several CloudWatch metrics. <code>ReplicaLag</code> tells you how far behind the standby is. <code>WriteIOPS</code> and <code>WriteLatency</code> help you understand replication performance. <code>ReadIOPS</code> and <code>ReadLatency</code> on the primary help you understand the workload.</p>
<p>But raw metrics aren't enough. You need to understand how these metrics relate to each other and what patterns indicate problems. High <code>WriteLatency</code> combined with increasing <code>ReplicaLag</code> might indicate replication problems. High <code>CPUUtilization</code> might explain increased <code>ReplicaLag</code>. The relationships between metrics often tell you more than individual metrics alone.</p>
<p>CloudWatch alarms should monitor for both immediate problems and trending issues. A spike in <code>ReplicaLag</code> needs immediate attention, but gradually increasing <code>WriteLatency</code> might indicate growing problems that need addressing before they cause failures.</p>
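<p>As a concrete starting point, here's what a CloudWatch alarm on <code>ReplicaLag</code> could look like with boto3. The threshold, evaluation window, and SNS topic ARN are placeholders to adapt to your workload:</p>
<pre><code class="lang-python">def replica_lag_alarm_params(instance_id, topic_arn, threshold_seconds=30):
    """Arguments for cloudwatch.put_metric_alarm() watching ReplicaLag."""
    return {
        'AlarmName': f'{instance_id}-replica-lag',
        'Namespace': 'AWS/RDS',
        'MetricName': 'ReplicaLag',
        'Dimensions': [{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
        'Statistic': 'Average',
        'Period': 60,            # one datapoint per minute
        'EvaluationPeriods': 3,  # must stay high for 3 consecutive minutes
        'Threshold': float(threshold_seconds),
        'ComparisonOperator': 'GreaterThanThreshold',
        'AlarmActions': [topic_arn],
    }

if __name__ == '__main__':
    import boto3  # imported here so the helper stays dependency-free
    cloudwatch = boto3.client('cloudwatch')
    # The topic ARN below is a placeholder.
    cloudwatch.put_metric_alarm(**replica_lag_alarm_params(
        'my-multi-az-db', 'arn:aws:sns:us-east-1:111122223333:db-alerts'))
</code></pre>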
<h2 id="heading-advanced-configurations-and-edge-cases">Advanced Configurations and Edge Cases</h2>
<p>Multi-AZ works with various database engines, but the details vary. MySQL and PostgreSQL handle recovery mode differently, which affects failover timing. Oracle has its own nuances around transaction replay. Understanding these engine-specific details helps you design better applications.</p>
<p>Parameter groups also affect Multi-AZ behavior. Settings that control durability and consistency can impact replication performance. Memory settings affect how quickly the standby can catch up after falling behind. Network timeout settings influence how quickly failures are detected.</p>
<p>Edge cases are particularly important to understand. What happens if both AZs have connectivity issues? How does RDS handle simultaneous instance and storage failures? What if DNS propagation is delayed? These scenarios are rare but understanding them helps you build more resilient systems. Note that this doesn't mean you need to ensure your application can handle these scenarios. Not doing anything is a valid response, but only if you understand the risk first.</p>
<p>Through this deep dive into RDS Multi-AZ, we've seen that while AWS handles much of the complexity, understanding the underlying mechanics helps you build better applications. From the basic architecture to complex failure scenarios, each aspect of Multi-AZ deployments has implications for your application's reliability and performance. So, now that you understand how that works in RDS, go build!</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Relational Databases on AWS: Comparing RDS and Aurora]]></title><description><![CDATA[There are two managed relational database services in AWS: Amazon Relational Database Service (RDS) and Amazon Aurora. Both provide the benefits of a fully managed database solution, but they have distinct features and use cases.
In this article, we'...]]></description><link>https://blog.guilleojeda.com/relational-databases-on-aws-comparing-rds-and-aurora</link><guid isPermaLink="true">https://blog.guilleojeda.com/relational-databases-on-aws-comparing-rds-and-aurora</guid><category><![CDATA[AWS]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Tue, 23 Apr 2024 23:40:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713915584392/e9aaedcc-757f-4edb-ae8b-400bc59d487e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are two managed relational database services in AWS: Amazon Relational Database Service (RDS) and Amazon Aurora. Both provide the benefits of a fully managed database solution, but they have distinct features and use cases.</p>
<p>In this article, we'll explore the key features and capabilities of RDS and Aurora, compare their differences, and provide guidance on choosing the right service for your application.</p>
<h2 id="heading-understanding-amazon-rds-relational-database-service">Understanding Amazon RDS (Relational Database Service)</h2>
<p>Let's start by taking a closer look at <strong>Amazon RDS</strong>, AWS's fully managed relational database service. RDS makes it easy to set up, operate, and scale a relational database in the cloud, supporting a wide range of database engines.</p>
<p>RDS Key Features:</p>
<ul>
<li><p>Fully managed database service</p>
</li>
<li><p>Supports multiple database engines: MySQL, PostgreSQL, Oracle, SQL Server, MariaDB</p>
</li>
<li><p>Automatic backups and point-in-time recovery</p>
</li>
<li><p>Multi-AZ deployments for high availability</p>
</li>
<li><p>Read replicas for read scalability</p>
</li>
<li><p>Vertical and horizontal scaling options</p>
</li>
</ul>
<p>RDS Instance Types and Storage:</p>
<p>RDS offers a variety of instance types optimized for different workloads and performance requirements. Instance types range from small burstable instances to large memory-optimized instances.</p>
<p>For storage, RDS provides three options:</p>
<ul>
<li><p>General Purpose (SSD): Balanced performance for a wide range of workloads</p>
</li>
<li><p>Provisioned IOPS (SSD): High-performance storage for I/O-intensive workloads</p>
</li>
<li><p>Magnetic: Cost-effective storage for infrequently accessed data</p>
</li>
</ul>
<p>Pricing:</p>
<p>With RDS, you pay for the database instance hours, storage, I/O requests, and data transfer. RDS pricing varies based on the database engine, instance type, storage type, and region.</p>
<h3 id="heading-rds-backup-and-restore">RDS Backup and Restore</h3>
<p>One of the key benefits of using RDS is the automated backup and restore capabilities. RDS provides two types of backups:</p>
<ul>
<li><p>Automated Backups: RDS automatically takes a daily snapshot of your database and continuously captures its transaction logs, allowing you to restore to any point in time within the retention period (up to 35 days).</p>
</li>
<li><p>Manual Snapshots: You can manually create database snapshots at any time, which are stored until you explicitly delete them.</p>
</li>
</ul>
<p>To restore a database from a backup, you simply create a new RDS instance and specify the backup to use. RDS handles the rest, creating a new instance with the restored data.</p>
<p>Point-in-time recovery (PITR) is another powerful feature of RDS. With PITR, you can restore your database to any point in time within the backup retention period, down to the second. This is particularly useful for recovering from accidental data modifications or deletions.</p>
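<p>With boto3, a point-in-time restore looks roughly like this. It always creates a brand-new instance and never modifies the source; the identifiers and timestamp are placeholders:</p>
<pre><code class="lang-python">from datetime import datetime, timezone

def pitr_restore_params(source_id, target_id, restore_time):
    """Arguments for rds.restore_db_instance_to_point_in_time()."""
    return {
        'SourceDBInstanceIdentifier': source_id,
        'TargetDBInstanceIdentifier': target_id,
        # Must fall within the backup retention window.
        'RestoreTime': restore_time,
    }

if __name__ == '__main__':
    import boto3  # imported here so the helper stays dependency-free
    rds = boto3.client('rds')
    rds.restore_db_instance_to_point_in_time(**pitr_restore_params(
        'prod-db', 'prod-db-restored',
        datetime(2024, 4, 20, 12, 0, 0, tzinfo=timezone.utc)))
</code></pre>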
<h3 id="heading-rds-high-availability-and-failover">RDS High Availability and Failover</h3>
<p>High availability is crucial for many applications, and RDS provides several options to ensure your database remains available in the event of a failure.</p>
<p>Multi-AZ Deployments:</p>
<p>With a Multi-AZ deployment, RDS automatically provisions and maintains a synchronous standby replica in a different Availability Zone (AZ). If the primary instance fails, RDS automatically fails over to the standby, minimizing downtime.</p>
<p>Multi-AZ deployments provide enhanced durability and fault tolerance, with failover typically completing within a minute or two. This is ideal for production workloads that require high availability.</p>
<p>Read Replicas:</p>
<p>Read replicas are separate database instances that are asynchronously replicated from the primary instance. They are used to offload read traffic from the primary instance and improve read scalability.</p>
<p>You can create up to 15 read replicas per primary instance (the limit is lower for some engines), within the same region or across different regions. Read replicas can be promoted to standalone instances if needed, providing a way to create independent databases for specific use cases.</p>
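<p>Creating a read replica is a single API call. A boto3 sketch with placeholder identifiers (a cross-region replica additionally takes the source region):</p>
<pre><code class="lang-python">def read_replica_params(source_id, replica_id):
    """Arguments for rds.create_db_instance_read_replica()."""
    return {
        'DBInstanceIdentifier': replica_id,
        'SourceDBInstanceIdentifier': source_id,
    }

if __name__ == '__main__':
    import boto3  # imported here so the helper stays dependency-free
    rds = boto3.client('rds')
    rds.create_db_instance_read_replica(**read_replica_params(
        'prod-db', 'prod-db-replica-1'))
</code></pre>
<p>Because this replication is asynchronous, reads from a replica can be slightly stale; route only lag-tolerant queries to it.</p>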
<h2 id="heading-understanding-amazon-aurora">Understanding Amazon Aurora</h2>
<p>Amazon Aurora is a fully managed relational database service that is compatible with MySQL and PostgreSQL. It offers the simplicity and cost-effectiveness of open-source databases with the performance and availability of commercial databases.</p>
<p>Aurora Key Features:</p>
<ul>
<li><p>MySQL and PostgreSQL compatible</p>
</li>
<li><p>High-performance storage and caching</p>
</li>
<li><p>Auto-scaling of read replicas</p>
</li>
<li><p>Serverless option for automatic scaling</p>
</li>
<li><p>Global database for multi-region deployments</p>
</li>
<li><p>Continuous backups and point-in-time restore</p>
</li>
</ul>
<p>Aurora Storage and Replication:</p>
<p>Aurora uses a distributed, fault-tolerant, and self-healing storage system that automatically scales up to 128 TiB per database cluster. It replicates six copies of your data across three Availability Zones, providing high durability and availability.</p>
<p>Aurora's storage is designed for fast, consistent performance. It uses a multi-tier caching architecture that includes an in-memory cache, a buffer pool, and a storage cache, reducing the need for disk I/O and improving performance.</p>
<p>Pricing:</p>
<p>With Aurora, you pay for the database instance hours, storage, I/O requests, and data transfer. Aurora pricing varies based on the database engine (MySQL or PostgreSQL), instance type, and region.</p>
<h3 id="heading-aurora-performance-and-scalability">Aurora Performance and Scalability</h3>
<p>One of the key advantages of Aurora is its high-performance storage and caching architecture. Aurora can deliver up to 5X the throughput of standard MySQL and 3X the throughput of standard PostgreSQL, without requiring any changes to your application code.</p>
<p>Auto-scaling Read Replicas:</p>
<p>With an auto scaling policy attached, Aurora adjusts the number of read replicas based on the workload, ensuring your database can handle read-heavy traffic patterns. As read traffic increases, Aurora adds new read replicas to the cluster, distributing the load across multiple instances, and removes them when traffic drops off.</p>
<p>Aurora Serverless:</p>
<p>For applications with unpredictable or intermittent workloads, Aurora Serverless provides a fully managed, auto-scaling configuration for Aurora MySQL and PostgreSQL. With Aurora Serverless, your database automatically starts up, shuts down, and scales capacity based on your application's needs.</p>
<p>This is particularly useful for development and testing environments, or applications with variable traffic patterns, as it eliminates the need to manage database capacity manually.</p>
<h3 id="heading-aurora-backup-and-restore">Aurora Backup and Restore</h3>
<p>Like RDS, Aurora provides automated continuous backups and point-in-time restore capabilities. However, Aurora takes it a step further with some additional features.</p>
<p>Continuous Backups:</p>
<p>Aurora automatically takes incremental backups of your database, continuously and transparently, with no impact on performance. These backups are stored in Amazon S3, providing 11 9's of durability.</p>
<p>Backup Retention:</p>
<p>Aurora backups are retained for a default period of 1 day, but you can configure this up to 35 days. Backups are automatically deleted when the retention period expires, or when the DB cluster is deleted.</p>
<p>Point-in-time Restore:</p>
<p>With Aurora, you can restore your database to any point in time within the backup retention period, down to the second. This is similar to RDS PITR, but with the added benefit of Aurora's distributed storage architecture, which enables faster restores.</p>
<p>Database Cloning:</p>
<p>Aurora allows you to create a new database cluster from an existing one, effectively "cloning" the database. This is useful for creating test or development environments, or for performing analytics on a copy of your production data without impacting the live database.</p>
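<p>Cloning is exposed through the point-in-time restore API with a copy-on-write restore type, so the clone shares unchanged storage pages with the source and is created quickly regardless of database size. A boto3 sketch with placeholder identifiers:</p>
<pre><code class="lang-python">def clone_cluster_params(source_cluster_id, clone_cluster_id):
    """Arguments for rds.restore_db_cluster_to_point_in_time() as a clone."""
    return {
        'SourceDBClusterIdentifier': source_cluster_id,
        'DBClusterIdentifier': clone_cluster_id,
        'RestoreType': 'copy-on-write',  # clone instead of a full restore
        'UseLatestRestorableTime': True,
    }

if __name__ == '__main__':
    import boto3  # imported here so the helper stays dependency-free
    rds = boto3.client('rds')
    rds.restore_db_cluster_to_point_in_time(**clone_cluster_params(
        'prod-cluster', 'prod-cluster-clone'))
</code></pre>
<p>Note that this creates the cluster only; you still need to add at least one DB instance to the clone before you can connect to it.</p>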
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-rds-vs-aurora-key-differences-and-use-cases">RDS vs Aurora: Key Differences and Use Cases</h2>
<p>Now that we've explored the key features and capabilities of RDS and Aurora, let's compare them side by side.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>RDS</td><td>Aurora</td></tr>
</thead>
<tbody>
<tr>
<td>Database Engines</td><td>MySQL, PostgreSQL, Oracle, SQL Server, MariaDB</td><td>MySQL, PostgreSQL</td></tr>
<tr>
<td>Performance</td><td>Good performance for general-purpose workloads</td><td>High-performance storage and caching, optimized for read-heavy workloads</td></tr>
<tr>
<td>Scalability</td><td>Vertical and horizontal scaling, read replicas</td><td>Auto-scaling read replicas, Aurora Serverless for automatic scaling</td></tr>
<tr>
<td>Availability</td><td>Multi-AZ deployments for high availability</td><td>Multi-AZ storage, Global Database for multi-region deployments</td></tr>
<tr>
<td>Backup and Restore</td><td>Automated backups, manual snapshots, point-in-time recovery</td><td>Continuous incremental backups, point-in-time restore, database cloning</td></tr>
<tr>
<td>Compatibility</td><td>Wide range of database engines, easy migration</td><td>MySQL and PostgreSQL compatible, requires migration effort</td></tr>
<tr>
<td>Cost</td><td>Cost-effective for general-purpose workloads</td><td>Higher cost, but better performance and scalability for demanding workloads</td></tr>
</tbody>
</table>
</div><p>Use Cases for RDS:</p>
<ul>
<li><p>Applications with moderate performance and scalability requirements</p>
</li>
<li><p>Workloads that require a specific database engine (e.g., Oracle, SQL Server)</p>
</li>
<li><p>Migrating existing on-premises databases to the cloud</p>
</li>
<li><p>Development and testing environments</p>
</li>
</ul>
<p>Use Cases for Aurora:</p>
<ul>
<li><p>Applications with high-performance and high-scalability requirements</p>
</li>
<li><p>Read-heavy workloads that can benefit from Aurora's caching and auto-scaling capabilities</p>
</li>
<li><p>Applications with unpredictable or variable traffic patterns (using Aurora Serverless)</p>
</li>
<li><p>Global applications that require multi-region database deployments</p>
</li>
</ul>
<h2 id="heading-choosing-the-right-relational-database-service">Choosing the Right Relational Database Service</h2>
<p>Choosing between RDS and Aurora depends on your specific application requirements and workload characteristics. Here are some key factors to consider:</p>
<p>Performance and Scalability:</p>
<p>If your application demands high performance and scalability, particularly for read-heavy workloads, Aurora is the better choice. Its high-performance storage and caching architecture, along with auto-scaling read replicas, make it well-suited for demanding applications.</p>
<p>Database Engine Compatibility:</p>
<p>If your application requires a specific database engine, such as Oracle or SQL Server, RDS is the way to go. RDS supports a wide range of database engines, making it easier to migrate existing applications to the cloud.</p>
<p>Cost Considerations:</p>
<p>For general-purpose workloads with moderate performance requirements, RDS is more cost-effective than Aurora. However, if your application requires the high performance and scalability of Aurora, the additional cost may be justified.</p>
<p>Existing Skills and Expertise:</p>
<p>If your team is already familiar with MySQL or PostgreSQL, both RDS and Aurora are good choices. However, if you have expertise with a specific database engine supported by RDS, such as Oracle or SQL Server, that may be a deciding factor.</p>
<h3 id="heading-when-to-use-rds">When to Use RDS</h3>
<ul>
<li><p>Migrating an existing on-premises database to the cloud</p>
</li>
<li><p>Applications with moderate performance and scalability requirements</p>
</li>
<li><p>Workloads that require a specific database engine not supported by Aurora</p>
</li>
<li><p>Development and testing environments</p>
</li>
</ul>
<p>Example: A web application with a backend database that requires SQL Server compatibility and has moderate traffic and performance requirements.</p>
<h3 id="heading-when-to-use-aurora">When to Use Aurora</h3>
<ul>
<li><p>Building a new, high-performance application from scratch</p>
</li>
<li><p>Applications with demanding read-heavy workloads</p>
</li>
<li><p>Serverless applications with unpredictable traffic patterns</p>
</li>
<li><p>Global applications that require multi-region database deployments</p>
</li>
</ul>
<p>Example: A large-scale e-commerce platform with millions of daily users, requiring high throughput and low latency for product catalog searches and user profile management.</p>
<h2 id="heading-best-practices-for-running-relational-databases-on-aws">Best Practices for Running Relational Databases on AWS</h2>
<p>Regardless of whether you choose RDS or Aurora, here are some best practices to keep in mind when running relational databases on AWS:</p>
<h3 id="heading-performance-optimization">Performance Optimization</h3>
<ul>
<li><p>Choose the appropriate instance type and size based on your workload requirements</p>
</li>
<li><p>Monitor CPU, memory, and I/O utilization to identify bottlenecks and optimize performance</p>
</li>
<li><p>Use caching solutions like ElastiCache to offload read traffic and improve performance</p>
</li>
</ul>
<h3 id="heading-security-best-practices">Security Best Practices</h3>
<ul>
<li><p>Use IAM roles and policies to control access to your database instances</p>
</li>
<li><p>Enable encryption at rest and in transit to protect sensitive data</p>
</li>
<li><p>Regularly apply security patches and updates to your database engine</p>
</li>
<li><p>Use VPC security groups to control network access to your database instances</p>
</li>
</ul>
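<p>As a sketch of what the encryption and patching items look like in practice, here's a minimal CloudFormation fragment (the resource names, instance class, and secret name are placeholders):</p>
<pre><code class="lang-yaml">Resources:
  MyDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: mysql
      DBInstanceClass: db.t3.medium
      AllocatedStorage: 100
      MasterUsername: admin
      # Pull the password from Secrets Manager instead of hardcoding it
      MasterUserPassword: '{{resolve:secretsmanager:my-db-secret:SecretString:password}}'
      StorageEncrypted: true          # encryption at rest (KMS)
      AutoMinorVersionUpgrade: true   # automatic minor-version security patches
      VPCSecurityGroups:
        - !Ref MyDatabaseSecurityGroup  # security group controlling network access
</code></pre>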
<h3 id="heading-monitoring-and-logging">Monitoring and Logging</h3>
<ul>
<li><p>Enable and configure Amazon CloudWatch for monitoring database metrics and setting alarms</p>
</li>
<li><p>Use AWS CloudTrail to log and audit API activity related to your database instances</p>
</li>
<li><p>Enable database engine-specific logging, such as MySQL slow query logs or PostgreSQL query planner statistics</p>
</li>
</ul>
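<p>For example, the MySQL slow query log from the last bullet can be enabled through a parameter group and exported to CloudWatch Logs. A minimal sketch (the family and thresholds are illustrative):</p>
<pre><code class="lang-yaml">Resources:
  SlowQueryParams:
    Type: AWS::RDS::DBParameterGroup
    Properties:
      Family: mysql8.0
      Description: Enable the MySQL slow query log
      Parameters:
        slow_query_log: 1
        long_query_time: 2   # log statements slower than 2 seconds
  # On the DB instance, attach the group and stream the log to CloudWatch Logs:
  #   DBParameterGroupName: !Ref SlowQueryParams
  #   EnableCloudwatchLogsExports:
  #     - slowquery
</code></pre>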
<h3 id="heading-scaling-and-high-availability">Scaling and High Availability</h3>
<ul>
<li><p>Use read replicas to scale read traffic and improve performance</p>
</li>
<li><p>Enable Multi-AZ deployments for high availability and automatic failover</p>
</li>
<li><p>Monitor replication lag and ensure it stays within acceptable limits</p>
</li>
<li><p>Test failover scenarios regularly to ensure your application can handle database failures gracefully</p>
</li>
</ul>
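<p>The Multi-AZ and read replica recommendations translate to just a couple of properties on the database resources. A minimal sketch, with placeholder names:</p>
<pre><code class="lang-yaml">Resources:
  PrimaryDB:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: postgres
      DBInstanceClass: db.r5.large
      AllocatedStorage: 100
      MasterUsername: dbadmin
      MasterUserPassword: '{{resolve:secretsmanager:my-db-secret:SecretString:password}}'
      MultiAZ: true   # synchronous standby in another AZ, automatic failover
  ReadReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      SourceDBInstanceIdentifier: !Ref PrimaryDB   # makes this a read replica
      DBInstanceClass: db.r5.large
</code></pre>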
<h2 id="heading-conclusion">Conclusion</h2>
<p>AWS provides two powerful managed services for running relational databases in the cloud: Amazon RDS and Amazon Aurora. RDS is a fully managed service that supports a wide range of database engines, making it a good choice for general-purpose workloads and migrating existing applications to the cloud. Aurora, on the other hand, is a high-performance, MySQL- and PostgreSQL-compatible database service that is well-suited for demanding, read-heavy workloads.</p>
<p>When choosing between RDS and Aurora, it's important to consider your application's specific requirements, including performance, scalability, compatibility, and cost. However, in cases where MySQL or PostgreSQL are suitable, Aurora is generally my preferred choice due to its advanced architecture and auto-scaling capabilities.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Containers on AWS: Comparing ECS and EKS]]></title><description><![CDATA[Containers offer a lightweight, portable, and scalable solution for running software consistently across different environments. But as the number of containers grows, managing them becomes increasingly complex. That's where container orchestration c...]]></description><link>https://blog.guilleojeda.com/containers-on-aws-comparing-ecs-and-eks</link><guid isPermaLink="true">https://blog.guilleojeda.com/containers-on-aws-comparing-ecs-and-eks</guid><category><![CDATA[AWS]]></category><category><![CDATA[containers]]></category><category><![CDATA[ECS]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Mon, 22 Apr 2024 23:24:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1713828219083/61d80502-8355-4202-ae0c-5555caa75e4d.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Containers offer a lightweight, portable, and scalable solution for running software consistently across different environments. But as the number of containers grows, managing them becomes increasingly complex. That's where container orchestration comes in.</p>
<p>AWS offers two powerful container orchestration services: <strong>Amazon Elastic Container Service (ECS)</strong> and <strong>Amazon Elastic Kubernetes Service (EKS)</strong>. Both services help you run and scale containerized applications, but they differ in their approach, features, and use cases.</p>
<p>In this article, I'll dive deep into the world of containers on AWS. I'll explore the key features and components of ECS and EKS, compare their similarities and differences, and provide guidance on choosing the right service for your needs. By the end, you'll have a solid understanding of how to leverage these services to build and manage containerized applications on AWS effectively.</p>
<h2 id="heading-understanding-amazon-ecs-elastic-container-service">Understanding Amazon ECS (Elastic Container Service)</h2>
<p>Let's start by looking at Amazon ECS, AWS's fully managed container orchestration service. ECS allows you to run and manage Docker containers at scale without worrying about the underlying infrastructure.</p>
<p>ECS Key Features:</p>
<ul>
<li><p>Fully managed container orchestration</p>
</li>
<li><p>Integration with other AWS services</p>
</li>
<li><p>Support for both EC2 and Fargate launch types</p>
</li>
<li><p>Built-in service discovery and load balancing</p>
</li>
<li><p>IAM integration for security and access control</p>
</li>
</ul>
<p>ECS Components:</p>
<ul>
<li><p>Clusters: Logical grouping of container instances or Fargate capacity</p>
</li>
<li><p>Task Definitions: Blueprints that describe how to run a container</p>
</li>
<li><p>Services: Maintain a specified number of task replicas and handle scaling</p>
</li>
<li><p>Tasks: Instantiations of a Task Definition, each representing one or more running containers</p>
</li>
</ul>
<p>ECS Launch Types:</p>
<p>ECS supports two launch types for running containers: EC2 and Fargate.</p>
<ul>
<li><p>EC2: You manage the EC2 instances that make up the ECS cluster. This gives you full control over the infrastructure but requires more management overhead.</p>
</li>
<li><p>Fargate: AWS manages the underlying infrastructure, and you only pay for the resources your containers consume. Fargate abstracts away the EC2 instances, making it easier to focus on your applications.</p>
</li>
</ul>
<p>Pricing:</p>
<p>With ECS, you pay for the AWS resources you use, such as EC2 instances, EBS volumes, and data transfer. Fargate pricing is based on the vCPU and memory resources consumed by your containers.</p>
<h3 id="heading-ecs-architecture-and-components">ECS Architecture and Components</h3>
<p>Let's take a closer look at the key components of ECS and how they work together.</p>
<p><strong>ECS Clusters</strong></p>
<p>An ECS cluster is a logical grouping of container instances or Fargate capacity. It provides the infrastructure to run your containers. You can create clusters using the AWS Management Console, AWS CLI, or CloudFormation templates.</p>
<p><strong>Task Definitions</strong></p>
<p>A Task Definition is a JSON file that describes how to run a container. It specifies the container image, CPU and memory requirements, networking settings, and other configuration details. Task Definitions act as blueprints for creating and running tasks.</p>
<p><strong>Services</strong></p>
<p>An ECS Service maintains a specified number of task replicas and handles scaling. It ensures that the desired number of tasks are running and automatically replaces any failed tasks. Services integrate with ELB for load balancing and service discovery.</p>
<p><strong>Tasks</strong></p>
<p>A Task is an instantiation of a Task Definition, representing one or more running containers. When you create a task, ECS launches the containers on a suitable container instance or on Fargate capacity, based on the Task Definition and launch type.</p>
<p>Example service configuration, referencing a cluster and a Task Definition:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"cluster"</span>: <span class="hljs-string">"my-cluster"</span>,
  <span class="hljs-attr">"taskDefinition"</span>: <span class="hljs-string">"my-task-definition"</span>,
  <span class="hljs-attr">"desiredCount"</span>: <span class="hljs-number">2</span>,
  <span class="hljs-attr">"launchType"</span>: <span class="hljs-string">"FARGATE"</span>,
  <span class="hljs-attr">"networkConfiguration"</span>: {
    <span class="hljs-attr">"awsvpcConfiguration"</span>: {
      <span class="hljs-attr">"subnets"</span>: [
        <span class="hljs-string">"subnet-12345678"</span>,
        <span class="hljs-string">"subnet-87654321"</span>
      ],
      <span class="hljs-attr">"securityGroups"</span>: [
        <span class="hljs-string">"sg-12345678"</span>
      ],
      <span class="hljs-attr">"assignPublicIp"</span>: <span class="hljs-string">"ENABLED"</span>
    }
  }
}
</code></pre>
<h3 id="heading-ecs-launch-types-ec2-vs-fargate">ECS Launch Types: EC2 vs Fargate</h3>
<p>One key decision when using ECS is choosing between the EC2 and Fargate launch types.</p>
<p><strong>EC2 Launch Type</strong></p>
<p>With the EC2 launch type, you manage the EC2 instances that make up your ECS cluster. This gives you full control over the infrastructure, including instance types, scaling, and networking. However, it also means more management overhead, as you're responsible for patching, scaling, and securing the instances.</p>
<p>Use cases for EC2 launch type:</p>
<ul>
<li><p>Workloads that require specific instance types or configurations</p>
</li>
<li><p>Applications that need to access underlying host resources</p>
</li>
<li><p>Scenarios where you want full control over the infrastructure</p>
</li>
</ul>
<p><strong>Fargate Launch Type</strong></p>
<p>Fargate is a serverless compute engine for containers. It abstracts away the underlying infrastructure, allowing you to focus on your applications. With Fargate, you specify the CPU and memory requirements for your tasks, and ECS manages the rest.</p>
<p>Benefits of Fargate:</p>
<ul>
<li><p>No need to manage EC2 instances or clusters</p>
</li>
<li><p>Pay only for the resources your containers consume</p>
</li>
<li><p>Automatic scaling based on task resource requirements</p>
</li>
<li><p>Simplified infrastructure management</p>
</li>
</ul>
<p>Example of running a containerized application using Fargate:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">Resources:</span>
  <span class="hljs-attr">MyFargateService:</span>
    <span class="hljs-attr">Type:</span> <span class="hljs-string">AWS::ECS::Service</span>
    <span class="hljs-attr">Properties:</span>
      <span class="hljs-attr">Cluster:</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">MyCluster</span>
      <span class="hljs-attr">TaskDefinition:</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">MyTaskDefinition</span>
      <span class="hljs-attr">DesiredCount:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">LaunchType:</span> <span class="hljs-string">FARGATE</span>
      <span class="hljs-attr">NetworkConfiguration:</span>
        <span class="hljs-attr">AwsvpcConfiguration:</span>
          <span class="hljs-attr">AssignPublicIp:</span> <span class="hljs-string">ENABLED</span>
          <span class="hljs-attr">Subnets:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">SubnetA</span>
            <span class="hljs-bullet">-</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">SubnetB</span>
          <span class="hljs-attr">SecurityGroups:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-type">!Ref</span> <span class="hljs-string">MySecurityGroup</span>
</code></pre>
<h2 id="heading-understanding-amazon-eks-elastic-kubernetes-service">Understanding Amazon EKS (Elastic Kubernetes Service)</h2>
<p>Now let's shift gears and explore Amazon EKS, a managed Kubernetes service that makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.</p>
<p>EKS Key Features:</p>
<ul>
<li><p>Fully managed Kubernetes control plane</p>
</li>
<li><p>Integration with AWS services and Kubernetes community tools</p>
</li>
<li><p>Automatic provisioning and scaling of worker nodes</p>
</li>
<li><p>Support for both managed and self-managed node groups</p>
</li>
<li><p>Built-in security and compliance features</p>
</li>
</ul>
<p><strong>EKS Architecture</strong></p>
<p>EKS consists of two main components:</p>
<ul>
<li><p>EKS Control Plane: The control plane is a managed Kubernetes master that runs in an AWS-managed account. It provides the Kubernetes API server, etcd, and other core components.</p>
</li>
<li><p>Worker Nodes: Worker nodes are EC2 instances that run your containers and are registered with the EKS cluster. You can create and manage worker nodes using EKS managed node groups or self-managed worker nodes.</p>
</li>
</ul>
<p>Pricing:</p>
<p>With EKS, you pay a flat hourly rate for each cluster's control plane, plus the AWS resources you use, such as EC2 instances for worker nodes, EBS volumes, and data transfer.</p>
<h3 id="heading-eks-architecture-and-components">EKS Architecture and Components</h3>
<p>Let's dive deeper into the EKS architecture and its key components.</p>
<p><strong>EKS Control Plane</strong></p>
<p>The EKS control plane is a managed Kubernetes master that runs in an AWS-managed account. It provides the following components:</p>
<ul>
<li><p>Kubernetes API Server: The primary interface for interacting with the Kubernetes cluster</p>
</li>
<li><p>etcd: The distributed key-value store used by Kubernetes to store cluster state</p>
</li>
<li><p>Scheduler: Responsible for scheduling pods onto worker nodes based on resource requirements and constraints</p>
</li>
<li><p>Controller Manager: Manages the core control loops in Kubernetes, such as replica sets and deployments</p>
</li>
</ul>
<p><strong>Worker Nodes</strong></p>
<p>Worker nodes are EC2 instances that run your containers and are registered with the EKS cluster. Each worker node runs the following components:</p>
<ul>
<li><p>Kubelet: The primary node agent that communicates with the Kubernetes API server and manages the container runtime</p>
</li>
<li><p>Container Runtime: The runtime environment for running containers, such as Docker or containerd</p>
</li>
<li><p>Kube-proxy: Maintains network rules and performs connection forwarding for Kubernetes services</p>
</li>
</ul>
<p>Example EKS Cluster Configuration:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">eksctl.io/v1alpha5</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterConfig</span>

<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">my-eks-cluster</span>
  <span class="hljs-attr">region:</span> <span class="hljs-string">us-west-2</span>

<span class="hljs-attr">managedNodeGroups:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">my-node-group</span>
    <span class="hljs-attr">instanceType:</span> <span class="hljs-string">t3.medium</span>
    <span class="hljs-attr">minSize:</span> <span class="hljs-number">1</span>
    <span class="hljs-attr">maxSize:</span> <span class="hljs-number">3</span>
    <span class="hljs-attr">desiredCapacity:</span> <span class="hljs-number">2</span>
</code></pre>
<h3 id="heading-eks-managed-vs-self-managed-node-groups">EKS Managed vs Self-Managed Node Groups</h3>
<p>EKS provides two options for managing worker nodes: managed node groups and self-managed worker nodes.</p>
<p><strong>EKS Managed Node Groups</strong></p>
<p>EKS managed node groups automate the provisioning and lifecycle management of worker nodes. Key features include:</p>
<ul>
<li><p>Automatic provisioning and scaling of worker nodes</p>
</li>
<li><p>Integration with AWS services like VPC and IAM</p>
</li>
<li><p>Managed updates and patching for worker nodes</p>
</li>
<li><p>Simplified cluster autoscaler configuration</p>
</li>
</ul>
<p><strong>Self-Managed Worker Nodes</strong></p>
<p>With self-managed worker nodes, you have full control over the provisioning and management of worker nodes. This allows for more customization but also requires more effort to set up and maintain.</p>
<p>Example of creating an EKS managed node group:</p>
<pre><code class="lang-bash">eksctl create nodegroup --cluster my-eks-cluster --name my-node-group --node-type t3.medium --nodes 2 --nodes-min 1 --nodes-max 3
</code></pre>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-ecs-vs-eks-key-differences-and-use-cases">ECS vs EKS: Key Differences and Use Cases</h2>
<p>Now that we've explored the key features and components of ECS and EKS, let's compare them side by side.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>ECS</td><td>EKS</td></tr>
</thead>
<tbody>
<tr>
<td>Orchestration</td><td>AWS-native orchestration</td><td>Kubernetes orchestration</td></tr>
<tr>
<td>Control Plane</td><td>Fully managed by AWS</td><td>Managed Kubernetes control plane</td></tr>
<tr>
<td>Infrastructure Management</td><td>Managed (Fargate) or self-managed (EC2)</td><td>Managed or self-managed worker nodes</td></tr>
<tr>
<td>Ecosystem and Tooling</td><td>AWS-native tooling and integrations</td><td>Kubernetes-native tooling and integrations</td></tr>
<tr>
<td>Learning Curve</td><td>Simpler, AWS-specific concepts</td><td>Steeper, requires Kubernetes knowledge</td></tr>
<tr>
<td>Portability</td><td>Tied to AWS ecosystem</td><td>Portable across Kubernetes-compatible platforms</td></tr>
</tbody>
</table>
</div><p>Use cases for ECS:</p>
<ul>
<li><p>Simpler containerized applications</p>
</li>
<li><p>Workloads that heavily utilize AWS services</p>
</li>
<li><p>Teams more familiar with AWS ecosystem</p>
</li>
<li><p>Serverless applications using Fargate</p>
</li>
</ul>
<p>Use cases for EKS:</p>
<ul>
<li><p>Complex, large-scale containerized applications</p>
</li>
<li><p>Workloads that require Kubernetes-specific features</p>
</li>
<li><p>Teams with Kubernetes expertise</p>
</li>
<li><p>Applications that need to be portable across cloud providers</p>
</li>
</ul>
<h2 id="heading-choosing-the-right-container-orchestration-service">Choosing the Right Container Orchestration Service</h2>
<p>Choosing between ECS and EKS depends on various factors specific to your application and organizational needs.</p>
<p>Factors to consider:</p>
<ul>
<li><p>Application complexity and scalability</p>
</li>
<li><p>Team's skills and familiarity with AWS and Kubernetes</p>
</li>
<li><p>Integration with existing tools and workflows</p>
</li>
<li><p>Long-term container strategy and portability requirements</p>
</li>
</ul>
<h3 id="heading-when-to-use-ecs">When to use ECS</h3>
<ul>
<li><p>Simpler applications with a limited number of microservices</p>
</li>
<li><p>Workloads that primarily use AWS services</p>
</li>
<li><p>Teams more comfortable with AWS tools and concepts</p>
</li>
<li><p>Serverless applications that can benefit from Fargate</p>
</li>
</ul>
<p>Example: A web application consisting of a frontend service, backend API, and database, all running on ECS with Fargate.</p>
<h3 id="heading-when-to-use-eks">When to use EKS</h3>
<ul>
<li><p>Complex applications with a large number of microservices</p>
</li>
<li><p>Workloads that require Kubernetes-specific features like Custom Resource Definitions (CRDs)</p>
</li>
<li><p>Teams with extensive Kubernetes experience</p>
</li>
<li><p>Applications that need to be portable across cloud providers</p>
</li>
</ul>
<p>Example: A large-scale machine learning platform running on EKS, leveraging Kubeflow and other Kubernetes-native tools.</p>
<h2 id="heading-best-practices-for-container-orchestration-on-aws">Best Practices for Container Orchestration on AWS</h2>
<p>Regardless of whether you choose ECS or EKS, here are some best practices to keep in mind:</p>
<ul>
<li><p>Use infrastructure as code (IaC) tools like CloudFormation or Terraform to manage your container orchestration resources</p>
</li>
<li><p>Implement a robust CI/CD pipeline to automate container builds, testing, and deployment</p>
</li>
<li><p>Leverage AWS services like ECR for container image registry and ELB for load balancing</p>
</li>
<li><p>Use IAM roles and policies to enforce least privilege access to AWS resources</p>
</li>
<li><p>Monitor your containerized applications using tools like CloudWatch, Prometheus, or Grafana</p>
</li>
<li><p>Optimize costs by right-sizing your instances, using Spot Instances when appropriate, and leveraging reserved capacity</p>
</li>
</ul>
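<p>Several of these practices come together in the task definition itself. As a hedged CloudFormation sketch (the roles, names, and ECR image URI are placeholders), a Fargate task with a least-privilege task role that ships its logs to CloudWatch could look like this:</p>
<pre><code class="lang-yaml">Resources:
  MyTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: my-app
      RequiresCompatibilities:
        - FARGATE
      NetworkMode: awsvpc
      Cpu: '256'
      Memory: '512'
      TaskRoleArn: !Ref MyTaskRole            # least-privilege role the app assumes
      ExecutionRoleArn: !Ref MyExecutionRole  # lets ECS pull from ECR and write logs
      ContainerDefinitions:
        - Name: web
          Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: /ecs/my-app
              awslogs-region: us-east-1
              awslogs-stream-prefix: web
</code></pre>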
<h2 id="heading-conclusion">Conclusion</h2>
<p>AWS provides two powerful services for container orchestration: ECS and EKS.</p>
<p>ECS is a fully managed service that offers simplicity and deep integration with the AWS ecosystem. It's well-suited for simpler containerized applications and teams more familiar with AWS tools and concepts.</p>
<p>On the other hand, EKS is a managed Kubernetes service that provides the full power and flexibility of Kubernetes. It's ideal for complex, large-scale applications and teams with Kubernetes expertise.</p>
<p>Ultimately, the choice between ECS and EKS depends on your application requirements, team skills, and long-term container strategy. By understanding the key features, differences, and use cases of each service, you can make an informed decision and build scalable, resilient containerized applications on AWS.</p>
<p>Still, I prefer ECS =)</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Monitoring and Troubleshooting on AWS: CloudWatch, X-Ray, and Beyond]]></title><description><![CDATA[As an AWS user, I'm sure you know that monitoring and troubleshooting are essential for keeping your applications running smoothly. After all, you can't fix what you can't see. But with the sheer number of services and tools available on AWS, it can ...]]></description><link>https://blog.guilleojeda.com/aws-monitoring-troubleshooting-cloudwatch-xray</link><guid isPermaLink="true">https://blog.guilleojeda.com/aws-monitoring-troubleshooting-cloudwatch-xray</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Sat, 23 Mar 2024 16:33:24 GMT</pubDate><content:encoded><![CDATA[<p>As an AWS user, I'm sure you know that monitoring and troubleshooting are essential for keeping your applications running smoothly. After all, you can't fix what you can't see. But with the sheer number of services and tools available on AWS, it can be overwhelming to know where to start.</p>
<p>That's where this article comes in. We'll dive into AWS monitoring and troubleshooting, with some key services like CloudWatch and X-Ray, along with other tools and best practices. By the end, you'll have a better understanding of how to effectively monitor and troubleshoot your AWS applications, so you can spend less time fighting fires and more time building cool stuff.</p>
<h2 id="heading-understanding-aws-cloudwatch">Understanding AWS CloudWatch</h2>
<p>At the heart of AWS monitoring is CloudWatch, a powerful service that collects monitoring and operational data in the form of logs, metrics, and events. Think of it as the central nervous system of your AWS environment, constantly keeping track of everything that's going on.</p>
<h3 id="heading-cloudwatch-metrics">CloudWatch Metrics</h3>
<p>One of the core components of CloudWatch is metrics. CloudWatch Metrics are data points that represent the performance and health of your AWS resources over time. AWS services automatically send metrics to CloudWatch, and you can also publish your own custom metrics.</p>
<p>For example, EC2 instances automatically send metrics like CPU utilization, network traffic, and disk I/O to CloudWatch. RDS databases send metrics like database connections, read/write latency, and free storage space. By monitoring these metrics, you can get a clear picture of how your resources are performing and identify potential issues before they impact your users.</p>
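<p>Custom metrics are published through the <code>PutMetricData</code> API. As an illustration (the namespace, metric name, and dimension here are hypothetical), a metric-data payload you could pass to <code>aws cloudwatch put-metric-data --namespace MyApp --metric-data file://metrics.json</code> might look like this:</p>
<pre><code class="lang-json">[
  {
    "MetricName": "OrdersProcessed",
    "Dimensions": [
      { "Name": "Service", "Value": "checkout" }
    ],
    "Unit": "Count",
    "Value": 17
  }
]
</code></pre>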
<h3 id="heading-cloudwatch-logs">CloudWatch Logs</h3>
<p>Another key feature of CloudWatch is logs. CloudWatch Logs allows you to collect, monitor, and store log files from various sources, including EC2 instances, Lambda functions, and on-premises servers. You can use CloudWatch Logs to troubleshoot issues, analyze application behavior, and gain insights into user activity.</p>
<p>One of the most powerful features of CloudWatch Logs is the ability to filter and search log data. You can use simple text searches or complex query syntax to find specific log events, making it easy to identify errors, exceptions, or other issues. With CloudWatch Logs Insights, you can even perform real-time log analytics, allowing you to quickly investigate and resolve problems.</p>
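<p>For instance, a Logs Insights query that surfaces the 20 most recent error events in a log group looks like this (the <code>/ERROR/</code> pattern is a placeholder for whatever your application actually logs):</p>
<pre><code>fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
</code></pre>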
<h3 id="heading-cloudwatch-alarms">CloudWatch Alarms</h3>
<p>Of course, collecting metrics and logs is only half the battle. You also need a way to proactively detect and respond to issues. That's where CloudWatch Alarms come in.</p>
<p>CloudWatch Alarms allow you to set thresholds for your metrics and receive notifications when those thresholds are breached. For example, you could create an alarm that triggers when the CPU utilization of an EC2 instance exceeds 80% for more than 5 minutes. When the alarm is triggered, you can have CloudWatch send an email, SMS message, or push notification to your team, or even perform automated actions like scaling up your instances or triggering a Lambda function.</p>
<p>When setting up alarms, it's important to strike a balance between being proactive and being spammed with notifications. A good rule of thumb is to focus on metrics that directly impact the user experience or the stability of your application. You should also carefully consider the thresholds and time periods for your alarms to avoid false positives.</p>
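<p>The CPU example above can be expressed as a CloudFormation resource. This sketch assumes an SNS topic (<code>MyAlertsTopic</code>) defined elsewhere in the template, and uses a placeholder instance ID:</p>
<pre><code class="lang-yaml">Resources:
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: EC2 CPU above 80% for more than 5 minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: i-0123456789abcdef0   # placeholder
      Statistic: Average
      Period: 300             # one 5-minute datapoint (EC2 basic monitoring granularity)
      EvaluationPeriods: 1
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref MyAlertsTopic  # notify the team via SNS
</code></pre>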
<h3 id="heading-cloudwatch-dashboards">CloudWatch Dashboards</h3>
<p>Finally, CloudWatch Dashboards provide a way to visualize your metrics and logs in a single, customizable view. Dashboards allow you to create graphs, tables, and other widgets based on your CloudWatch data, giving you a real-time overview of your application's health and performance.</p>
<p>When creating dashboards, it's important to focus on the metrics and logs that are most relevant to your team and your users. You should also use clear and concise labels and annotations to help your team quickly understand the data being presented. And don't forget to share your dashboards with your team members, so everyone has access to the same information.</p>
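<p>Dashboards can also be managed as code: the dashboard body is a JSON document. A minimal one-widget example (the region, metric source, and Auto Scaling group name are placeholders):</p>
<pre><code class="lang-json">{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "title": "Web tier CPU",
        "region": "us-east-1",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", "web-asg"]
        ],
        "stat": "Average",
        "period": 300
      }
    }
  ]
}
</code></pre>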
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-aws-x-ray-distributed-tracing-for-microservices">AWS X-Ray: Distributed Tracing for Microservices</h2>
<p>While CloudWatch is great for monitoring individual resources and services, it doesn't provide a complete picture of how requests flow through your application. That's where AWS X-Ray comes in.</p>
<p>X-Ray is a distributed tracing service that allows you to track requests as they move through your application, helping you identify performance bottlenecks, errors, and other issues. X-Ray is especially useful for troubleshooting microservices architectures, where requests often span multiple services and resources.</p>
<h3 id="heading-instrumenting-applications-for-x-ray">Instrumenting Applications for X-Ray</h3>
<p>To use X-Ray, you first need to instrument your application code to send tracing data to the X-Ray service. AWS provides X-Ray SDKs for popular programming languages like Java, Node.js, Python, and .NET, which make it easy to add tracing to your code.</p>
<p>When instrumenting your code, it's important to follow best practices like using meaningful segment names, adding annotations and metadata to your traces, and handling errors gracefully. You should also be careful not to over-instrument your code, as this can add unnecessary overhead and complexity.</p>
<h3 id="heading-tracing-requests-with-x-ray">Tracing Requests with X-Ray</h3>
<p>Once your application is instrumented, X-Ray will automatically capture and visualize traces as requests flow through your system. The X-Ray service map provides a high-level overview of your application architecture, showing how services and resources are connected and how requests are routed between them.</p>
<p>By drilling down into individual traces, you can see detailed information about each segment of the request, including response times, errors, and other metadata. This makes it easy to identify performance bottlenecks, such as slow database queries or high network latency, and pinpoint the root cause of issues.</p>
<p>X-Ray also integrates with other AWS services, allowing you to trace requests as they move between services like API Gateway, Lambda, and DynamoDB. This provides a complete end-to-end view of your application, making it easier to troubleshoot issues that span multiple services.</p>
<h3 id="heading-analyzing-and-visualizing-traces">Analyzing and Visualizing Traces</h3>
<p>The X-Ray console provides a powerful interface for analyzing and visualizing your tracing data. You can use the console to view the service map, examine individual traces, and filter and group traces based on various attributes like response time, error rate, or user agent.</p>
<p>One of the most useful features of the X-Ray console is filter expressions, which let you narrow the trace list down to exactly the requests you care about, such as slow or faulted ones. You can save a filter expression as an X-Ray group to share that view with your team, and X-Ray publishes metrics for each group to CloudWatch.</p>
<p>You can also integrate X-Ray with CloudWatch, allowing you to create alarms based on X-Ray metrics and visualize X-Ray data alongside other CloudWatch metrics. This provides a more comprehensive view of your application's health and performance, making it easier to identify and resolve issues.</p>
<h2 id="heading-monitoring-serverless-applications-on-aws">Monitoring Serverless Applications on AWS</h2>
<p>Serverless architectures, such as those based on AWS Lambda and Step Functions, present unique challenges when it comes to monitoring and troubleshooting. Because serverless functions are ephemeral and can scale rapidly, traditional monitoring approaches may not be effective.</p>
<h3 id="heading-monitoring-aws-lambda-with-cloudwatch">Monitoring AWS Lambda with CloudWatch</h3>
<p>One of the key tools for monitoring AWS Lambda is CloudWatch Logs. By default, Lambda sends log output to CloudWatch Logs, allowing you to view and search log data in real-time. You can use CloudWatch Logs to troubleshoot issues, analyze function behavior, and gain insights into performance and usage patterns.</p>
<p>In addition to logs, Lambda also sends metrics to CloudWatch, including invocations, duration, errors, and throttles. By monitoring these metrics, you can identify performance issues, detect anomalies, and set up alarms to proactively notify you of problems.</p>
<p>When monitoring Lambda functions, it's important to correlate logs and metrics to get a complete picture of function behavior. For example, if you notice a spike in function duration, you can use CloudWatch Logs to investigate the root cause, such as a slow database query or a network issue.</p>
<h3 id="heading-monitoring-aws-step-functions-with-x-ray">Monitoring AWS Step Functions with X-Ray</h3>
<p>For more complex serverless workflows, such as those based on AWS Step Functions, X-Ray can be a powerful tool for monitoring and troubleshooting. By enabling X-Ray tracing for your Step Functions, you can visualize the execution flow of your state machines, identify performance bottlenecks, and pinpoint the root cause of errors.</p>
<p>X-Ray integrates seamlessly with Step Functions, automatically capturing traces as executions move through the state machine. You can use the X-Ray console to view the service map, examine individual executions, and filter and group traces based on various attributes.</p>
<p>One of the most useful features of X-Ray for Step Functions is the ability to correlate traces across Lambda functions and other AWS services. This allows you to see how data flows through your application, identify performance issues, and troubleshoot errors that span multiple services.</p>
<h2 id="heading-other-aws-monitoring-and-troubleshooting-tools">Other AWS Monitoring and Troubleshooting Tools</h2>
<p>While CloudWatch and X-Ray are the core tools for monitoring and troubleshooting on AWS, there are many other services and features that can help you keep your applications running smoothly. Here are a few worth mentioning:</p>
<h3 id="heading-amazon-eventbridge">Amazon EventBridge</h3>
<p>EventBridge is a serverless event bus that makes it easy to build event-driven architectures on AWS. With EventBridge, you can monitor events from a wide range of sources, including AWS services, SaaS applications, and custom applications, and trigger automated actions based on those events.</p>
<p>For example, you could use EventBridge to monitor EC2 instance state changes, capture S3 bucket events, or detect changes to your AWS resources using CloudTrail. You can then use EventBridge rules to trigger Lambda functions, send SNS notifications, or perform other actions in response to those events.</p>
<h3 id="heading-aws-config">AWS Config</h3>
<p>AWS Config is a service that helps you assess, audit, and evaluate the configurations of your AWS resources. With Config, you can continuously monitor and record resource configurations, and receive notifications when those configurations change.</p>
<p>Config is particularly useful for troubleshooting issues related to resource misconfigurations or compliance violations. For example, you could use Config to detect when an S3 bucket is made publicly accessible, or when an EC2 instance is launched without the required security group.</p>
<h3 id="heading-vpc-flow-logs">VPC Flow Logs</h3>
<p>VPC Flow Logs is a feature that allows you to capture information about the IP traffic going to and from network interfaces in your VPC. You can create flow logs at the VPC, subnet, or network interface level, and use the data to gain insights into traffic patterns, security issues, and performance bottlenecks.</p>
<p>Flow Logs can be particularly useful for troubleshooting connectivity issues, detecting unusual traffic patterns, and investigating security incidents. You can use tools like Amazon Athena or Amazon CloudWatch Logs Insights to analyze Flow Log data and identify issues.</p>
<h2 id="heading-best-practices-for-monitoring-and-troubleshooting-on-aws">Best Practices for Monitoring and Troubleshooting on AWS</h2>
<p>Effective monitoring and troubleshooting on AWS requires more than just the right tools and services. It also requires a well-defined strategy, clear objectives, and a commitment to continuous improvement. Here are some best practices to keep in mind:</p>
<ol>
<li><p>Establish clear monitoring and troubleshooting objectives. What are the key metrics and logs that matter most to your application and your users? What are your target response times and error rates? By setting clear objectives upfront, you can focus your monitoring and troubleshooting efforts where they'll have the biggest impact.</p>
</li>
<li><p>Create a comprehensive monitoring strategy. Your monitoring strategy should cover all aspects of your application, from infrastructure and application metrics to logs and traces. It should also define clear roles and responsibilities for your team, as well as processes for incident response and escalation.</p>
</li>
<li><p>Implement proactive and reactive troubleshooting processes. Proactive troubleshooting involves using monitoring data to identify and resolve issues before they impact users. Reactive troubleshooting involves quickly identifying and resolving issues when they do occur. Both approaches are essential for maintaining a reliable and performant application.</p>
</li>
<li><p>Leverage automation and Infrastructure as Code. Automation and Infrastructure as Code (IaC) can help you ensure consistency and reliability across your monitoring and troubleshooting processes. By defining your monitoring configuration as code, you can version control your settings, test changes before applying them, and quickly roll back if needed.</p>
</li>
<li><p>Continuously optimize your approach. Monitoring and troubleshooting is an ongoing process, not a one-time setup. As your application evolves and your usage patterns change, you'll need to continuously optimize your monitoring and troubleshooting approach to ensure it remains effective. This may involve adding new metrics and logs, adjusting alarm thresholds, or refining your troubleshooting processes.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Monitoring and troubleshooting are essential skills for any AWS user, whether you're running a simple web application or a complex microservices architecture. By using tools like CloudWatch and X-Ray, plus other AWS services and best practices, you can gain deep visibility into your application's behavior and quickly resolve issues when they occur.</p>
<p>But effective monitoring and troubleshooting is about more than just tools and technology. It's also about having a clear strategy, well-defined processes, and a culture of continuous improvement. By setting clear objectives, implementing proactive and reactive troubleshooting approaches, and continuously optimizing your monitoring and troubleshooting practices, you can build more reliable, performant, and resilient applications on AWS.</p>
<p>So don't wait until something breaks to start thinking about monitoring and troubleshooting. Start implementing these best practices today, and you'll be well on your way to building better applications on AWS.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Security in AWS: IAM Best Practices and Advanced Techniques]]></title><description><![CDATA[AWS IAM (Identity and Access Management) is the backbone of any AWS security strategy. It's the service that controls who can access your AWS resources and what actions they can perform. Get IAM right, and you're well on your way to a secure cloud de...]]></description><link>https://blog.guilleojeda.com/security-in-aws-iam-best-practices-and-advanced-techniques</link><guid isPermaLink="true">https://blog.guilleojeda.com/security-in-aws-iam-best-practices-and-advanced-techniques</guid><category><![CDATA[Cloud]]></category><category><![CDATA[Security]]></category><category><![CDATA[AWS]]></category><category><![CDATA[IAM]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Wed, 20 Mar 2024 00:41:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1710895203981/6d878ff5-a69d-4333-a1d7-55f758126945.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS IAM (Identity and Access Management) is the backbone of any AWS security strategy. It's the service that controls who can access your AWS resources and what actions they can perform. Get IAM right, and you're well on your way to a secure cloud deployment. Mess it up, and you're leaving the door wide open for all sorts of security nightmares.</p>
<p>In this article, we'll dive deep into IAM best practices and advanced techniques to help you lock down your AWS environment like a pro. We'll start with the fundamentals, then move on to more advanced topics like granular access control, cross-account access, and automating IAM with Infrastructure as Code. By the end, you'll have a solid understanding of how to use IAM to secure your AWS resources and protect your sensitive data.</p>
<h2 id="heading-understanding-iam-fundamentals">Understanding IAM Fundamentals</h2>
<p>Before we jump into the best practices and advanced techniques, let's make sure we're all on the same page with the IAM basics. Understanding these foundational concepts is crucial for designing and implementing an effective IAM strategy.</p>
<h3 id="heading-iam-users-groups-and-roles">IAM Users, groups, and roles</h3>
<p>At the core of IAM are three main identity types: users, groups, and roles. IAM users represent individual people or applications that need access to your AWS resources. IAM groups are collections of IAM users, making it easier to manage permissions for multiple users at once. IAM roles are a bit different: they're not associated with a specific user, but rather are used by AWS services or external identities that need temporary access to your resources.</p>
<h3 id="heading-iam-policies-and-permissions">IAM policies and permissions</h3>
<p>IAM policies are JSON documents that define permissions for IAM identities. They specify what actions an identity can perform on which AWS resources. Policies can be attached to IAM users, groups, or roles, or even directly to AWS resources (more on that later).</p>
<h3 id="heading-resource-based-policies-vs-identity-based-policies">Resource-based policies vs. identity-based policies</h3>
<p>There are two main types of IAM policies: identity-based policies and resource-based policies. Identity-based policies are attached to IAM identities (users, groups, or roles) and define what actions those identities can perform on which resources. Resource-based policies, on the other hand, are attached directly to AWS resources (like S3 buckets or KMS keys) and define who can access those resources and what actions they can perform.</p>
<h3 id="heading-how-iam-interacts-with-other-aws-services">How IAM interacts with other AWS services</h3>
<p>IAM is deeply integrated with other AWS services. It's used to control access to virtually every AWS resource, from EC2 instances to S3 buckets to Lambda functions. Many AWS services also have their own resource-based policies that work in conjunction with IAM policies to provide fine-grained access control.</p>
<h2 id="heading-iam-best-practices">IAM Best Practices</h2>
<p>Now that we've got the fundamentals down, let's dive into some IAM best practices that every AWS user should follow.</p>
<h3 id="heading-principle-of-least-privilege">Principle of least privilege</h3>
<p>The principle of least privilege is the golden rule of IAM. It means only granting users the permissions they need to perform their job duties; no more, no less. This helps minimize the blast radius if a user's credentials are compromised, and makes it easier to audit and manage permissions over time.</p>
<h3 id="heading-proper-iam-user-and-role-management">Proper IAM user and role management</h3>
<p>Managing IAM users and roles can get complex, especially in large organizations. Some key best practices include:</p>
<ul>
<li><p>Create individual IAM users for each person who needs access to AWS, rather than sharing credentials</p>
</li>
<li><p>Use IAM roles for applications and services that need access to AWS resources</p>
</li>
<li><p>Regularly review and remove unused IAM users and roles</p>
</li>
</ul>
<h3 id="heading-using-iam-groups-for-better-organization">Using IAM groups for better organization</h3>
<p>IAM groups make it easier to manage permissions for multiple users at once. By creating groups for different job functions or teams, you can assign permissions at the group level rather than individually. This makes it easier to onboard new users and ensure consistent permissions across your organization.</p>
<h3 id="heading-password-policies-and-mfa-enforcement">Password policies and MFA enforcement</h3>
<p>Strong password policies and multi-factor authentication (MFA) are critical for protecting your IAM users. AWS allows you to set password policies that enforce minimum length, complexity, and rotation requirements. You should also require MFA for all IAM users, especially those with administrative privileges.</p>
<h3 id="heading-regularly-reviewing-and-rotating-iam-credentials">Regularly reviewing and rotating IAM credentials</h3>
<p>Over time, IAM users can accumulate unnecessary permissions, and credentials can become stale or compromised. That's why it's important to regularly review IAM users and their permissions, and rotate access keys and passwords on a regular basis. AWS recommends rotating access keys every 90 days, and immediately revoking credentials for users who leave your organization.</p>
<h3 id="heading-avoiding-use-of-root-user-account">Avoiding use of root user account</h3>
<p>The root user account has unrestricted access to all AWS resources in your account, making it a prime target for attackers. Best practice is to avoid using the root user account for day-to-day tasks, and instead create individual IAM users with specific permissions. You should also enable MFA on the root user account and use it only for tasks that absolutely require root privileges.</p>
<h3 id="heading-implementing-granular-access-control">Implementing Granular Access Control</h3>
<p>One of the most powerful features of IAM is the ability to create fine-grained policies that precisely control access to your AWS resources. Here are some techniques for implementing granular access control:</p>
<h3 id="heading-creating-fine-grained-iam-policies">Creating fine-grained IAM policies</h3>
<p>When creating IAM policies, it's important to be as specific as possible. Instead of granting broad permissions like <code>s3:*</code>, grant only the specific actions needed, like <code>s3:GetObject</code> or <code>s3:PutObject</code>. You can also restrict access to specific resources using ARNs (Amazon Resource Names), and limit permissions to specific IP ranges or VPC endpoints.</p>
<h3 id="heading-using-policy-conditions-for-more-precise-control">Using policy conditions for more precise control</h3>
<p>IAM policy conditions allow you to further refine permissions based on specific criteria. For example, you can use conditions to allow access only during certain time windows, from specific IP ranges, or for requests that include certain headers or parameters.</p>
<h3 id="heading-leveraging-iam-policy-variables">Leveraging IAM policy variables</h3>
<p>IAM policy variables allow you to create dynamic policies that adapt to the request context. For example, you can use the <code>aws:username</code> variable in a resource ARN to grant each user access to their own home directory in an S3 bucket, or the <code>aws:SourceIp</code> condition key to restrict access based on the requester's IP address.</p>
<h3 id="heading-combining-multiple-policies-for-complex-permissions">Combining multiple policies for complex permissions</h3>
<p>In some cases, you may need to combine multiple policies to achieve the desired level of access control. For example, you might use an identity-based policy to grant broad permissions to a group of users, then use a resource-based policy to further restrict access to specific resources.</p>
<h2 id="heading-real-world-examples-of-granular-access-control-in-aws">Real-world examples of granular access control in AWS</h2>
<p>Let's look at a couple of real-world examples of granular access control in action:</p>
<h3 id="heading-granting-read-only-access-to-an-s3-bucket-for-a-specific-iam-user">Granting read-only access to an S3 bucket for a specific IAM user</h3>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Sid"</span>: <span class="hljs-string">"ReadOnlyAccess"</span>,
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: [
        <span class="hljs-string">"s3:GetObject"</span>,
        <span class="hljs-string">"s3:ListBucket"</span>
      ],
      <span class="hljs-attr">"Resource"</span>: [
        <span class="hljs-string">"arn:aws:s3:::my-bucket"</span>,
        <span class="hljs-string">"arn:aws:s3:::my-bucket/*"</span>
      ]
    }
  ]
}
</code></pre>
<h3 id="heading-allowing-an-ec2-instance-to-access-s3-but-only-from-a-specific-vpc-endpoint">Allowing an EC2 instance to access S3, but only from a specific VPC endpoint</h3>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Sid"</span>: <span class="hljs-string">"AccessFromVPCEndpoint"</span>,
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"s3:*"</span>,
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"*"</span>,
      <span class="hljs-attr">"Condition"</span>: {
        <span class="hljs-attr">"StringEquals"</span>: {
          <span class="hljs-attr">"aws:sourceVpce"</span>: <span class="hljs-string">"vpce-1a2b3c4d"</span>
        }
      }
    }
  ]
}
</code></pre>
<h2 id="heading-cross-account-access-and-iam-roles">Cross-Account Access and IAM Roles</h2>
<p>In many organizations, you'll need to grant access to AWS resources across multiple accounts. That's where IAM roles and cross-account access come in.</p>
<h3 id="heading-understanding-cross-account-access">Understanding cross-account access</h3>
<p>Cross-account access allows IAM users or roles in one AWS account to access resources in another account. This is useful for scenarios like granting developers access to a production account, or allowing a central security team to monitor multiple accounts.</p>
<h3 id="heading-using-iam-roles-for-secure-access-delegation">Using IAM roles for secure access delegation</h3>
<p>IAM roles are the preferred way to grant cross-account access. Instead of sharing access keys or passwords, you create an IAM role in the target account and grant permissions to the trusted entity (user or role) in the source account. The trusted entity can then assume the role and access resources in the target account.</p>
<h3 id="heading-assuming-roles-vs-using-access-keys">Assuming roles vs. using access keys</h3>
<p>When accessing resources across accounts, it's best to assume an IAM role rather than using access keys. Access keys are long-term credentials that can be easily leaked or compromised, while IAM roles provide temporary, short-lived credentials that automatically expire.</p>
<h3 id="heading-best-practices-for-managing-cross-account-access">Best practices for managing cross-account access</h3>
<p>Some best practices for managing cross-account access include:</p>
<ul>
<li><p>Use IAM roles for cross-account access instead of sharing long-term access keys</p>
</li>
<li><p>Limit the permissions granted to cross-account roles to the minimum necessary</p>
</li>
<li><p>Regularly review and audit cross-account access</p>
</li>
<li><p>Use external IDs to prevent the confused deputy problem</p>
</li>
</ul>
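<p>To show what that external ID guardrail looks like in practice, here's a hypothetical trust policy for a cross-account role: only the named account can assume it, and only when it presents the agreed external ID. The account ID and external ID are placeholders.</p>

```python
# Hypothetical trust policy for a cross-account role. The external ID
# condition protects against the confused deputy problem: a third party
# can only assume the role on behalf of the customer who shared that ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # placeholder account
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "partner-42"}},  # placeholder ID
    }],
}
# You would attach this when creating the role, e.g.:
# iam.create_role(RoleName="CrossAccountAudit",
#                 AssumeRolePolicyDocument=json.dumps(trust_policy))
```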
<h2 id="heading-securing-access-to-aws-resources">Securing Access to AWS Resources</h2>
<p>In addition to identity-based policies, AWS also supports resource-based policies that allow you to control access to specific resources like S3 buckets, KMS keys, and Lambda functions.</p>
<h3 id="heading-using-resource-based-policies-eg-s3-bucket-policies">Using resource-based policies (e.g., S3 bucket policies)</h3>
<p>Resource-based policies are attached directly to an AWS resource and define who can access that resource and what actions they can perform. For example, an S3 bucket policy can allow read access to objects from a specific IP range, or deny all public access to the bucket.</p>
<h3 id="heading-combining-resource-based-and-identity-based-policies">Combining resource-based and identity-based policies</h3>
<p>Resource-based policies work in conjunction with identity-based policies to provide comprehensive access control. When an IAM user or role tries to access a resource, AWS evaluates both the identity-based policies attached to the user or role and the resource-based policy attached to the resource. An explicit deny in either policy blocks the request. For requests within a single account, an explicit allow in either policy type is enough; for cross-account requests, both the identity-based policy in the calling account and the resource-based policy on the target resource must allow the access.</p>
<h3 id="heading-vpc-endpoints-and-iam-policies">VPC endpoints and IAM policies</h3>
<p>VPC endpoints allow you to securely access AWS services from within your VPC, without traversing the public internet. You can use IAM policies to control access to VPC endpoints, ensuring that only authorized users or roles can access the services behind the endpoint.</p>
<h3 id="heading-securing-access-to-api-gateway-and-lambda">Securing access to API Gateway and Lambda</h3>
<p>API Gateway and Lambda are powerful tools for building serverless applications, but they also introduce new security challenges. Best practices for securing access to these services include:</p>
<ul>
<li><p>Use IAM roles to grant Lambda functions access to other AWS services</p>
</li>
<li><p>Implement OAuth or JWT authentication for APIs</p>
</li>
<li><p>Use API keys and usage plans to control access to APIs</p>
</li>
<li><p>Enable AWS WAF to protect against common web exploits</p>
</li>
</ul>
<h3 id="heading-protecting-sensitive-data-with-kms-and-iam">Protecting sensitive data with KMS and IAM</h3>
<p>AWS Key Management Service (KMS) allows you to encrypt your sensitive data using centrally managed keys. IAM policies can be used to control access to KMS keys, ensuring that only authorized users or roles can encrypt or decrypt data.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-centralized-iam-management-with-aws-organizations">Centralized IAM Management with AWS Organizations</h2>
<p>For organizations with multiple AWS accounts, managing IAM across all those accounts can be a challenge. That's where AWS Organizations comes in.</p>
<h3 id="heading-benefits-of-using-aws-organizations">Benefits of using AWS Organizations</h3>
<p>AWS Organizations allows you to centrally manage access across multiple accounts. You can create an organization, invite accounts to join, and then use Service Control Policies (SCPs) to enforce IAM policies across all accounts in the organization.</p>
<h3 id="heading-setting-up-an-organization-and-creating-member-accounts">Setting up an organization and creating member accounts</h3>
<p>To get started with AWS Organizations, you create an organization and invite existing accounts to join, or create new accounts directly within the organization. You can organize accounts into Organizational Units (OUs) to apply policies hierarchically.</p>
<h3 id="heading-implementing-service-control-policies-scps">Implementing Service Control Policies (SCPs)</h3>
<p>Service Control Policies are a powerful feature of AWS Organizations that allow you to centrally control what actions can be performed by IAM users and roles across all accounts in your organization. SCPs look like IAM policies, but they never grant permissions by themselves: they define the maximum permissions available in an account, and can be used to enforce security best practices and compliance requirements organization-wide.</p>
<h3 id="heading-delegating-access-across-accounts-with-iam-roles">Delegating access across accounts with IAM roles</h3>
<p>In addition to SCPs, AWS Organizations also simplifies cross-account access using IAM roles. You can create a role in a central account and grant access to users or roles in other accounts within the organization. This allows you to centrally manage permissions while still enabling teams to access the resources they need.</p>
<h3 id="heading-best-practices-for-aws-organizations">Best practices for AWS Organizations</h3>
<p>Some best practices for using AWS Organizations include:</p>
<ul>
<li><p>Use SCPs to enforce security best practices and compliance requirements</p>
</li>
<li><p>Implement a least privilege model, granting only the permissions necessary for each account</p>
</li>
<li><p>Use AWS CloudTrail to monitor IAM activity across all accounts</p>
</li>
<li><p>Regularly review and audit IAM policies and roles</p>
</li>
<li><p>Use automation tools like AWS CloudFormation to manage IAM resources consistently across accounts</p>
</li>
</ul>
<h2 id="heading-monitoring-and-auditing-iam-activity-with-aws-cloudtrail">Monitoring and Auditing IAM Activity with AWS CloudTrail</h2>
<p>Monitoring and auditing IAM activity is critical for detecting and responding to security incidents. AWS CloudTrail is a powerful tool for tracking IAM activity across your AWS accounts.</p>
<h3 id="heading-importance-of-monitoring-iam-events">Importance of monitoring IAM events</h3>
<p>By monitoring IAM events, you can detect suspicious activity like unauthorized access attempts, changes to IAM policies, or creation of new IAM users or roles. This allows you to quickly investigate and respond to potential security breaches.</p>
<h3 id="heading-using-aws-cloudtrail-to-track-iam-actions">Using AWS CloudTrail to track IAM actions</h3>
<p>AWS CloudTrail logs all API calls made to IAM, including who made the call, what actions were performed, and what resources were affected. You can use CloudTrail to create a complete audit trail of IAM activity in your account.</p>
<h3 id="heading-monitoring-iam-events-with-amazon-cloudwatch">Monitoring IAM events with Amazon CloudWatch</h3>
<p>In addition to CloudTrail, you can use Amazon CloudWatch to monitor IAM events in real-time. CloudWatch allows you to create alarms based on specific IAM events, like failed login attempts or changes to sensitive policies.</p>
<h3 id="heading-detecting-and-alerting-on-suspicious-iam-activity">Detecting and alerting on suspicious IAM activity</h3>
<p>By combining CloudTrail and CloudWatch, you can create a comprehensive monitoring and alerting system for IAM. Some best practices include:</p>
<ul>
<li><p>Create alarms for high-risk events like IAM policy changes or root account usage</p>
</li>
<li><p>Use CloudTrail Insights to detect unusual activity patterns</p>
</li>
<li><p>Integrate with SIEM tools like Splunk or AWS Security Hub for centralized monitoring</p>
</li>
</ul>
<h3 id="heading-conducting-regular-iam-audits-and-compliance-checks">Conducting regular IAM audits and compliance checks</h3>
<p>In addition to real-time monitoring, it's important to conduct regular IAM audits to ensure your policies and permissions are configured correctly and comply with your security and compliance requirements. Tools like AWS IAM Access Analyzer and AWS Config can help automate this process.</p>
<h2 id="heading-advanced-iam-security-features">Advanced IAM Security Features</h2>
<p>These are some of the more advanced features of AWS IAM, along with a few related services, that will help you secure your AWS accounts and workloads.</p>
<h3 id="heading-iam-access-analyzer">IAM Access Analyzer</h3>
<p>AWS IAM Access Analyzer is a powerful tool for identifying unintended access to your AWS resources. It analyzes your IAM policies and resource-based policies to determine who has access to your resources and whether that access is intended.</p>
<p>IAM Access Analyzer can help you identify scenarios like:</p>
<ul>
<li><p>Public access to S3 buckets or other resources</p>
</li>
<li><p>Access granted to external AWS accounts</p>
</li>
<li><p>Overly permissive IAM policies</p>
</li>
</ul>
<p>By identifying these issues early, you can take corrective action before they lead to a security breach.</p>
<h3 id="heading-iam-permission-boundaries">IAM Permission Boundaries</h3>
<p>IAM Permission Boundaries are a way to limit the maximum permissions that can be granted to an IAM user or role. They're useful for scenarios like allowing developers to create their own IAM policies, but ensuring they can't grant themselves excessive permissions.</p>
<p>To implement a permission boundary, you create an IAM policy that defines the maximum permissions allowed, then attach that policy as a permission boundary to an IAM user or role. Any policies attached to the user or role are evaluated within the constraints of the permission boundary.</p>
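<p>Conceptually, the effective permissions end up being the intersection of the identity policy and the boundary. A toy sketch of that evaluation, with made-up action lists, and ignoring explicit denies and resource-level detail for simplicity:</p>

```python
# A permission boundary caps effective permissions: an action is only
# effectively allowed if BOTH the identity policy and the boundary
# allow it. Action lists here are hypothetical; explicit denies and
# resource-level detail are ignored for simplicity.
def effective_allowed(identity_actions, boundary_actions):
    return set(identity_actions) & set(boundary_actions)

boundary = {"s3:GetObject", "s3:PutObject", "dynamodb:GetItem"}
# A developer tries to grant themselves IAM access on top of S3 reads:
identity = {"s3:GetObject", "iam:CreateUser"}
print(sorted(effective_allowed(identity, boundary)))  # ['s3:GetObject']
```

<p>The developer's attempt to grant <code>iam:CreateUser</code> is silently capped by the boundary, which is exactly the self-service scenario boundaries are designed for.</p>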
<h3 id="heading-iam-policy-conditions">IAM Policy Conditions</h3>
<p>IAM Policy Conditions allow you to create more fine-grained access control policies based on specific attributes of a request, like the source IP address, time of day, or presence of multi-factor authentication.</p>
<p>Some examples of using IAM policy conditions include:</p>
<ul>
<li><p>Allowing access only during business hours</p>
</li>
<li><p>Requiring multi-factor authentication for sensitive actions</p>
</li>
<li><p>Restricting access to specific IP ranges or VPC endpoints</p>
</li>
</ul>
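<p>For example, a policy statement combining two real condition keys, <code>aws:MultiFactorAuthPresent</code> and <code>aws:SourceIp</code>, might look like the following. The bucket ARN and CIDR range are placeholders:</p>

```python
import json

# Sketch of a policy statement requiring MFA and restricting the
# source IP range. aws:MultiFactorAuthPresent and aws:SourceIp are
# real condition keys; the bucket ARN and CIDR are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::example-bucket/*",
        "Condition": {
            "Bool": {"aws:MultiFactorAuthPresent": "true"},
            "IpAddress": {"aws:SourceIp": "203.0.113.0/24"},
        },
    }],
}
print(json.dumps(policy, indent=2))
```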
<h3 id="heading-iam-identity-center-for-aws-sso">IAM Identity Center for AWS SSO</h3>
<p>IAM Identity Center (formerly AWS Single Sign-On) is a centralized access management service that allows users to sign in once and access multiple AWS accounts and cloud applications.</p>
<p>With IAM Identity Center, you can create and manage user identities in a central directory, then assign permissions to those users across multiple AWS accounts. Users sign in once to the IAM Identity Center portal, then access their assigned accounts and applications without needing to manage separate credentials.</p>
<h3 id="heading-integrating-iam-identity-center-with-third-party-identity-providers">Integrating IAM Identity Center with third-party identity providers</h3>
<p>IAM Identity Center also allows you to integrate with third-party identity providers like Azure AD, Okta, or Ping Identity. This allows you to use your existing identity management system to control access to AWS, without needing to recreate user identities in IAM.</p>
<h2 id="heading-automating-iam-with-infrastructure-as-code-tools">Automating IAM with Infrastructure as Code Tools</h2>
<p>As your AWS environment grows, managing IAM policies and roles manually becomes increasingly difficult. That's where Infrastructure as Code (IaC) tools like AWS CloudFormation, Terraform, and the AWS CDK come in.</p>
<h3 id="heading-benefits-of-using-infrastructure-as-code-iac-for-iam">Benefits of using Infrastructure as Code (IaC) for IAM</h3>
<p>By defining your IAM resources as code, you can:</p>
<ul>
<li><p>Version control your IAM policies and roles</p>
</li>
<li><p>Automate the creation and updates of IAM resources</p>
</li>
<li><p>Ensure consistency across multiple AWS accounts and regions</p>
</li>
<li><p>Easily roll back changes if needed</p>
</li>
</ul>
<h3 id="heading-using-aws-cloudformation-to-manage-iam-resources">Using AWS CloudFormation to manage IAM resources</h3>
<p>AWS CloudFormation is a native AWS service that allows you to define your infrastructure as code using JSON or YAML templates. You can use CloudFormation to create and manage IAM users, groups, roles, and policies across multiple accounts and regions.</p>
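<p>A minimal template for an IAM role that EC2 instances can assume might look like this, shown as the Python-dictionary equivalent of the JSON template. The logical name and the managed policy attached are illustrative placeholders:</p>

```python
import json

# Minimal CloudFormation template (JSON form, built as a Python dict)
# defining an IAM role assumable by EC2. The logical name and the
# managed policy ARN are illustrative placeholders.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AppServerRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Principal": {"Service": "ec2.amazonaws.com"},
                        "Action": "sts:AssumeRole",
                    }],
                },
                "ManagedPolicyArns": [
                    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
                ],
            },
        },
    },
}
# Serialize to the JSON you'd hand to CloudFormation:
print(json.dumps(template, indent=2))
```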
<h3 id="heading-terraform-and-aws-cdk-for-iam-automation">Terraform and AWS CDK for IAM automation</h3>
<p>Terraform and the AWS Cloud Development Kit (CDK) are popular third-party IaC tools that support IAM resource management. Terraform uses a declarative language called HCL (HashiCorp Configuration Language) to define infrastructure resources, while the AWS CDK allows you to define infrastructure using familiar programming languages like JavaScript, TypeScript, Python, or Java.</p>
<h3 id="heading-best-practices-for-iam-automation-and-version-control">Best practices for IAM automation and version control</h3>
<p>When automating IAM with IaC tools, it's important to follow best practices like:</p>
<ul>
<li><p>Storing your IaC templates in a version control system like Git</p>
</li>
<li><p>Using separate AWS accounts for development, staging, and production environments</p>
</li>
<li><p>Implementing a code review process for IAM changes</p>
</li>
<li><p>Using tools like AWS CloudTrail and AWS Config to monitor and audit IAM changes</p>
</li>
</ul>
<p>By treating your IAM resources as code and following these best practices, you can ensure consistency, maintainability, and auditability of your IAM configuration.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>IAM is a critical component of securing your AWS environment, but it can be really complex and challenging to manage at scale. By following best practices like the <strong>principle of least privilege</strong>, <strong>using IAM roles for cross-account access</strong>, and implementing <strong>strong password policies and MFA</strong>, you can lay a solid foundation for your IAM strategy.</p>
<p>But to truly secure your accounts and environments, you need to go beyond the basics. Techniques like <strong>granular access control with policy conditions</strong>, <strong>resource-based policies</strong>, and <strong>permission boundaries</strong> allow you to implement fine-grained security policies that precisely control access to your resources. <strong>Centralized management with AWS Organizations</strong> and <strong>monitoring with CloudTrail and CloudWatch</strong> provide visibility and actionable data across your entire AWS environment.</p>
<p>As your AWS usage grows, <strong>automating IAM with Infrastructure as Code</strong> tools like CloudFormation, Terraform, and the AWS CDK becomes increasingly important. By defining your IAM resources as code and following best practices for version control and testing, you can ensure consistency and maintainability of your IAM configuration.</p>
<p>Securing your AWS environment is an ongoing process, not a one-time task. As you adopt new AWS services and your application requirements evolve, it's important to continually review and update your IAM policies to ensure they align with your security goals. Regular audits and compliance checks, along with automated monitoring and alerting, can help you stay on top of your IAM configuration and quickly detect and respond to potential issues.</p>
<p>By following the best practices and techniques outlined in this article, you can build a robust and secure IAM strategy that helps you protect your critical AWS resources and data. But don't stop here! Continue to explore and adopt new security services and features like AWS GuardDuty, AWS Security Hub, and AWS Secrets Manager to further strengthen your security posture.</p>
<p>Remember, security is a shared responsibility between AWS and you, the customer. By taking a proactive and layered approach to IAM and security, you can ensure that your AWS environment is protected against evolving threats and ready to support your business needs for years to come.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Disaster Recovery Strategies on AWS: Ensuring Business Continuity]]></title><description><![CDATA[We're now living in the world of immediate and always-on stuff, where even a few minutes of downtime can be a disaster for businesses. Customers expect 24/7 availability, and any interruption in service can lead to lost revenue, damaged reputation, a...]]></description><link>https://blog.guilleojeda.com/disaster-recovery-strategies-on-aws</link><guid isPermaLink="true">https://blog.guilleojeda.com/disaster-recovery-strategies-on-aws</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Fri, 15 Mar 2024 01:07:48 GMT</pubDate><content:encoded><![CDATA[<p>We're now living in the world of immediate and always-on stuff, where even a few minutes of downtime can be a disaster for businesses. Customers expect 24/7 availability, and any interruption in service can lead to lost revenue, damaged reputation, and even legal consequences. That's where disaster recovery (DR) and business continuity planning come into play.</p>
<p>Disaster recovery is all about preparing for the worst-case scenarios—those unexpected events that can bring your systems to a halt. Whether it's a natural disaster, human error, or a cyber-attack (which is often also caused by human error), having a solid DR plan in place can make the difference between a minor hiccup and a catastrophic failure.</p>
<p>Amazon Web Services (AWS) offers a wide variety of services and features to help you build robust, resilient architectures that can continue operating in the event of a disaster. In this article, we'll explore the key concepts and strategies for implementing effective disaster recovery on AWS.</p>
<h2 id="heading-understanding-rto-and-rpo">Understanding RTO and RPO</h2>
<p>Before we dive into specific DR strategies, let's take a moment to define two critical metrics: <strong>Recovery Time Objective (RTO)</strong> and <strong>Recovery Point Objective (RPO)</strong>.</p>
<p>RTO is the maximum acceptable amount of time your systems can be down when a disaster occurs. In other words, it's the timeframe within which you need to restore your applications and data to avoid unacceptable consequences. For example, a financial trading platform might have an RTO of just a few minutes, while a less critical internal tool might have an RTO of several hours.</p>
<p>RPO, on the other hand, refers to the maximum acceptable amount of data loss your business can tolerate. It's determined by how frequently you take backups and how much data you're willing to lose in the event of a disaster. For instance, an e-commerce site might have an RPO of just a few seconds, meaning they can only afford to lose a very small amount of data, while a blog might be okay with losing a day's worth of content.</p>
<p>Your RTO and RPO will heavily influence your choice of DR strategies. The tighter your objectives, the more robust (and expensive) your DR solution will need to be. It's all about finding the right balance between cost and risk.</p>
<h2 id="heading-designing-a-highly-available-architecture-on-aws">Designing a Highly Available Architecture on AWS</h2>
<p>The first step before even thinking about disaster recovery is a highly available architecture. On AWS, that means leveraging multiple Availability Zones (AZs) to build redundancy and fault tolerance into your applications.</p>
<p>AWS operates a global network of data centers, grouped into regions and further subdivided into AZs. Each AZ is a fully isolated partition of the AWS infrastructure, with independent power, cooling, and networking. By deploying your applications across multiple AZs within a region, you can protect against failures at the data center level.</p>
<p>Of course, building a highly available architecture involves more than just spreading your resources across AZs. You'll also need to implement load balancing and auto-scaling to distribute traffic evenly and automatically adjust capacity based on demand. Services like Amazon EC2 Auto Scaling and Elastic Load Balancing make this easy to achieve.</p>
<p>But what if an entire region goes down? That's where multi-region architectures come into play. By replicating your data and applications across multiple AWS regions, you can ensure that even if an entire region becomes unavailable, your business can continue to operate from another location.</p>
<p>That is what we call Disaster Recovery.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-disaster-recovery-strategies-on-aws">Disaster Recovery Strategies on AWS</h2>
<p>Now that we've covered the basics of high availability, let's explore <a target="_blank" href="https://newsletter.simpleaws.dev/p/disaster-recovery-strategies-aws?utm_source=blog&amp;utm_medium=hashnode">four common DR strategies</a> you can implement on AWS: backup and restore, pilot light, warm standby, and multi-site active-active.</p>
<h3 id="heading-backup-and-restore-strategy">Backup and Restore Strategy</h3>
<p>The backup and restore strategy is the most basic and cost-effective approach to DR on AWS. It involves taking regular backups of your data and storing them in a secure, durable location like Amazon S3. In the event of a disaster, you can restore your systems from the most recent backup. While simple, this strategy typically involves significant downtime, as you'll need to provision new infrastructure and restore your data before your applications can be brought back online. It's best suited for non-critical workloads with lenient RTO and RPO requirements.</p>
<h3 id="heading-pilot-light-strategy">Pilot Light Strategy</h3>
<p>The pilot light strategy involves keeping a minimal version of your environment running in a secondary region, ready to scale up quickly in the event of a disaster. Core components, like your database servers, are always on, but application servers are kept in a stopped state to minimize costs. When disaster strikes, you can quickly start up your application servers, scale them out to handle the full production load, and redirect traffic to the secondary region. This approach offers faster recovery times than the backup and restore strategy, but still involves some downtime.</p>
<h3 id="heading-warm-standby-strategy">Warm Standby Strategy</h3>
<p>The warm standby strategy takes the pilot light approach a step further. Instead of keeping your secondary environment in a minimal state, you maintain a scaled-down version of your full production environment in the secondary region, with all components running. In the event of a disaster, you can rapidly scale up the secondary environment to handle the full production load. This strategy provides even faster recovery times than the pilot light approach, but comes with higher ongoing costs.</p>
<h3 id="heading-multi-site-active-active-strategy">Multi-Site Active-Active Strategy</h3>
<p>The multi-site active-active strategy is the most comprehensive and expensive DR approach. It involves running your full production environment in multiple regions simultaneously, with each region serving traffic and replicating data in real-time. If one region fails, traffic is automatically routed to the other active region(s) without any interruption in service. This strategy provides the highest level of availability and the fastest recovery times, but also incurs the highest costs, as you're essentially running multiple copies of your entire infrastructure.</p>
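<p>The trade-off between these four strategies can be sketched as a rough decision helper. The thresholds below are illustrative, not AWS guidance; the point is just that tighter RTO/RPO targets push you toward the more expensive strategies:</p>

```python
# Rough mapping from recovery objectives to the four DR strategies.
# Thresholds are illustrative, not AWS guidance; the pattern is that
# tighter RTO/RPO targets require more expensive strategies.
def suggest_dr_strategy(rto_minutes, rpo_minutes):
    if rto_minutes < 1 or rpo_minutes < 1:
        return "multi-site active-active"
    if rto_minutes < 30:
        return "warm standby"
    if rto_minutes < 240:
        return "pilot light"
    return "backup and restore"

print(suggest_dr_strategy(rto_minutes=5, rpo_minutes=15))       # warm standby
print(suggest_dr_strategy(rto_minutes=1440, rpo_minutes=1440))  # backup and restore
```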
<h2 id="heading-how-to-create-backups-on-aws">How to Create Backups on AWS</h2>
<p>Regardless of which DR strategy you choose, creating regular backups is a critical component of any DR plan. AWS offers several backup services and features to help you protect your data.</p>
<p>For Amazon EC2 instances, you can create point-in-time snapshots of your EBS volumes, which can be used to restore your instances to a previous state. You can automate the creation and management of EBS snapshots using AWS Backup, a fully managed backup service that simplifies the process of backing up your AWS resources.</p>
<p>For managed database services like Amazon RDS and Amazon DynamoDB, automated backups are typically enabled by default. You can also create manual snapshots for longer-term retention or to copy your backups to another region for DR purposes.</p>
<p>It's important to regularly test your backups to ensure they can be successfully restored in the event of a disaster. You should also consider implementing a backup retention policy to ensure you have the right balance of short-term and long-term backups to meet your RPO requirements.</p>
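<p>A retention policy can be as simple as a couple of rules. Here's a sketch that keeps daily snapshots for a week and weekly (Monday) snapshots for a month; the windows are made up and should be tuned to your RPO and compliance requirements:</p>

```python
from datetime import date, timedelta

# Illustrative retention policy: keep daily snapshots for 7 days and
# weekly (Monday) snapshots for 28 days. The windows are made up;
# tune them to your RPO and compliance requirements.
def snapshots_to_keep(snapshot_dates, today):
    keep = set()
    for d in snapshot_dates:
        age = (today - d).days
        if age <= 7:
            keep.add(d)  # daily tier
        elif age <= 28 and d.weekday() == 0:
            keep.add(d)  # weekly tier: Mondays only
    return keep

today = date(2024, 3, 15)
daily_snapshots = [today - timedelta(days=n) for n in range(30)]
kept = snapshots_to_keep(daily_snapshots, today)
print(len(kept), "of", len(daily_snapshots), "snapshots retained")  # 11 of 30
```

<p>AWS Backup can express this kind of tiered retention directly through backup plans, so in practice you'd configure it there rather than implement it yourself.</p>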
<h2 id="heading-replication-and-failover-strategies">Replication and Failover Strategies</h2>
<p>In addition to backups, replication and failover are key components of many DR strategies on AWS. By replicating your data and applications across multiple regions, you can ensure that even if an entire region becomes unavailable, your business can continue to operate from another location.</p>
<p>AWS offers several services and features to help you implement cross-region replication and failover. For example, you can use Amazon S3 Cross-Region Replication to automatically replicate objects across S3 buckets in different AWS regions. For databases, you can use Amazon RDS Read Replicas or Amazon Aurora Global Database to create cross-region read replicas that can be quickly promoted to standalone instances in the event of a disaster.</p>
<p>When it comes to failover, you'll need to consider both application-level and DNS-level strategies. At the application level, you can use services like Amazon Route 53 Application Recovery Controller to continuously monitor your application's health and automatically route traffic to healthy resources in the event of a failure.</p>
<p>For DNS failover, Amazon Route 53 offers a variety of routing policies that can help you direct traffic to the appropriate region based on factors like latency, geography, and resource health. By combining these strategies, you can create a robust, automated failover solution that minimizes downtime and ensures your applications remain available even in the face of a regional outage.</p>
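<p>To make the failover behavior concrete, here's a toy model of a primary/secondary failover routing policy. The record roles mirror Route 53's failover routing, but the endpoints are hypothetical and real health checks are considerably more involved:</p>

```python
# Toy model of Route 53 failover routing: resolve to the PRIMARY
# record while its health check passes, otherwise fail over to the
# SECONDARY. Endpoints are hypothetical, and real Route 53 health
# checks are considerably more involved.
def resolve(records):
    primary = next(r for r in records if r["role"] == "PRIMARY")
    secondary = next(r for r in records if r["role"] == "SECONDARY")
    return primary["endpoint"] if primary["healthy"] else secondary["endpoint"]

records = [
    {"role": "PRIMARY",   "endpoint": "alb.us-east-1.example.com", "healthy": False},
    {"role": "SECONDARY", "endpoint": "alb.us-west-2.example.com", "healthy": True},
]
print(resolve(records))  # alb.us-west-2.example.com
```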
<h2 id="heading-disaster-recovery-automation-and-testing">Disaster Recovery Automation and Testing</h2>
<p>Automation is key to implementing an effective DR strategy on AWS. By declaring your infrastructure as code using tools like AWS CloudFormation and Terraform, you can ensure that your DR environment can be quickly and consistently provisioned in the event of a disaster.</p>
<p><strong>Infrastructure as Code (IaC)</strong> not only speeds up the recovery process, but also reduces the risk of human error and ensures that your DR environment is always in a known, consistent state. You can use IaC templates to define everything from your network topology to your application configurations, making it easy to spin up an exact replica of your production environment in a secondary region.</p>
<p>Regular testing is also essential to ensuring the viability of your DR plan. You should schedule periodic DR drills to simulate different failure scenarios and validate that your recovery processes work as expected. These drills can help you identify gaps in your plan and areas for improvement, ensuring that you're always prepared for a real-world disaster.</p>
<h2 id="heading-chaos-engineering-on-aws">Chaos Engineering on AWS</h2>
<p>In addition to traditional DR testing, you may also want to consider implementing chaos engineering practices to proactively identify weaknesses in your systems. Chaos engineering involves intentionally injecting failures into your environment to test its resilience and uncover hidden vulnerabilities.</p>
<p>AWS offers a service called AWS Fault Injection Simulator (FIS) that makes it easy to perform controlled chaos experiments on your AWS workloads. With FIS, you can simulate a variety of failure scenarios, like EC2 instance terminations, API throttling, and network latency, and observe how your applications respond.</p>
<p>By regularly performing chaos experiments, you can build confidence in your systems' ability to withstand failures and identify opportunities for improvement before a real disaster strikes.</p>
<h2 id="heading-monitoring-and-alerting-for-disaster-recovery">Monitoring and Alerting for Disaster Recovery</h2>
<p>Effective monitoring and alerting are critical components of any DR strategy. You need to be able to quickly detect and respond to issues before they escalate into full-blown disasters.</p>
<p>AWS offers a range of monitoring and logging services, like Amazon CloudWatch and AWS X-Ray, that can help you gain visibility into the health and performance of your applications. CloudWatch allows you to collect and track metrics, monitor log files, and set alarms that notify you when thresholds are breached. X-Ray helps you analyze and debug distributed applications, providing insights into how your services are interacting and performing.</p>
<p>In addition to these services, you should also consider implementing a robust alerting strategy using Amazon Simple Notification Service (SNS). With SNS, you can send notifications via email, SMS, or even trigger automated remediation actions when specific events occur or thresholds are crossed.</p>
<p>By combining comprehensive monitoring with proactive alerting, you can ensure that you're always aware of the state of your environment and can quickly respond to any issues that arise.</p>
<h2 id="heading-cost-optimization-for-disaster-recovery">Cost Optimization for Disaster Recovery</h2>
<p>Implementing a comprehensive DR strategy can be expensive, especially if you're maintaining a fully replicated environment in a secondary region. However, there are several strategies you can use to optimize your costs without compromising on your DR objectives.</p>
<p>One approach is to leverage AWS cost-saving features like Reserved Instances and Spot Instances for your DR environment. By purchasing Reserved Instances, you can significantly reduce your EC2 costs compared to On-Demand pricing. Spot Instances let you use spare EC2 capacity at steep discounts, which can be ideal for non-critical DR workloads.</p>
<p>Another strategy is to take a tiered approach to DR, using different strategies for different parts of your application stack based on their criticality and recovery requirements. For example, you might use a multi-site active-active approach for your most critical databases, but a pilot light approach for less critical application tiers.</p>
<p>Continuously monitoring and optimizing your DR costs is also important. You should regularly review your DR environment to identify any underutilized or unnecessary resources, and adjust your strategy accordingly. Tools like AWS Cost Explorer and AWS Budgets can help you track your spending and set alerts when you're approaching your budget limits.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Implementing an effective disaster recovery strategy on AWS requires careful planning, robust architecture, and regular testing and optimization. By leveraging the right mix of AWS services and features, you can create a DR solution that meets your business's unique requirements for availability, recovery time, and data protection.</p>
<p>To recap, the four main DR strategies you can implement on AWS are:</p>
<ul>
<li><p><strong>Backup and restore:</strong> Periodically backing up your data and resources, and restoring them in the event of a disaster.</p>
</li>
<li><p><strong>Pilot light:</strong> Maintaining a minimal version of your environment in a secondary region, ready to scale up when needed.</p>
</li>
<li><p><strong>Warm standby:</strong> Running a scaled-down version of your full environment in a secondary region, with the ability to quickly scale up to handle the full production load.</p>
</li>
<li><p><strong>Multi-site active-active:</strong> Running your full production environment simultaneously in multiple regions, with automatic failover between regions.</p>
</li>
</ul>
<p>Regardless of which strategy you choose, it's critical to regularly test and refine your DR plan to ensure it remains effective as your business evolves. By combining comprehensive monitoring, automated failover, and regular chaos engineering practices, you can build a resilient, highly available application that can weather any storm.</p>
<p>Remember, disaster recovery planning isn't a one-time exercise—it's an ongoing process that requires continuous improvement and optimization. By staying proactive and prepared, you can ensure that your business can continue to operate and thrive, no matter what challenges come your way.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding Amazon S3 Pricing]]></title><description><![CDATA[What is Amazon S3
Amazon Simple Storage Service (S3) is an object storage service by AWS that can store any kind of information. S3 is known for its durability, availability, and scalability, and the fact that all of these features come out of the bo...]]></description><link>https://blog.guilleojeda.com/understanding-amazon-s3-pricing</link><guid isPermaLink="true">https://blog.guilleojeda.com/understanding-amazon-s3-pricing</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[storage]]></category><category><![CDATA[cost-optimisation]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 01 Feb 2024 15:10:41 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1706800217472/570dbf16-bf6d-4a4f-9e8f-1368ff08202b.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-what-is-amazon-s3"><strong>What is Amazon S3</strong></h2>
<p>Amazon Simple Storage Service (S3) is an object storage service by AWS that can store any kind of information. S3 is known for its durability, availability, and scalability, and the fact that all of these features come out of the box makes S3 a go-to solution for a wide range of data storage needs.</p>
<p>In S3 users create 'buckets' – containers for data stored in the AWS cloud. Storing data in buckets serves various use cases, from website hosting to backup and recovery, data archiving, and big data analytics.</p>
<h2 id="heading-s3-storage-classes"><strong>S3 Storage Classes</strong></h2>
<p>Amazon S3 offers several storage classes designed for different use cases:</p>
<ol>
<li><p><strong>S3 Standard</strong>: For frequently accessed data. You're billed per storage and per request.</p>
</li>
<li><p><strong>S3 Standard-IA (Infrequent Access)</strong>: For data that is accessed less frequently but requires rapid access when needed. Lower fee per GB stored than Standard, but a higher fee per request.</p>
</li>
<li><p><strong>S3 One Zone-IA</strong>: Similar to Standard-IA, but data is stored in a single Availability Zone, which makes it cheaper at the cost of lower resilience.</p>
</li>
<li><p><strong>S3 Express One Zone</strong>: High-performance, single-Availability-Zone storage for your most frequently accessed, latency-sensitive data.</p>
</li>
<li><p><strong>S3 Intelligent-Tiering</strong>: Automatically moves data between access tiers (Frequent Access, Infrequent Access, and Archive Instant Access) based on continuously evaluating your access patterns. Ideal for data with unknown or changing access patterns.</p>
</li>
<li><p><strong>S3 Glacier</strong>: For long-term archival. Very low storage cost, but retrieving data can take several hours and is even more expensive than Standard-IA.</p>
</li>
<li><p><strong>S3 Glacier Deep Archive</strong>: Amazon S3's lowest-cost storage class for long-term archiving where data retrieval times of 12 hours or more are acceptable.</p>
</li>
</ol>
<h2 id="heading-s3-pricing-explained">S3 Pricing Explained</h2>
<p>As mentioned above, the different storage classes have different prices. Here are the prices for each S3 storage class:</p>
<h3 id="heading-pricing-for-s3-standard"><strong>Pricing for S3 Standard</strong></h3>
<ul>
<li><p><strong>Storage:</strong> $0.023 per GB for the first 50 TB, $0.022 per GB for the next 450 TB, $0.021 per GB for storage over 500 TB.</p>
</li>
<li><p><strong>Access:</strong> $0.005 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.00 per GB</p>
</li>
<li><p><strong>Other charges:</strong> None</p>
</li>
</ul>
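<p>Since the storage price is tiered, your effective per-GB rate drops as you store more. Here's a small calculator for the S3 Standard storage component alone, using the prices listed above (requests and data transfer excluded):</p>

```python
# Monthly S3 Standard storage cost using the tiered per-GB prices
# listed above. Requests, retrievals, and data transfer are excluded.
TIERS = [
    (50 * 1024, 0.023),     # first 50 TB (expressed in GB)
    (450 * 1024, 0.022),    # next 450 TB
    (float("inf"), 0.021),  # everything over 500 TB
]

def s3_standard_storage_cost(gb):
    cost, remaining = 0.0, gb
    for tier_size, price_per_gb in TIERS:
        billed = min(remaining, tier_size)
        cost += billed * price_per_gb
        remaining -= billed
        if remaining <= 0:
            break
    return round(cost, 2)

print(s3_standard_storage_cost(100 * 1024))  # 100 TB: 2304.0 per month
```

<p>Note how 100 TB is billed as 50 TB at $0.023/GB plus 50 TB at $0.022/GB, not a flat rate across the whole amount.</p>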
<h3 id="heading-pricing-for-s3-standard-ia-infrequent-access"><strong>Pricing for S3 Standard-IA (Infrequent Access)</strong></h3>
<ul>
<li><p><strong>Storage:</strong> $0.0125 per GB</p>
</li>
<li><p><strong>Access:</strong> $0.01 per 1000 PUT, COPY, POST, LIST requests. $0.001 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.01 per GB</p>
</li>
<li><p><strong>Other charges:</strong> $0.01 per Lifecycle Transition request</p>
</li>
</ul>
<h3 id="heading-pricing-for-s3-one-zone-ia"><strong>Pricing for S3 One Zone-IA</strong></h3>
<ul>
<li><p><strong>Storage:</strong> $0.01 per GB</p>
</li>
<li><p><strong>Access:</strong> $0.01 per 1000 PUT, COPY, POST, LIST requests. $0.001 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.01 per GB</p>
</li>
<li><p><strong>Other charges:</strong> $0.01 per Lifecycle Transition request</p>
</li>
</ul>
<h3 id="heading-pricing-for-s3-express-one-zone"><strong>Pricing for S3 Express One Zone</strong></h3>
<ul>
<li><p><strong>Storage:</strong> $0.16 per GB</p>
</li>
<li><p><strong>Access:</strong> $0.0025 per 1000 PUT, COPY, POST, LIST requests. $0.0002 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.00 per GB</p>
</li>
<li><p><strong>Other charges:</strong> None</p>
</li>
</ul>
<h3 id="heading-pricing-for-s3-intelligent-tiering"><strong>Pricing for S3 Intelligent-Tiering</strong></h3>
<ul>
<li><p><strong>Storage:</strong><br />  Frequent Access tier: $0.023 per GB for the first 50 TB, $0.022 per GB for the next 450 TB, $0.021 per GB for storage over 500 TB.<br />  Infrequent Access tier: $0.0125 per GB.<br />  Archive Instant Access tier: $0.004 per GB.</p>
</li>
<li><p><strong>Access:</strong> $0.005 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.00 per GB</p>
</li>
<li><p><strong>Other charges:</strong> monitoring and automation charge of $0.0025 per 1,000 objects monitored per month</p>
</li>
</ul>
<h3 id="heading-pricing-for-s3-glacier"><strong>Pricing for S3 Glacier</strong></h3>
<ul>
<li><p><strong>Storage:</strong><br />  Instant Retrieval: $0.004 per GB<br />  Flexible Retrieval: $0.0036 per GB</p>
</li>
<li><p><strong>Access:</strong><br />  Instant Retrieval: $0.02 per 1000 PUT, COPY, POST, LIST requests. $0.01 per 1000 GET, SELECT requests.<br />  Flexible Retrieval: $0.03 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong><br />  Instant Retrieval: $0.03 per GB<br />  Flexible Retrieval: $0.03 per GB for Expedited, $0.01 per GB for Standard</p>
</li>
<li><p><strong>Other charges:</strong><br />  Instant Retrieval: $0.02 per Lifecycle Transition request<br />  Flexible Retrieval: $0.03 per Lifecycle Transition request</p>
</li>
</ul>
<h3 id="heading-pricing-for-s3-glacier-deep-archive"><strong>Pricing for S3 Glacier Deep Archive</strong></h3>
<ul>
<li><p><strong>Storage:</strong> $0.00099 per GB</p>
</li>
<li><p><strong>Access:</strong> $0.05 per 1000 PUT, COPY, POST, LIST requests. $0.0004 per 1000 GET, SELECT requests.</p>
</li>
<li><p><strong>Data Retrieval:</strong> $0.02 per GB for Standard, $0.0025 per GB for Bulk</p>
</li>
<li><p><strong>Other charges:</strong> $0.05 per Lifecycle Transition request</p>
</li>
</ul>
<h2 id="heading-pricing-examples-for-s3-storage-classes"><strong>Pricing Examples for S3 Storage Classes</strong></h2>
<p>To give you a clearer picture of how S3 pricing works, let's look at a few examples. For each one, assume the following monthly usage:</p>
<ul>
<li><p><strong>Storage:</strong> 100 GB</p>
</li>
<li><p><strong>Access:</strong> 100,000 GET requests, 10,000 PUT requests</p>
</li>
<li><p><strong>Data Retrieval:</strong> 100 GB</p>
</li>
</ul>
<h3 id="heading-example-1-s3-standard-storage">Example 1: S3 Standard Storage</h3>
<ul>
<li><p>Storage Cost: $2.30</p>
</li>
<li><p>Access Cost: $0.09 ($0.05 for the PUT requests + $0.04 for the GET requests)</p>
</li>
<li><p>Data Retrieval Cost: $0</p>
</li>
<li><p><strong>Total Cost: $2.39</strong></p>
</li>
</ul>
<h3 id="heading-example-2-s3-express-one-zone">Example 2: S3 Express One Zone</h3>
<ul>
<li><p>Storage Cost: $16</p>
</li>
<li><p>Access Cost: $0.045 ($0.025 for the PUT requests + $0.02 for the GET requests)</p>
</li>
<li><p>Data Retrieval Cost: $0</p>
</li>
<li><p><strong>Total Cost: $16.05</strong></p>
</li>
</ul>
<h3 id="heading-example-3-s3-standard-ia">Example 3: S3 Standard-IA</h3>
<ul>
<li><p>Storage Cost: $1.25</p>
</li>
<li><p>Access Cost: $0.20 ($0.10 for the PUT requests + $0.10 for the GET requests)</p>
</li>
<li><p>Data Retrieval Cost: $1.00</p>
</li>
<li><p><strong>Total Cost: $2.45</strong></p>
</li>
</ul>
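<p>The arithmetic behind these examples is simple enough to sketch in a few lines of Python. This is just the calculation walked through above, using the per-class prices listed earlier (which may change):</p>

```python
# Rough S3 monthly cost estimator for the example workload:
# 100 GB stored, 100,000 GET requests, 10,000 PUT requests, 100 GB retrieved.
# Prices are the per-class figures listed in this article (representative only).

def s3_monthly_cost(gb_stored, gets, puts, gb_retrieved,
                    storage_per_gb, get_per_1k, put_per_1k, retrieval_per_gb):
    """Return the estimated monthly cost in USD."""
    storage = gb_stored * storage_per_gb
    access = (gets / 1000) * get_per_1k + (puts / 1000) * put_per_1k
    retrieval = gb_retrieved * retrieval_per_gb
    return storage + access + retrieval

# S3 Standard: $0.023/GB storage, $0.0004/1k GET, $0.005/1k PUT, free retrieval
standard = s3_monthly_cost(100, 100_000, 10_000, 100, 0.023, 0.0004, 0.005, 0.0)

# S3 Standard-IA: $0.0125/GB storage, $0.001/1k GET, $0.01/1k PUT, $0.01/GB retrieval
standard_ia = s3_monthly_cost(100, 100_000, 10_000, 100, 0.0125, 0.001, 0.01, 0.01)

print(round(standard, 2))     # ~2.39
print(round(standard_ia, 2))  # ~2.45
```

<p>Note how for this access-heavy-but-small workload, Standard-IA's cheaper storage is almost entirely offset by its retrieval fee.</p>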
<h2 id="heading-aws-s3-free-tier"><strong>AWS S3 Free Tier</strong></h2>
<p>AWS offers a free tier for S3, which includes:</p>
<ul>
<li><p>5 GB of Standard Storage</p>
</li>
<li><p>20,000 GET Requests</p>
</li>
<li><p>2,000 PUT, COPY, POST, or LIST Requests</p>
</li>
</ul>
<p>This free tier is a great way to start experimenting with S3 without incurring immediate costs. Also, for really small workloads like MVPs you initially pay $0, and your costs only grow as you acquire more users.</p>
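<p>As a quick sketch (with hypothetical MVP numbers): only usage above the free allowance is billed, so a small workload nets out to zero:</p>

```python
# Hypothetical MVP usage vs. the S3 free tier (allowances as listed above;
# free-tier terms may change).
FREE_GB, FREE_GETS, FREE_PUTS = 5, 20_000, 2_000

def billable(used, free):
    """Only usage above the free allowance is charged."""
    return max(0, used - free)

# An MVP storing 3 GB with 10,000 GETs and 500 PUTs stays inside the free tier:
gb = billable(3, FREE_GB)           # 0
gets = billable(10_000, FREE_GETS)  # 0
puts = billable(500, FREE_PUTS)     # 0
cost = gb * 0.023 + gets / 1000 * 0.0004 + puts / 1000 * 0.005
print(cost)  # 0.0
```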
<h2 id="heading-tips-for-optimizing-aws-s3-costs"><strong>Tips for Optimizing AWS S3 Costs</strong></h2>
<ol>
<li><p><strong>Understand Your Data Usage</strong>: Analyze your data access patterns to choose the most cost-effective storage class.</p>
</li>
<li><p><strong>Monitor Your S3 Billing</strong>: Regularly check your AWS billing dashboard to track your S3 usage and costs.</p>
</li>
<li><p><strong>Leverage S3 Lifecycle Policies</strong>: Automatically move or archive data to lower-cost storage classes.</p>
</li>
<li><p><strong>Use S3 Analytics</strong>: Monitor and analyze storage access patterns for cost optimization.</p>
</li>
</ol>
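<p>As a sketch of tip 3, here's what a lifecycle configuration could look like. The rule name, prefix, and transition days are hypothetical; the dictionary follows the shape accepted by boto3's <code>put_bucket_lifecycle_configuration</code>:</p>

```python
# A lifecycle configuration that moves objects under a hypothetical "logs/"
# prefix to cheaper storage classes as they age, then deletes them.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 730},  # delete after two years
        }
    ]
}

# With AWS credentials configured, you would apply it like this:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```

<p>Remember that each transition incurs the per-request Lifecycle Transition charge listed above, so transitioning millions of tiny objects can cost more than it saves.</p>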
<p>The goal of this guide was to help you understand AWS S3 pricing. Now you're able to use the best storage classes for your use cases, minimizing cost while maintaining durability and availability.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding AWS Lambda Pricing]]></title><description><![CDATA[In this article, we'll dive deep into the pricing structure of AWS Lambda, breaking down its components, and providing examples to help you understand how costs are calculated. We'll also discuss the AWS Lambda Free Tier and offer practical tips for ...]]></description><link>https://blog.guilleojeda.com/understanding-aws-lambda-pricing</link><guid isPermaLink="true">https://blog.guilleojeda.com/understanding-aws-lambda-pricing</guid><category><![CDATA[AWS]]></category><category><![CDATA[serverless]]></category><category><![CDATA[lambda]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Mon, 29 Jan 2024 21:56:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1706565343777/a9657e2c-d471-41bc-a52e-e76185ba7cb3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, we'll dive deep into the pricing structure of AWS Lambda, breaking down its components, and providing examples to help you understand how costs are calculated. We'll also discuss the AWS Lambda Free Tier and offer practical tips for optimizing your Lambda usage to keep costs manageable.</p>
<h2 id="heading-what-is-aws-lambda"><strong>What is AWS Lambda?</strong></h2>
<p>AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. This service is capable of executing code in various languages and is commonly used for applications such as web application backends, data processing, and real-time file processing.</p>
<h2 id="heading-how-aws-lambda-works"><strong>How AWS Lambda Works</strong></h2>
<ul>
<li><p><strong>Event-Driven Execution:</strong> AWS Lambda is designed to run code in response to triggers such as changes in data within AWS services (like S3 or DynamoDB), requests to an API Gateway, or direct invocations via SDKs.</p>
</li>
<li><p><strong>Automatic Scaling:</strong> The service scales automatically, executing code in parallel and handling each trigger individually.</p>
</li>
<li><p><strong>Flexible Resource Allocation:</strong> Compute power is allocated based on the memory configured for your function, ensuring efficient resource utilization.</p>
</li>
</ul>
<h2 id="heading-key-components-of-aws-lambda"><strong>Key Components of AWS Lambda</strong></h2>
<ul>
<li><p><strong>Lambda Functions:</strong> The core unit where your code resides, along with associated configuration information such as the function name, memory, and timeout settings.</p>
</li>
<li><p><strong>Event Sources:</strong> These are AWS services or custom sources that trigger your Lambda function.</p>
</li>
<li><p><strong>Logs and Monitoring:</strong> Integration with AWS CloudWatch ensures detailed monitoring and logging of your Lambda functions.</p>
</li>
<li><p><strong>Runtime Environments:</strong> Supports multiple programming languages and runtimes.</p>
</li>
</ul>
<h2 id="heading-understanding-aws-lambda-pricing"><strong>Understanding AWS Lambda Pricing</strong></h2>
<p>AWS Lambda's pricing is primarily based on two components: the number of requests your functions process and the compute time they consume. Understanding these components in detail, including their cost, is crucial for effectively managing your AWS Lambda expenses. Here's an expanded breakdown:</p>
<ol>
<li><p><strong>Requests:</strong></p>
<ul>
<li><p><strong>Cost:</strong> AWS Lambda charges $0.20 per 1 million requests.</p>
</li>
<li><p><strong>What It Means:</strong> Every time your function is triggered and executed, it counts as a request.</p>
</li>
</ul>
</li>
<li><p><strong>Compute Time:</strong></p>
<ul>
<li><p><strong>Cost:</strong> Compute time is charged at $0.00001667 for every GB-second used.</p>
</li>
<li><p><strong>Calculation:</strong> The cost is based on the amount of memory allocated to your function and the time it takes to execute.</p>
</li>
<li><p><strong>GB-Second:</strong> A GB-second is a measure that combines memory usage and execution time. If your function uses 512MB of memory and runs for 3 seconds, it consumes 1.5 GB-seconds (0.5 GB * 3 seconds).</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-aws-lambda-free-tier"><strong>AWS Lambda Free Tier</strong></h3>
<p>AWS offers a generous free tier for Lambda:</p>
<ul>
<li><p><strong>1 million free requests per month.</strong></p>
</li>
<li><p><strong>400,000 GB-seconds of compute time per month.</strong></p>
</li>
</ul>
<h3 id="heading-pricing-examples-for-aws-lambda"><strong>Pricing Examples for AWS Lambda</strong></h3>
<p>To illustrate how Lambda pricing works, let's consider a few examples:</p>
<ul>
<li><p><strong>Example 1: Low Frequency, Simple Function</strong></p>
<ul>
<li><p>Requests: 100,000 in a month</p>
</li>
<li><p>Duration: Each request runs for 500ms with 128MB memory allocation.</p>
</li>
<li><p>Total Cost: $0.02 for invocations + $0.1042 for execution time = <strong>$0.1242 / month</strong>.</p>
</li>
</ul>
</li>
<li><p><strong>Example 2: High Frequency, Complex Function</strong></p>
<ul>
<li><p>Requests: 10 million in a month</p>
</li>
<li><p>Duration: Each request runs for 800ms with 256MB memory allocation.</p>
</li>
<li><p>Total Cost: $2.00 for invocations + $33.34 for execution time = <strong>$35.34 / month</strong>.</p>
</li>
</ul>
</li>
</ul>
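<p>The arithmetic from both examples can be sketched in a few lines of Python, using the request and GB-second prices listed above (free tier not included):</p>

```python
# Sketch of the Lambda pricing arithmetic used in the examples above.
REQUEST_PRICE = 0.20 / 1_000_000   # USD per request
GB_SECOND_PRICE = 0.00001667       # USD per GB-second

def lambda_monthly_cost(requests, duration_s, memory_mb):
    """Cost = invocations + (memory in GB * duration * invocations) * rate."""
    gb_seconds = requests * duration_s * (memory_mb / 1024)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

print(round(lambda_monthly_cost(100_000, 0.5, 128), 4))    # Example 1: ~0.1242
print(round(lambda_monthly_cost(10_000_000, 0.8, 256), 2)) # Example 2: ~35.34
```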
<h2 id="heading-tips-for-optimizing-aws-lambda-costs"><strong>Tips for Optimizing AWS Lambda Costs</strong></h2>
<ul>
<li><p><strong>Monitor Function Invocations:</strong> Regularly review your Lambda function metrics through AWS CloudWatch to understand your usage patterns.</p>
</li>
<li><p><strong>Adjust Memory Allocation:</strong> Optimize the memory allocation for your functions to balance performance and cost.</p>
</li>
<li><p><strong>Reduce Execution Time:</strong> Optimize your code to run faster, which directly reduces the compute time cost.</p>
</li>
<li><p><strong>Regularly Review Your Architecture:</strong> As your application evolves, continually reassess whether your use of Lambda aligns with your operational requirements and cost objectives.</p>
</li>
<li><p><strong>Leverage Free Tier:</strong> Make the most out of the AWS Lambda Free Tier, especially for development and testing purposes.</p>
</li>
</ul>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>AWS Lambda offers a flexible, cost-effective solution for running code in response to events. By understanding its pricing model and effectively managing your usage, you can leverage Lambda to build scalable, efficient applications without worrying about infrastructure management.</p>
<p>The goal of this guide is to help you gain a better understanding of AWS Lambda's pricing structure, enabling you to use this fantastic service more efficiently while keeping your AWS costs manageable.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Disaster Recovery and Business Continuity on AWS]]></title><description><![CDATA[Imagine this scenario: You successfully replicated your data to another region, so if your AWS region fails you can still access the data. However, all your servers are still down! You'd like to continue operating even in the event of a disaster.
Dis...]]></description><link>https://blog.guilleojeda.com/disaster-recovery-strategies-aws</link><guid isPermaLink="true">https://blog.guilleojeda.com/disaster-recovery-strategies-aws</guid><category><![CDATA[AWS]]></category><category><![CDATA[Disaster recovery]]></category><category><![CDATA[architecture]]></category><category><![CDATA[cloud architecture]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Tue, 05 Dec 2023 19:41:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1701805223934/1250dccb-4d9b-4525-bb03-aadca69ecad1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine this scenario: You successfully <a target="_blank" href="https://newsletter.simpleaws.dev/p/data-loss-replication-disaster-recovery-aws?utm_source=blog&amp;utm_medium=hashnode">replicated your data to another region</a>, so if your AWS region fails you can still access the data. However, all your servers are still down! You'd like to continue operating even in the event of a disaster.</p>
<h2 id="heading-disaster-recovery-and-business-continuity">Disaster Recovery and Business Continuity</h2>
<p>Disasters are events that cause critical damage to our ability to operate as a business. Consider an earthquake near your datacenter (or the ones you're using in AWS), or a flood in that city (this happened to GCP in Paris in 2023). It follows that Business Continuity is the ability to continue operating (or recovering really fast) in the event of a Disaster. The big question is: <strong>How do we do that?</strong></p>
<p>First, let's understand what recovering looks like, and how much data and time can we lose (yes, we lose both) in the process. There are two objectives that we need to set:</p>
<h2 id="heading-recovery-point-objective-rpo">Recovery Point Objective (RPO)</h2>
<p>The RPO is the maximum time that passes between when the data is written to the primary storage and when it's written to the backup. For periodic backups, RPO is equal to the time between backups. For example, if you take a snapshot of your database every 12 hours, your RPO is 12 hours. For continuous replication, the RPO is equal to the replication delay. For example, if you continuously replicate data from the primary storage to a secondary one, the RPO is the delay in that replication.</p>
<p>Data that hasn't yet been written to the backup won't be available in the event of a disaster, so you want your RPO to be as small as possible. However, minimizing it may require adopting new technologies, which means effort and money. Sometimes it's worth it, sometimes it isn't.</p>
<p>Different data may require different RPOs. Since how easy it is to achieve a low RPO mostly depends on what technologies you use, the RPO for a specific set of data should be considered when selecting where to store it.</p>
<h2 id="heading-recovery-time-objective-rto">Recovery Time Objective (RTO)</h2>
<p>The RTO is the maximum time that can pass from when a failure occurs to when you're operational again. The thing that will have the most impact on RTO is your disaster recovery strategy, which we'll see a bit further down this article. Different technologies will let you reduce the RTO within the same DR strategy, and a technology change may be a good way to reduce RTO without significantly increasing costs.</p>
<h2 id="heading-stages-of-a-disaster-recovery-process">Stages of a Disaster Recovery Process</h2>
<p>These are the four stages that a disaster recovery process goes through, always in this order.</p>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701800077228/4ce6d398-d864-4ac0-85ac-bfd307f32b8c.png" alt="Stages of a Disaster Recovery Process. Source" class="image--center mx-auto" /></a></p>
<h3 id="heading-detect">Detect</h3>
<p>Detection is the phase between when the failure actually occurs and when you start doing something about it. The absolute worst way to learn about a failure is from a customer, so detection should be the first thing you automate. The easiest way to do so is through a health check, which is a sample request sent periodically (e.g. every 30 seconds) to your servers. For example, Application Load Balancer implements this to detect whether targets in a target group are healthy, and can raise a CloudWatch Alarm if it has no healthy targets. You can connect that alarm to SNS to receive an email when that happens, and you'd have automated detection.</p>
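<p>As a sketch of the detection setup described above: the parameters below (names and ARNs are hypothetical placeholders) define a CloudWatch alarm that fires when an ALB target group's <code>HealthyHostCount</code> drops below 1, notifying an SNS topic:</p>

```python
# Automated detection sketch: alarm when an Application Load Balancer
# target group has no healthy targets. Names/ARNs are placeholders.
alarm_params = {
    "AlarmName": "no-healthy-targets",  # hypothetical
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "HealthyHostCount",
    "Dimensions": [
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/abc123"},   # placeholder
        {"Name": "LoadBalancer", "Value": "app/my-alb/def456"},         # placeholder
    ],
    "Statistic": "Minimum",
    "Period": 60,                 # evaluate every 60 seconds
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",  # fires when healthy hosts < 1
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
}

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```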
<h3 id="heading-escalate-and-declare">Escalate and Declare</h3>
<p>This is the phase between when the first person is notified about an event 🔥 and when the alarm 🚨 sounds and everyone is called to battle stations 🚒. It may involve manually verifying something, or it may be entirely automated. In many cases it happens after a few corrective actions have been attempted, such as rolling back a deployment.</p>
<h3 id="heading-restore">Restore</h3>
<p>These are the steps necessary to get a system back online. It may be the old system that we're repairing, or it may be a new copy that we're preparing. It usually involves one or several automated steps, and in some cases manual intervention is needed. It ends when the system is capable of serving production traffic.</p>
<h3 id="heading-fail-over">Fail over</h3>
<p>Once we have a live system capable of serving production traffic, we need to send traffic to it. It sounds trivial, but there are several factors that make it worth being a stage on its own:</p>
<ul>
<li><p>You usually want to do it gradually, to avoid crashing the new system</p>
</li>
<li><p>It may not happen instantly (for example, DNS propagation)</p>
</li>
<li><p>Sometimes this stage is triggered manually</p>
</li>
<li><p>You need to verify that it happened</p>
</li>
<li><p>You continue monitoring afterward</p>
</li>
</ul>
<h2 id="heading-disaster-recovery-strategies-on-aws">Disaster Recovery Strategies on AWS</h2>
<p>The two obvious solutions to disaster recovery are:</p>
<ul>
<li><p><a target="_blank" href="https://newsletter.simpleaws.dev/p/data-loss-replication-disaster-recovery-aws?utm_source=blog&amp;utm_medium=hashnode">Backing up data to another region</a> and re-creating the entire system</p>
</li>
<li><p>Continuously running the system in two regions</p>
</li>
</ul>
<p>Both work, but they're not the only ones. They're actually the two extremes of a spectrum:</p>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701800306748/f19499f1-1be2-4591-bcd2-3d959432870d.png" alt="Disaster Recovery Strategies." class="image--center mx-auto" /></a></p>
<h2 id="heading-backup-and-restore">Backup and Restore</h2>
<p>This is the simplest strategy, and the playbook is:</p>
<ul>
<li><p>Before an event (and continuously):</p>
<ul>
<li>Back up all your data to a separate AWS region, which we call the DR region</li>
</ul>
</li>
<li><p>When an event happens:</p>
<ul>
<li><p>Restore the data stores from the backups</p>
</li>
<li><p>Re-create the infrastructure from scratch</p>
</li>
<li><p>Fail over to the new infrastructure</p>
</li>
</ul>
</li>
</ul>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701800401214/0828cb84-6867-420a-9895-b4d006d37311.png" alt="Backup and Restore." class="image--center mx-auto" /></a></p>
<p>It's by far the cheapest, all you need to pay for are the backups and any other regional resources that you need to operate (e.g. KMS keys used to encrypt data). When a disaster happens, you restore from the backups and re-create everything.</p>
<p>I'm being purposefully broad when I say "re-create everything". I bet your infrastructure took you a long time to create. How fast can you re-create it? Can you even do it in hours or a few days, if you can't look at how you did it the first time? (Remember the original region is down).</p>
<p>The answer, of course, is Infrastructure as Code. It will let you launch a new stack of your infrastructure with little effort and little margin for error. That's why we (and by we I mean anyone who knows what they're doing with cloud infrastructure) insist so much on IaC.</p>
<p>As you're setting up your infrastructure as code, don't forget about supporting resources. For example, if your CI/CD Pipeline runs in a single AWS Region (e.g. you're using CodePipeline), you'll need to be ready to deploy it to the new region along with your production infrastructure. Other common supporting resources are values stored in Secrets Manager or SSM Parameter Store, KMS keys, VPC Endpoints, and CloudWatch Alarms configurations.</p>
<p>You can define all your infrastructure as code, but creating the new copy from your templates usually requires some manual actions. You need to document everything, so you're clear on the correct order for the different actions, what parameters to use, common errors and how to avoid or fix them, etc. If you have all of your infrastructure defined as code, this documentation won't be very long. However, it's still super important.</p>
<p>Finally, test everything. Don't just assume that it'll work, or you'll find out that it doesn't right in the middle of a disaster. Run periodic tests for your Disaster Recovery plan, keep the code and the documentation up to date, and keep yourself and your teams sharp.</p>
<h2 id="heading-pilot-light">Pilot Light</h2>
<p>With Backup and Restore you need to create a lot of things from scratch, which takes time. Even if you cut down all the manual processes, you might spend several hours staring at your terminal or the CloudFormation console waiting for everything to create.</p>
<p>What's more, most of these resources aren't even that expensive! Things like an Auto Scaling Group are free (without counting the EC2 instances), an Elastic Load Balancer costs around $23/month, and VPC and subnets are free. The largest portion of your costs comes from the actual capacity that you use: a large number of EC2 instances, DynamoDB tables with a high capacity, etc. But since most of them are scalable, you could keep all the scaffolding set up with capacity scaled to 0, and scale up in the event of a disaster, right?</p>
<p>That's the idea behind Pilot Light, and this is the basic playbook:</p>
<ul>
<li><p>Before an event (and continuously):</p>
<ul>
<li><p>Continuously replicate all your data to a separate AWS region, which we call the DR region</p>
</li>
<li><p>Set up your infrastructure in the DR region, with capacity at 0</p>
</li>
</ul>
</li>
<li><p>When an event happens:</p>
<ul>
<li><p>Scale up the infrastructure in the DR region</p>
</li>
<li><p>Fail over to the DR region</p>
</li>
</ul>
</li>
</ul>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701800451975/2bcad7b8-c5f7-49c2-b579-01f536945368.png" alt="Pilot Light." class="image--center mx-auto" /></a></p>
<p>One of the things that takes the longest is re-creating data stores from snapshots. For that reason, the prescriptive advice (though not a strict requirement) for Pilot Light is to keep data stores functioning, instead of just keeping the backups and restoring from them in a disaster. It is more expensive, though.</p>
<p>Since scaling can be done automatically, the Restore stage is very easy to automate entirely when using Pilot Light. Also, since the scaling times are much shorter than creating everything from scratch, the impact of automating all manual operations will be much higher, and the resulting RTO much lower than with Backup and Restore.</p>
<h2 id="heading-warm-standby">Warm Standby</h2>
<p>The problem with Pilot Light is that, before it scales, it cannot serve any traffic at all. It works just like the pilot light in a home heater: a small flame that doesn't produce any noticeable heat, but is used to light up the main burner much faster. It's a great strategy, and your users will appreciate that the service interruption is brief, in the order of minutes. But what if you need to serve at least those users nearly immediately?</p>
<p>Warm Standby uses the same idea as Pilot Light, but instead of remaining at 0 capacity, it keeps some capacity available. That way, if there is a disaster you can fail over immediately and start serving a subset of users, while the rest of them wait until your infrastructure in the DR region scales up to meet the entire production demand.</p>
<p>Here's the playbook:</p>
<ul>
<li><p>Before an event (and continuously):</p>
<ul>
<li><p>Continuously replicate all your data to a separate AWS region, which we call the DR region</p>
</li>
<li><p>Set up your infrastructure in the DR region, with capacity at a percentage greater than 0</p>
</li>
</ul>
</li>
<li><p>When an event happens:</p>
<ul>
<li><p>Reroute a portion of the traffic to the DR region</p>
</li>
<li><p>Scale up the infrastructure</p>
</li>
<li><p>Reroute the rest of the traffic</p>
</li>
</ul>
</li>
</ul>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701800543972/5e0ccdb1-cfa6-4fdf-b495-73688bad7efd.png" alt="Warm Standby." class="image--center mx-auto" /></a></p>
<p>What portion of the traffic you reroute depends on how much capacity you maintain "hot" (i.e. available). This lets you do some interesting things, like setting up priorities where traffic for some critical services is rerouted and served immediately, or even for some premium users.</p>
<p>It also presents a challenge: How much infrastructure do you keep hot in your DR region? It could be a fixed number like 2 EC2 instances, or you could dynamically adjust this to 20% of the capacity of the primary region (just don't accidentally set it to 0 when the primary region fails!).</p>
<p>You'd think dynamically communicating to the DR region the current capacity or load of the primary region would be too problematic to bother with. But you should be doing it anyway! When a disaster occurs and you begin scaling up your Pilot Light or Warm Standby infrastructure, you don't want to go through all the hoops of scaling slowly from 0 or low capacity to medium, to high, to maximum. You'd rather go from wherever you are directly to 100% of the capacity you need, be it 30 EC2 instances, 4000 DynamoDB WCUs, or whatever service you're using. To do that, you need to know how much is 100%, or in other words, how much capacity the primary region was running on before it went down. Remember that once it's down you can't go check! To solve that, back up the capacity metrics to the DR region. And once you have them, it's trivial to dynamically adjust your warm standby's capacity.</p>
<p>You can pick any number or percentage that you want, and it's really a business decision, not a technical one. Just keep in mind that if you pick 0 you're actually using a Pilot Light strategy, and if you pick 100% it's a variation of Warm Standby called Hot Standby, where you don't need to wait until infrastructure scales before rerouting all the traffic.</p>
<p>An important aspect worth highlighting is that all three strategies we've seen so far are active/passive: one region (the active one) serves traffic, while the other region (the DR one, which is passive) doesn't receive any. With Backup and Restore and with Pilot Light that should be obvious, since those DR regions aren't able to serve any traffic. Warm Standby is able to serve some traffic, and Hot Standby is able to serve all of it. But even then, the DR region doesn't receive any traffic: it remains passive.</p>
<p>The reason for this is that, if you allow your DR region to write data while you're using the primary region (i.e. while it isn't down), then you need to deal with distributed databases with multiple writers, which is much harder than a single writer and multiple readers. Some managed services handle this very well, but even then there are implications that might affect your application. For example, DynamoDB Global Tables handle writes in any region where the global table is set up, but they resolve conflicts with a last-writer-wins reconciliation strategy, where if two regions receive write operations for the same item at the same time (i.e. within the replication delay window), the one that was written last is the one that sticks. Not a bad solution, but you don't want to overcomplicate things if you don't have to.</p>
<h2 id="heading-multi-site-activeactive">Multi-site Active/Active</h2>
<p><a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1701804821900/e470103f-b509-466d-a108-05e2d3bd5761.png" alt="Multi-site Active/Active." class="image--center mx-auto" /></a></p>
<p>In active/passive configurations, only one region serves traffic. Active/active spreads the traffic across both regions in normal operation conditions (i.e. when there's no disaster). As mentioned in the previous paragraph, this introduces a few problems.</p>
<p>The main problem is the read/write pattern that you'll use. Distributed data stores with multiple write nodes can experience "<strong>contention</strong>", a term that means everything is slowed down because multiple nodes are trying to access the same data, and they need to wait for the others so they don't cause inconsistencies. Contention is one of the reasons why databases are hard.</p>
<p>Another problem is that you're effectively managing two identical but separate infrastructures. Suddenly it's not just a group of instances plus one of everything else (Load Balancer, VPC, etc), but two of everything.</p>
<p>You also need to duplicate any configuration resources, such as Lambda functions that perform automations, SSM documents, SNS topics that generate alerts, etc.</p>
<p>Finally, instead of using the same value for "region" in all your code and configurations, you need to use two values, and use the correct one in every case. That's more complexity, more work, more cognitive load, and more chances of mistakes or slip-ups.</p>
<p>Overall, Multi-Site Active/Active is much harder to manage than Warm Standby, but the advantage is that losing a region feels like losing an AZ when you're running a Highly Available workload: You just lose a bit of capacity, maybe fail over a couple of things, but overall everything keeps running smoothly.</p>
<h2 id="heading-tips-for-effective-disaster-recovery-on-aws">Tips for Effective Disaster Recovery on AWS</h2>
<h3 id="heading-decide-on-a-disaster-recovery-strategy">Decide on a Disaster Recovery Strategy</h3>
<p>You can choose freely between any of the four strategies outlined in this article, or you can even choose not to do anything in the event of a disaster. There are no wrong answers, only tradeoffs.</p>
<p>To pick the best strategy for you:</p>
<ul>
<li><p>Calculate how much money you'd lose per minute of downtime</p>
</li>
<li><p>If there are hits to your brand image, factor them in as well</p>
</li>
<li><p>Estimate how often these outages are likely to occur</p>
</li>
<li><p>Calculate how much each DR strategy would cost</p>
</li>
<li><p>Determine your RTO for each DR strategy</p>
</li>
<li><p>Plug everything into your calculator</p>
</li>
<li><p>Make an informed decision</p>
</li>
</ul>
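<p>Here's what "plugging everything into your calculator" could look like, with completely made-up numbers just to show the shape of the calculation:</p>

```javascript
// Illustrative numbers only: compare each DR strategy's yearly price tag
// plus the expected cost of downtime given its recovery time (RTO).
const lossPerMinute = 100;  // USD lost per minute of downtime
const outagesPerYear = 1;   // estimated region-level outages per year

const strategies = [
  { name: 'Backup and Restore', yearlyCost: 500,   rtoMinutes: 1440 },
  { name: 'Pilot Light',        yearlyCost: 3000,  rtoMinutes: 60 },
  { name: 'Warm Standby',       yearlyCost: 10000, rtoMinutes: 10 },
  { name: 'Active/Active',      yearlyCost: 25000, rtoMinutes: 1 },
];

// Total yearly cost = what you pay AWS + what downtime costs you
const withTotals = strategies.map((s) => ({
  ...s,
  totalYearlyCost: s.yearlyCost + outagesPerYear * s.rtoMinutes * lossPerMinute,
}));

const best = withTotals.reduce((a, b) =>
  b.totalYearlyCost < a.totalYearlyCost ? b : a
);
console.log(best.name, best.totalYearlyCost); // Pilot Light 9000
```

<p>With these particular numbers the cheapest total is Pilot Light; change the inputs to match your business and the answer will change with them.</p>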
<p>"I'd rather be offline for 24 hours once every year and lose $2,000 than increase my annual AWS expenses by $10,000 to reduce that downtime" is a perfectly valid and reasonable decision, but only if you've actually run the numbers and made it consciously.</p>
<h3 id="heading-improve-your-detection">Improve Your Detection</h3>
<p>The longer you wait to declare an outage, the longer your users have to wait until the service is restored. On the other hand, a false positive (where you declare an outage when there isn't one) will cause you to route traffic away from a region that's working, and your users will suffer from an outage that isn't there.</p>
<p>Improving the granularity of your metrics will let you detect anomalies faster. Cross-referencing multiple metrics will reduce your false positives without increasing your detection time. Additionally, consider partial outages, how to differentiate them from total outages, and what the response should be.</p>
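<p>As an illustration of cross-referencing metrics, here's a sketch of a detection rule that only declares an outage when at least two independent signals agree. The metric names and thresholds are invented for the example:</p>

```javascript
// Sketch: declare an outage only when several independent signals agree,
// which cuts false positives from any single noisy metric.
function shouldDeclareOutage({ errorRate, p99LatencyMs, healthyHostCount }) {
  const signals = [
    errorRate > 0.05,       // more than 5% of requests failing
    p99LatencyMs > 2000,    // p99 latency above 2 seconds
    healthyHostCount === 0, // health checks see no healthy hosts
  ];
  // Any two signals together trigger failover; one alone only alerts
  return signals.filter(Boolean).length >= 2;
}

console.log(shouldDeclareOutage({
  errorRate: 0.5, p99LatencyMs: 5000, healthyHostCount: 3,
})); // true: errors and latency agree
console.log(shouldDeclareOutage({
  errorRate: 0.01, p99LatencyMs: 5000, healthyHostCount: 2,
})); // false: latency alone isn't enough
```

<p>In AWS terms this is roughly what a CloudWatch composite alarm does for you, combining several child alarms into a single failover trigger.</p>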
<h3 id="heading-practice-practice-practice">Practice, Practice, Practice</h3>
<p>As with any complex procedure, there's a high probability that something goes wrong. When would you rather find out about it: during regular business hours, when you're relaxed and awake, or at 3 am, with your boss on the phone yelling about production being down and the backups not working?</p>
<p>Disaster Recovery involves software and procedures, and as with any software or procedures, you need to test them both. Run periodic disaster recovery drills, just like fire drills but for the prod environment. As the Google SRE book says: "<a target="_blank" href="https://sre.google/sre-book/managing-incidents/">If you haven’t gamed out your response to potential incidents in advance, principled incident management can go out the window in real-life situations.</a>"</p>
<hr />
<h2 id="heading-recommended-tools-and-resources-for-disaster-recovery"><strong>Recommended Tools and Resources for Disaster Recovery</strong></h2>
<p>One of the best things you can read on Disaster Recovery is the <a target="_blank" href="https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html">AWS whitepaper about Disaster Recovery</a>. In fact, it's where I took all the images from.</p>
<p>Another fantastic read is the chapter about <a target="_blank" href="https://sre.google/sre-book/managing-incidents/">Managing incidents</a> from the <a target="_blank" href="https://sre.google/sre-book/table-of-contents/">Site Reliability Engineering book</a> (by Google). If you haven't read the whole book, you might want to do so, but chapters stand independently so you can read just this one.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[DynamoDB Transactions: An E-Commerce Store with Amazon DynamoDB]]></title><description><![CDATA[We're building an e-commerce app with DynamoDB for the database, pretty similar to the one we built for the DynamoDB Database Design article. No need to go read that issue (though I think it came up great), here's how our database works:

Customers a...]]></description><link>https://blog.guilleojeda.com/dynamodb-transactions</link><guid isPermaLink="true">https://blog.guilleojeda.com/dynamodb-transactions</guid><category><![CDATA[AWS]]></category><category><![CDATA[DynamoDB]]></category><category><![CDATA[Amazon Web Services]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 09 Nov 2023 18:42:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1699554526559/bb762d4f-e9da-4c8b-82b7-9d1f4b5130e7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're building an e-commerce app with DynamoDB for the database, pretty similar to the one we built for the <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-database-design?utm_source=blog&amp;utm_medium=hashnode">DynamoDB Database Design article</a>. No need to go read that issue (though I think it came up great), here's how our database works:</p>
<ul>
<li><p>Customers are stored with a Customer ID starting with c# (for example c#123) as the PK and SK.</p>
</li>
<li><p>Products are stored with a Product ID starting with p# (for example p#123) as the PK and SK, and with an attribute of type number called 'stock', which contains the available stock.</p>
</li>
<li><p>Orders are stored with an Order ID starting with o# (for example o#123) for the PK and the Product ID as the SK.</p>
</li>
<li><p>When an item is purchased, we need to check that the Product is in stock, decrease the stock by 1 and create a new Order.</p>
</li>
<li><p>Payment, shipping and any other concerns are magically handled by the power of "that's out of scope for this issue" and "it's left as an exercise for the reader".</p>
</li>
</ul>
<p>There are more attributes in all entities, but let's ignore them.</p>
<p>We're going to use the following AWS services:</p>
<ul>
<li><strong>DynamoDB:</strong> A NoSQL database that supports <a target="_blank" href="https://en.wikipedia.org/wiki/ACID">ACID transactions</a>, just like any SQL-based database.</li>
</ul>
<h2 id="heading-before-implementing-dynamodb-transactions"><strong>Before Implementing DynamoDB Transactions</strong></h2>
<p>We need to read the value of stock and update it atomically. <strong>Atomicity</strong> is a property of a set of operations, where that set of operations can't be divided: it's either applied in full, or not at all. If we just ran the <code>GetItem</code> and <code>PutItem</code> actions separately, we could have a case where two customers are buying the last item in stock for that product, our scalable backend processes both requests simultaneously, and the events go down like this:</p>
<ol>
<li><p>Customer123 clicks Buy</p>
</li>
<li><p>Customer456 clicks Buy</p>
</li>
<li><p>Instance1 receives request from Customer123</p>
</li>
<li><p>Instance1 executes GetItem for Product111, receives a stock value of 1, continues with the purchase</p>
</li>
<li><p>Instance2 receives request from Customer456</p>
</li>
<li><p>Instance2 executes GetItem for Product111, receives a stock value of 1, continues with the purchase</p>
</li>
<li><p>Instance1 executes PutItem for Product111, sets stock to 0</p>
</li>
<li><p>Instance2 executes PutItem for Product111, sets stock to 0</p>
</li>
<li><p>Instance1 executes PutItem for Order0046</p>
</li>
<li><p>Instance1 receives a success, returns a success to the frontend.</p>
</li>
<li><p>Instance2 executes PutItem for Order0047</p>
</li>
<li><p>Instance2 receives a success, returns a success to the frontend.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1699554086075/9b7b293a-7113-4b2a-8c9c-c4b74278483e.png" alt="The process without transactions" class="image--center mx-auto" /></p>
<p>The data doesn't look corrupted, right? Stock for Product111 is 0 (it could end up being -1, depends on how you write the code), both orders are created, you received the money for both orders (out of scope for this issue), and both customers are happily awaiting their product. You go to the warehouse to dispatch both products, and find that you only have one in stock. Where did things go wrong?</p>
<h2 id="heading-steps-to-implement-dynamodb-transactions"><strong>Steps to Implement DynamoDB Transactions</strong></h2>
<p>The problem is that steps 4 and 7 were executed separately, and Instance2 got to read the stock of Product111 (step 6) in between them, and made the decision to continue with the purchase based on a value that hadn't been updated yet, but should have. Steps 4 and 7 need to happen atomically, in a transaction.</p>
<h3 id="heading-install-the-aws-sdk">Install the AWS SDK</h3>
<p>First, install the packages from the <a target="_blank" href="https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/">AWS SDK V3 for JavaScript</a>:</p>
<pre><code class="lang-bash">npm install @aws-sdk/client-dynamodb @aws-sdk/lib-dynamodb
</code></pre>
<h3 id="heading-update-the-code-to-use-transactions">Update the Code to Use Transactions</h3>
<p>This is the code in Node.js to run the steps as a transaction (you should add this to the code that imaginary you already has for the service):</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> { DynamoDBClient } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@aws-sdk/client-dynamodb'</span>);
<span class="hljs-keyword">const</span> { DynamoDBDocumentClient, TransactWriteCommand } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'@aws-sdk/lib-dynamodb'</span>);

<span class="hljs-keyword">const</span> dynamoDBClient = <span class="hljs-keyword">new</span> DynamoDBClient({ <span class="hljs-attr">region</span>: <span class="hljs-string">'us-east-1'</span> });
<span class="hljs-keyword">const</span> dynamodb = DynamoDBDocumentClient.from(dynamoDBClient);

<span class="hljs-comment">//The code imaginary you already has</span>

<span class="hljs-comment">//This is just some filler code to make this example valid. Imaginary you should already have this solved</span>
<span class="hljs-keyword">const</span> newOrderId = <span class="hljs-string">'o#123'</span> <span class="hljs-comment">//Must be unique</span>
<span class="hljs-keyword">const</span> productId = <span class="hljs-string">'p#111'</span> <span class="hljs-comment">//Comes in the request</span>
<span class="hljs-keyword">const</span> customerId = <span class="hljs-string">'c#123'</span> <span class="hljs-comment">//Comes in the request</span>

<span class="hljs-keyword">const</span> transactItems = {
  <span class="hljs-attr">TransactItems</span>: [
    {
      <span class="hljs-comment">//A transaction can't include two actions on the same item,</span>
      <span class="hljs-comment">//so the stock check is a condition on the update itself</span>
      <span class="hljs-attr">Update</span>: {
        <span class="hljs-attr">TableName</span>: <span class="hljs-string">'SimpleAwsEcommerce'</span>,
        <span class="hljs-attr">Key</span>: { <span class="hljs-attr">id</span>: productId },
        <span class="hljs-attr">ConditionExpression</span>: <span class="hljs-string">'stock &gt; :zero'</span>,
        <span class="hljs-attr">UpdateExpression</span>: <span class="hljs-string">'SET stock = stock - :one'</span>,
        <span class="hljs-attr">ExpressionAttributeValues</span>: {
          <span class="hljs-string">':zero'</span>: <span class="hljs-number">0</span>,
          <span class="hljs-string">':one'</span>: <span class="hljs-number">1</span>
        }
      }
    },
    {
      <span class="hljs-attr">Put</span>: {
        <span class="hljs-attr">TableName</span>: <span class="hljs-string">'SimpleAwsEcommerce'</span>,
        <span class="hljs-attr">Item</span>: {
          <span class="hljs-attr">id</span>: newOrderId,
          <span class="hljs-attr">customerId</span>: customerId,
          <span class="hljs-attr">productId</span>: productId
        }
      }
    }
  ]
};

<span class="hljs-keyword">const</span> executeTransaction = <span class="hljs-keyword">async</span> () =&gt; {
  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> data = <span class="hljs-keyword">await</span> dynamodb.send(<span class="hljs-keyword">new</span> TransactWriteCommand(transactItems));
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Transaction succeeded:'</span>, <span class="hljs-built_in">JSON</span>.stringify(data, <span class="hljs-literal">null</span>, <span class="hljs-number">2</span>));
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Transaction failed:'</span>, <span class="hljs-built_in">JSON</span>.stringify(error, <span class="hljs-literal">null</span>, <span class="hljs-number">2</span>));
  }
};

executeTransaction();

<span class="hljs-comment">//Rest of the code imaginary you already has</span>
</code></pre>
<h2 id="heading-after-implementing-dynamodb-transactions"><strong>After Implementing DynamoDB Transactions</strong></h2>
<p>Here's how things may happen with these changes, if both customers click Buy at the same time:</p>
<ol>
<li><p>Customer123 clicks Buy</p>
</li>
<li><p>Customer456 clicks Buy</p>
</li>
<li><p>Instance1 receives request from Customer123</p>
</li>
<li><p>Instance2 receives request from Customer456</p>
</li>
<li><p>Instance1 executes a transaction:</p>
<ol>
<li><p>Conditional update for Product111: stock is greater than 0 (actual value is 1), so stock is set to 0</p>
</li>
<li><p>PutItem for Order0046</p>
</li>
<li><p>Transaction succeeds, it's committed.</p>
</li>
</ol>
</li>
<li><p>Instance1 receives a success, returns a success to the frontend.</p>
</li>
<li><p>Instance2 executes a transaction:</p>
<ol>
<li><p>Conditional update for Product111: stock is <strong>not</strong> greater than 0 (actual value is 0), so the condition fails</p>
</li>
<li><p>Transaction fails, it's aborted.</p>
</li>
</ol>
</li>
<li><p>Instance2 receives an error, returns an error to the frontend.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1699554167872/faa3c3da-1f16-44a0-86c8-4c7966473d6a.png" alt="The process with transactions" class="image--center mx-auto" /></p>
<h2 id="heading-overview-of-dynamodb"><strong>Overview of DynamoDB</strong></h2>
<p>DynamoDB is so scalable because it's actually a distributed database: you're presented with a single resource called a Table, but behind the scenes there are multiple nodes that store the data and process queries. Data is partitioned using the Partition Key, which is part of the Primary Key (the other part is the Sort Key).</p>
<p>DynamoDB is highly available (meaning it can continue working if an Availability Zone goes down) because each partition is stored in 3 nodes, each in a separate Availability Zone. This is the "secret" behind DynamoDB's availability and durability. You don't need to know this to use DynamoDB effectively, but now that you do, you see that transactions in DynamoDB are actually distributed transactions.</p>
<h2 id="heading-how-transactions-work-in-dynamodb"><strong>How Transactions Work in DynamoDB</strong></h2>
<h3 id="heading-two-phase-commit"><strong>Two-Phase Commit</strong></h3>
<p>DynamoDB implements distributed transactions using Two-Phase Commit (2PC). The strategy is pretty simple: all nodes are first asked to evaluate the transaction and determine whether they're capable of executing it, and only after every node reports that it can successfully execute its part does the central controller send the order to commit the transaction, at which point each node does the actual writing that affects the actual data. For this reason, <strong>all operations done in a DynamoDB transaction consume twice as much capacity</strong>.</p>
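<p>To make the protocol concrete, here's a toy simulation of Two-Phase Commit: every participant must vote "yes" in the prepare phase before the coordinator orders the commit. This illustrates the general protocol, not DynamoDB's actual internals:</p>

```javascript
// Toy 2PC: prepare phase asks every node whether it could apply the
// write; only if all vote yes does the commit phase actually write.
function twoPhaseCommit(nodes, operation) {
  // Phase 1: prepare — each node checks whether it could apply the write
  const votes = nodes.map((node) => node.canApply(operation));
  if (!votes.every(Boolean)) {
    return 'aborted'; // any "no" vote aborts the whole transaction
  }
  // Phase 2: commit — only now is the data actually written
  nodes.forEach((node) => node.apply(operation));
  return 'committed';
}

// A "node" here is just a replica holding a stock counter
const makeNode = (stock) => ({
  stock,
  canApply(op) { return this.stock + op >= 0; },
  apply(op) { this.stock += op; },
});

const replicas = [makeNode(1), makeNode(1), makeNode(1)];
console.log(twoPhaseCommit(replicas, -1)); // committed
console.log(twoPhaseCommit(replicas, -1)); // aborted (stock already 0)
```

<p>The two round trips (prepare, then commit) are also an intuition for why every operation in a transaction costs double capacity.</p>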
<h3 id="heading-itempotency"><strong>Idempotency</strong></h3>
<p>DynamoDB transactions are idempotent. They're identified by a parameter called ClientRequestToken, which the DynamoDB SDK includes automatically in any transaction. If you call the TransactReadItems or TransactWriteItems API without the SDK, you'll need to include it yourself to get transaction idempotency.</p>
<h3 id="heading-isolation"><strong>Isolation</strong></h3>
<p>Transaction isolation (the I in ACID) is achieved through optimistic concurrency control. This means that multiple DynamoDB transactions can be executed concurrently, but if DynamoDB detects a conflict, one of the transactions will be rolled back and the caller will need to retry the transaction.</p>
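<p>A minimal sketch of that retry logic, using exponential backoff with full jitter (the helper names are made up):</p>

```javascript
// Sketch: exponential backoff with "full jitter" for retrying a
// transaction that was cancelled because of a conflict.
function backoffDelayMs(attempt, baseMs = 50, capMs = 5000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp); // full jitter: anywhere in [0, exp)
}

async function withRetries(operation, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      const delay = backoffDelayMs(attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

<p>You'd then wrap the transaction call with something like <code>withRetries(() =&gt; dynamodb.send(new TransactWriteCommand(transactItems)))</code>, ideally retrying only on conflict errors such as <code>TransactionCanceledException</code>.</p>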
<h3 id="heading-transactions-on-multiple-tables"><strong>Transactions on Multiple Tables</strong></h3>
<p>DynamoDB Transactions can span multiple tables, but they can't be performed on indexes. Also, propagation of the data to Global Secondary Indexes and DynamoDB Streams always happens after the transaction, and isn't part of it.</p>
<h3 id="heading-pricing-for-dynamodb-transactions"><strong>Pricing for DynamoDB Transactions</strong></h3>
<p>There is no direct cost for using transactions. However, all operations performed on DynamoDB as part of a transaction consume twice the capacity units they regularly would. Write and delete operations consume write capacity, and any condition expression consumes read capacity. This extra capacity is only consumed for the operations on the table; the read and write capacity consumed for updating secondary indexes and for <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-streams-reacting-to-changes?utm_source=blog&amp;utm_medium=hashnode">DynamoDB Streams</a> isn't affected. When working with <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-scaling-provisioned-on-demand?utm_source=blog&amp;utm_medium=hashnode">DynamoDB On-Demand Mode</a>, Request Units are doubled, just like Capacity Units.</p>
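<p>As a back-of-the-envelope example of what "double the capacity" means for writes, assuming the standard rule of 1 WCU per KB of item size, rounded up:</p>

```javascript
// A standard write consumes 1 WCU per KB of item size (rounded up);
// the same write inside a transaction consumes twice that.
function writeCapacityUnits(itemSizeBytes, { transactional = false } = {}) {
  const standard = Math.ceil(itemSizeBytes / 1024);
  return transactional ? standard * 2 : standard;
}

console.log(writeCapacityUnits(512));                           // 1 WCU
console.log(writeCapacityUnits(512, { transactional: true }));  // 2 WCUs
console.log(writeCapacityUnits(3500, { transactional: true })); // 8 WCUs
```
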
<hr />
<h2 id="heading-dynamodb-vs-sql-databases"><strong>DynamoDB vs SQL databases</strong></h2>
<p>The whole point of this article and the others I've written about DynamoDB is that SQL databases shouldn't be your default. I've shown you that DynamoDB can handle an e-commerce store just fine, including ACID-compliant transactions. That's because for an e-commerce, and in fact for 95% of the applications we write, we can predict data access patterns. When we can do that, we can optimize the structure and relations of a NoSQL database like DynamoDB and have it perform much better than a relational database for those known and predicted access patterns.</p>
<p>The use case for SQL databases is unknown access patterns! And those come from either giving the user a lot of freedom (which might be a mistake, or might be a core feature of your application), or from doing analytics and ad-hoc queries. In those cases, definitely go for relational databases. Otherwise, see if you can solve it with a NoSQL database like DynamoDB. It'll be much cheaper, and it will scale much better. I'll make one concession though: If all your dev team knows is SQL databases, just go with that unless you have a really strong reason not to.</p>
<h3 id="heading-using-sql-in-dynamodb"><strong>Using SQL in DynamoDB</strong></h3>
<p>This is gonna blow your mind: You can actually query DynamoDB using SQL! Or more specifically, a SQL-compatible language called <a target="_blank" href="https://aws.amazon.com/blogs/database/a-partiql-deep-dive-understanding-the-language-bringing-sql-queries-to-aws-non-relational-database-services/">PartiQL</a>. Amazon developed PartiQL as an internal tool, and it was made generally available by AWS. It can be used on SQL databases, semi-structured data, or NoSQL databases, so long as the engine supports it.</p>
<p>With PartiQL you could <strong>theoretically</strong> change your Postgres database for a DynamoDB database without rewriting any queries. In reality, you need to consider all of these points:</p>
<ul>
<li><p>Why are you even changing? It's not going to be easy.</p>
</li>
<li><p>How are you going to migrate all the data?</p>
</li>
<li><p>You need to make sure no queries are triggering a Scan in DynamoDB, because we know those are slow and very expensive. You can <a target="_blank" href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-iam.html#access-policy-ql-iam-example6">use an IAM policy to deny full-table Scans</a>.</p>
</li>
<li><p>Again, why are you even changing?</p>
</li>
</ul>
<p>I'm not saying there isn't a good reason to change, but I'm going to assume it's not worth the effort, and you'll have to prove me otherwise. Remember that replicating the data somewhere else for a different access pattern is a perfectly valid strategy (in fact, that's exactly how DynamoDB GSIs work). We'll discuss it further in a future issue.</p>
<h2 id="heading-are-there-any-limitations-to-using-transactions-in-dynamodb"><strong>Are there any limitations to using transactions in DynamoDB?</strong></h2>
<p>Yes, there are some limitations to using transactions in DynamoDB. A transaction is limited to a maximum of 100 unique items and a total of 4 MB of data, no two actions in it can target the same item, and all the items must belong to tables in the same AWS account and region. And as mentioned above, transactions can't target indexes.</p>
<h2 id="heading-best-practices">Best Practices</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><p><strong>Monitor transaction latencies:</strong> Monitor latencies of your DynamoDB transactions to identify performance bottlenecks and address them. Use CloudWatch metrics and AWS X-Ray to collect and analyze performance data.</p>
</li>
<li><p><strong>Error handling and retries:</strong> Handle errors and implement exponential backoff with jitter for retries in case of transaction conflicts.</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><strong>Fine-grained access control:</strong> Assign an IAM Role to your backend with an IAM Policy that only allows the specific actions that it needs to perform, only on the specific tables that it needs to access. You can even do this per record and per attribute. This is least privilege.</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><strong>Consider a Global Table:</strong> You can make your DynamoDB table multi-region using a Global Table. Making the rest of your app multi-region is more complicated than that, but at least the DynamoDB part is easy.</li>
</ul>
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><strong>Optimize provisioned throughput:</strong> If you're using Provisioned Mode, you'll need to set your Read and Write Capacity Units appropriately. You can also set them to auto-scale, but it's not instantaneous. Remember <a target="_blank" href="https://newsletter.simpleaws.dev/p/sqs-throttle-database-writes-dynamodb?utm_source=blog&amp;utm_medium=hashnode">the article on using SQS to throttle writes</a>.</li>
</ul>
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><strong>Optimize transaction sizes:</strong> Minimize the number of items and attributes involved in a transaction to reduce consumed read and write capacity units. Remember that transactions consume twice as much capacity, so optimizing the operations in a transaction is doubly important.</li>
</ul>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Data Loss, Replication and Disaster Recovery on AWS]]></title><description><![CDATA[Note: This content was originally published at the Simple AWS newsletter.
Imagine this scenario: You have some data that's absolutely critical to your business. If you lose it, it's a disaster! How do you recover?
Data Loss Scenarios
First, we need t...]]></description><link>https://blog.guilleojeda.com/data-loss-replication-and-disaster-recovery-on-aws</link><guid isPermaLink="true">https://blog.guilleojeda.com/data-loss-replication-and-disaster-recovery-on-aws</guid><category><![CDATA[AWS]]></category><category><![CDATA[data]]></category><category><![CDATA[Devops]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Tue, 31 Oct 2023 18:37:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698777328204/0580cad7-43a9-4b33-a127-95a760738a3e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: This content was originally published at the</em> <a target="_blank" href="https://www.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode"><strong><em>Simple AWS newsletter</em></strong></a><em>.</em></p>
<p>Imagine this scenario: You have some data that's absolutely critical to your business. If you lose it, it's a disaster! How do you recover?</p>
<h2 id="heading-data-loss-scenarios">Data Loss Scenarios</h2>
<p>First, we need to define what we mean when we say "lose it". How do you lose data? Let's consider some scenarios, and what we can do in each case.</p>
<h3 id="heading-data-loss-because-of-hardware-failure">Data Loss Because of Hardware Failure</h3>
<p>As I'm sure you know, computer hardware is sensitive equipment, which will inevitably fail at some point. When working with AWS we don't really see or manage the hardware, but we're still vulnerable to hardware failures. That's why AWS services publicly advertise their durability: <a target="_blank" href="https://newsletter.simpleaws.dev/p/ebs-basics-best-practices?utm_source=blog&amp;utm_medium=hashnode">EBS</a> offers 99.9% or 99.999% depending on volume type, while S3 offers 99.999999999% (referred to as 11 nines, or 11 9s). 11 nines of durability means that if you store 10 million objects, you can expect to lose one object every 10,000 years.</p>
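<p>The arithmetic behind that claim is straightforward:</p>

```javascript
// The 11-nines math: an annual object loss rate of
// 1 - 0.99999999999 = 1e-11, applied to 10 million stored objects.
const annualLossRate = 1e-11;    // 11 nines of durability
const objectsStored = 10_000_000;

const expectedLossesPerYear = objectsStored * annualLossRate;
const yearsPerLostObject = 1 / expectedLossesPerYear;

console.log(expectedLossesPerYear); // ~0.0001 objects lost per year
console.log(yearsPerLostObject);    // ~10,000 years per lost object
```
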
<h4 id="heading-how-to-prevent-it">How to prevent it</h4>
<p>S3 should be more than enough to protect you from bit rot or simultaneous hardware failures. If you're not storing your critical data in S3, at least start backing it up there. You can create snapshots of EBS volumes or RDS instances, which are stored in S3.</p>
<h3 id="heading-data-loss-because-of-human-error">Data Loss Because of Human Error</h3>
<p>This includes scenarios where you or anyone on your team (with legitimate access and good intentions) accidentally deletes or overwrites data. In S3, it can be deleting an object or an entire bucket. In EBS, EFS or anything mounted in the file system, it can be a typo when running a command like <code>rm -rf</code>. In a database, it's more often than not a query run with the wrong parameters, such as a SQL <code>UPDATE</code> with no <code>WHERE</code> clause.</p>
<p>Automated processes are also included in this scenario: when they delete or overwrite data they shouldn't have touched, the root cause is always a human error in programming or configuring them.</p>
<h4 id="heading-how-to-prevent-it-1">How to prevent it</h4>
<p>The first step is to understand that anyone can make a mistake, no matter how skilled or careful. Training and clearly defined procedures will reduce the probability of mistakes, but they'll never take it to 0. Limiting access and implementing guardrails and additional confirmations will reduce it further.</p>
<p>Overall, the best way to protect from this is to have backups of the data that can't be overwritten or deleted through the same means. For example, you can copy database snapshots to another AWS account.</p>
<h3 id="heading-data-loss-because-of-hacks-or-ransomware">Data Loss Because of Hacks or Ransomware</h3>
<p>In this case, you're dealing with a malicious actor intentionally trying to delete the data, or make it inaccessible to you. The most common scenarios are Ransomware attacks, where an attacker either steals or encrypts the data, and asks you for money in exchange for granting you access to it.</p>
<p>Attackers gain the ability to affect your data in AWS through credentials. This can be your own username and password stolen, the IAM role of an EC2 instance that the attacker gained access to, or any other way that they can gain AWS credentials to your account.</p>
<h4 id="heading-how-to-prevent-it-2">How to prevent it</h4>
<p>Basic security measures such as <a target="_blank" href="https://newsletter.simpleaws.dev/p/7-must-do-security-best-practices-for-your-aws-account?utm_source=blog&amp;utm_medium=hashnode">security best practices for AWS accounts</a>, minimum privileges, and application security go a long way. Requiring Multi-Factor Authentication for certain AWS operations, such as deleting objects in S3, is another good measure.</p>
<p>What can happen is that an attacker gains some form of access, often not enough to compromise the entire AWS account or access your data, and then performs lateral movements and privilege escalations to gradually gain more access. A really simple example would be an EC2 instance with no access to S3 but with an IAM role whose policy grants <code>iam:*</code> permissions. An attacker with access to that instance can't immediately encrypt an S3 bucket, but they can use the instance's credentials to create a new IAM user for themselves, one that does have access to S3.</p>
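<p>To make that escalation path concrete, here's a small, hypothetical helper (not an AWS API, just a sketch) that scans a policy document for wildcard <code>iam:*</code> grants, the kind of permission that enables this attack:</p>

```python
def has_iam_wildcard(policy: dict) -> bool:
    """Return True if any Allow statement grants iam:* (or *) actions."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(a in ("*", "iam:*") for a in actions):
            return True
    return False

risky = {"Version": "2012-10-17",
         "Statement": [{"Effect": "Allow", "Action": "iam:*", "Resource": "*"}]}
safe = {"Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": "*"}]}
```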
<p>A way to protect against that is to store backups where they can't be tampered with from the environment that holds the original data. A good example, which I'll show you how to configure, is to set up a separate AWS account (let's call it Account B) and replicate there all objects in an S3 bucket. So long as there's no way to delete those objects in Account B from Account A, there's no path for an attacker with access to Account A to delete or encrypt the data in Account B. This doesn't completely eliminate the risk! But it makes it much less likely to occur, since an attacker would need to succeed at two separate attacks, one to gain access to Account A and one to Account B. This, coupled with ways to <a target="_blank" href="https://newsletter.simpleaws.dev/p/cloudtrail-cloudwatch-logs-login-detection-alert?utm_source=blog&amp;utm_medium=hashnode">detect failed sign in attempts to AWS</a>, significantly improves your security.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-disaster-recovery-metrics-rpo-and-rto">Disaster Recovery Metrics: RPO and RTO</h2>
<p>Before we move on to the solution, there are two things I want to discuss briefly, which will determine how often you perform your backups and what backup strategies and/or technologies you can use. They're two of the most common Disaster Recovery metrics:</p>
<h3 id="heading-recovery-point-objective-rpo">Recovery Point Objective (RPO)</h3>
<p>This is a measure of how much time can pass from when data is written to when it's backed up. It's measured in time units, usually hours or minutes. Any data written less than "RPO" ago isn't guaranteed to be backed up, and in the event of a disaster it might not be found in the backups, and would be lost.</p>
<p>For example, an RPO of 12 hours means any data written less than 12 hours ago isn't guaranteed to be backed up. A typical way to implement an RPO of 12 hours is to create backups twice a day, for example at 00:00 and 12:00. That doesn't mean everything newer than 12 hours is lost, though: if a disaster occurs at 13:00, the only data lost would be from the previous hour, the time elapsed since the last backup.</p>
<p>The reason RPO isn't exactly equivalent to the time between backups is that backups can also be implemented in a continuous or nearly continuous way. For our every-12-hours example we're assuming the backup is instantaneous, which isn't exactly true but is a good approximation. If we run backups every minute, then the 30 seconds that the backup process might take is no longer a number we can ignore, and our RPO would be 1 minute and 30 seconds.</p>
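<p>As a quick sketch of that arithmetic, the effective RPO is the backup interval plus the duration of the backup process itself:</p>

```python
def effective_rpo_seconds(interval_seconds: float, backup_duration_seconds: float) -> float:
    """Worst-case age of un-backed-up data at the moment a disaster hits."""
    return interval_seconds + backup_duration_seconds

# Backups every 12 hours with a near-instant backup process: RPO = 12 hours.
twelve_hours = effective_rpo_seconds(12 * 3600, 0)
# Backups every minute that take 30 seconds to complete: RPO = 1 min 30 s.
one_minute_thirty = effective_rpo_seconds(60, 30)
```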
<p>The duration of the backup process is sometimes called replication delay. For example, Aurora has a replication delay of 1 minute between the primary instance and its replicas. That gives a Disaster Recovery strategy based on an Aurora replica an RPO of 1 minute, since the replication process starts nearly immediately after data is changed.</p>
<p>Different data can have different RPOs. For example, data stored in S3 can have an RPO of 1 hour, and data on RDS an RPO of 6 hours. That's perfectly normal, and you should consider how bad it would be to lose all new data from the last X time to decide whether you're good with your numbers or need to improve them. RPO can be heavily constrained by the technology used to store the data and the technologies and techniques used to back it up. For example, an RPO of 1 hour is normal for S3 because the easiest backup method to set up for S3 is Cross-Region Replication, a feature already built into S3 with an RPO of 1 hour (15 minutes if you enable Replication Time Control).</p>
<h3 id="heading-recovery-time-objective-rto">Recovery Time Objective (RTO)</h3>
<p>This is the target time between when you detect that a failure is happening and when you have the backup live and serving traffic at the same quality of service as if there was no failure. It's measured in time units, usually minutes or hours.</p>
<p>For example, if you're backing up your RDS database with RDS Snapshots, your RTO is going to be more or less the time it takes you to create a new RDS instance from the snapshot (usually between 30 minutes and 2 hours, depending on the size of the snapshot).</p>
<p>More accurately, your RTO would be the time between when you detect a failure in the original RDS instance and when the new RDS instance created from the snapshot is serving traffic. If you've automated this process, 99% of the recovery time is going to be creating the RDS instance. If you haven't, you need to take into account the time it takes you to:</p>
<ol>
<li><p>Receive an alert</p>
</li>
<li><p>View that alert</p>
</li>
<li><p>Log in to the system</p>
</li>
<li><p>Find the correct snapshot</p>
</li>
<li><p>Figure out the correct configurations for the new RDS instance</p>
</li>
<li><p>Launch the creation of a new RDS instance</p>
</li>
<li><p>Switch over traffic to the new RDS instance</p>
</li>
</ol>
<p>Every part of that which you can automate will reduce your RTO, and also reduce the chance that you make a mistake and, for example, restore the incorrect snapshot or create the new RDS instance in the wrong VPC. Automate whatever you can (you can automate all of that). Start with the longest steps; I wrote them like that on purpose.</p>
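<p>As an example of automating one of those steps, here's a sketch of "find the correct snapshot": a pure function that picks the newest available snapshot from a list shaped like the response of boto3's <code>describe_db_snapshots</code> (the snapshot names below are made up for illustration):</p>

```python
from datetime import datetime, timezone

def latest_snapshot(snapshots: list[dict]) -> dict:
    """Return the completed snapshot with the most recent SnapshotCreateTime."""
    completed = [s for s in snapshots if s.get("Status") == "available"]
    return max(completed, key=lambda s: s["SnapshotCreateTime"])

snaps = [
    {"DBSnapshotIdentifier": "daily-2023-10-30", "Status": "available",
     "SnapshotCreateTime": datetime(2023, 10, 30, tzinfo=timezone.utc)},
    {"DBSnapshotIdentifier": "daily-2023-10-31", "Status": "available",
     "SnapshotCreateTime": datetime(2023, 10, 31, tzinfo=timezone.utc)},
    # Still being created, so not a valid restore candidate yet.
    {"DBSnapshotIdentifier": "in-progress", "Status": "creating",
     "SnapshotCreateTime": datetime(2023, 10, 31, 12, tzinfo=timezone.utc)},
]
newest = latest_snapshot(snaps)
```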
<h2 id="heading-disaster-recovery-in-aws">Disaster Recovery in AWS</h2>
<p>In AWS jargon, Disaster Recovery means being able to get the entire system back online in the case of an AWS Region failing. For that, you'll need to have the data available in that other region, as well as any additional resources required to access the data, such as the KMS key used to encrypt it (remember that KMS keys are regional by default, and you can create multi-region keys).</p>
<p>Getting the entire system back online also requires you to stand up compute capacity (be it EC2 instances, an ECS on Fargate cluster, Lambda functions, etc), make the data accessible (e.g. launch an RDS instance from the copied RDS snapshot), and switch over traffic. It's a complex process, there are different strategies, and there are multiple things to take into account depending on your RTO and RPO.</p>
<p>The next post is going to be about Disaster Recovery strategies, and being prepared to deploy the entire system in another AWS region. The first step towards that is to have the data accessible, so let's focus on that.</p>
<h2 id="heading-how-to-configure-s3-replication-across-different-aws-accounts">How to Configure S3 Replication Across Different AWS Accounts</h2>
<p>Let's walk through a solution to back up data in S3 to another S3 bucket in a different AWS account. To better protect against different disaster scenarios, you should make sure access to this other AWS account is very restricted.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698776900461/abb78146-3fe8-4fac-b909-b2f602c59cb7.png" alt="S3 Replication Across Different AWS Accounts" class="image--center mx-auto" /></p>
<h3 id="heading-step-0-preparation">Step 0: Preparation</h3>
<ol>
<li><p>Log in to an AWS account, let's call it Account A.</p>
</li>
<li><p>Open the <a target="_blank" href="https://console.aws.amazon.com/s3/">S3 console</a></p>
</li>
<li><p>Click Create bucket</p>
</li>
<li><p>Enter a name for the source bucket (must be unique across all AWS)</p>
</li>
<li><p>Scroll down to Bucket Versioning and select Enable</p>
</li>
<li><p>Click Create</p>
</li>
<li><p>Copy or write down the Account ID of Account A (you'll need it later)</p>
</li>
<li><p>Log in to a different AWS account, let's call it Account B</p>
</li>
<li><p>Open the <a target="_blank" href="https://console.aws.amazon.com/s3/">S3 console</a></p>
</li>
<li><p>Click Create bucket</p>
</li>
<li><p>Enter a name for the destination bucket (must be unique across all AWS). Write down the name.</p>
</li>
<li><p>Scroll down to Bucket Versioning and select Enable</p>
</li>
<li><p>Click Create</p>
</li>
<li><p>Copy or write down the Account ID of Account B (you'll need it later)</p>
</li>
</ol>
<h3 id="heading-step-1-enable-replication-in-the-source-bucket">Step 1: Enable Replication in the Source Bucket</h3>
<ol>
<li><p>Log back in to Account A and go to S3</p>
</li>
<li><p>Click on the source bucket (the one you created on Step 0)</p>
</li>
<li><p>Click the Management tab</p>
</li>
<li><p>Scroll down to Replication rules and click Create replication rule</p>
</li>
<li><p>Under Replication rule name enter a name for the rule, such as cross-account replication</p>
</li>
<li><p>Under Choose a rule scope, select Apply to all objects in the bucket</p>
</li>
<li><p>In the Destination section, under Destination select Specify a bucket in another account</p>
</li>
<li><p>Under Account ID enter the Account ID of Account B (where the destination bucket is)</p>
</li>
<li><p>Under Bucket name, enter the name of the destination bucket</p>
</li>
<li><p>Check Change object ownership to destination bucket owner</p>
</li>
<li><p>In the IAM role section, under IAM role open the dropdown and select Create new role</p>
</li>
<li><p>Click Save</p>
</li>
<li><p>Click Submit (we don't have any existing objects, so what we choose here doesn't really matter)</p>
</li>
<li><p>In the Replication configuration settings section, under IAM role, copy the name of the IAM role that was created automatically</p>
</li>
</ol>
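<p>The replication rule configured in the console above can also be expressed in code. This is a sketch of the parameters you'd pass to boto3's <code>put_bucket_replication</code>; the bucket names, account ID, and role ARN are placeholders, and the snippet only builds the request, it doesn't call AWS.</p>

```python
def replication_request(source_bucket: str, dest_bucket: str,
                        dest_account_id: str, role_arn: str) -> dict:
    """Build the put_bucket_replication request for cross-account replication."""
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,  # the role the console created in Step 1
            "Rules": [{
                "ID": "cross-account-replication",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = apply to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": f"arn:aws:s3:::{dest_bucket}",
                    "Account": dest_account_id,
                    # "Change object ownership to destination bucket owner"
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }],
        },
    }

req = replication_request(
    "my-source-bucket", "my-destination-bucket", "222222222222",
    "arn:aws:iam::111111111111:role/service-role/my-replication-role")
```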
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698776910272/f9c114ad-40e1-4b09-951a-e9a6a574c56b.png" alt="Screenshot of S3 Replication" class="image--center mx-auto" /></p>
<h3 id="heading-step-2-update-the-policy-on-the-destination-bucket">Step 2: Update the Policy on the Destination Bucket</h3>
<ol>
<li><p>Log in to Account B and go to S3</p>
</li>
<li><p>Click on the destination bucket (the one you created on Step 0)</p>
</li>
<li><p>Click the Permissions tab</p>
</li>
<li><p>Next to Bucket policy, click Edit</p>
</li>
<li><p>In the following policy, replace <code>ID_OF_ACCOUNT_A</code> with the ID of the account where the source bucket is, <code>NAME_OF_THE_IAM_ROLE</code> with the last value you copied in Step 1, and <code>NAME_OF_THE_DESTINATION_BUCKET</code> with the name of the destination bucket. Then click Save.</p>
</li>
</ol>
<pre><code class="lang-json">{
   <span class="hljs-attr">"Version"</span>:<span class="hljs-string">"2012-10-17"</span>,
   <span class="hljs-attr">"Id"</span>:<span class="hljs-string">""</span>,
   <span class="hljs-attr">"Statement"</span>:[
      {
         <span class="hljs-attr">"Sid"</span>:<span class="hljs-string">"Set-permissions-for-objects"</span>,
         <span class="hljs-attr">"Effect"</span>:<span class="hljs-string">"Allow"</span>,
         <span class="hljs-attr">"Principal"</span>:{
            <span class="hljs-attr">"AWS"</span>:<span class="hljs-string">"arn:aws:iam::ID_OF_ACCOUNT_A:role/service-role/NAME_OF_THE_IAM_ROLE"</span>
         },
         <span class="hljs-attr">"Action"</span>:[<span class="hljs-string">"s3:ReplicateObject"</span>, <span class="hljs-string">"s3:ReplicateDelete"</span>],
         <span class="hljs-attr">"Resource"</span>:<span class="hljs-string">"arn:aws:s3:::NAME_OF_THE_DESTINATION_BUCKET/*"</span>
      },
      {
         <span class="hljs-attr">"Sid"</span>:<span class="hljs-string">"Set permissions on bucket"</span>,
         <span class="hljs-attr">"Effect"</span>:<span class="hljs-string">"Allow"</span>,
         <span class="hljs-attr">"Principal"</span>:{
            <span class="hljs-attr">"AWS"</span>:<span class="hljs-string">"arn:aws:iam::ID_OF_ACCOUNT_A:role/service-role/NAME_OF_THE_IAM_ROLE"</span>
         },
         <span class="hljs-attr">"Action"</span>:[<span class="hljs-string">"s3:List*"</span>, <span class="hljs-string">"s3:GetBucketVersioning"</span>, <span class="hljs-string">"s3:PutBucketVersioning"</span>],
         <span class="hljs-attr">"Resource"</span>:<span class="hljs-string">"arn:aws:s3:::NAME_OF_THE_DESTINATION_BUCKET"</span>
      }
   ]
}
</code></pre>
<h3 id="heading-step-3-upload-an-object-to-the-source-bucket">Step 3: Upload an Object to the Source Bucket</h3>
<ol>
<li><p>Log back in to Account A and go to S3</p>
</li>
<li><p>Click on the source bucket</p>
</li>
<li><p>Click Upload</p>
</li>
<li><p>Click Add files, select one or more files, and click Open</p>
</li>
<li><p>Click Upload</p>
</li>
<li><p>Verify that the file is uploaded to the Source bucket</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698776917944/3c5034e5-cc8b-4a5a-95a9-bd4fc5d80d66.png" alt="Screenshot of the S3 console with an object in the source S3 bucket" class="image--center mx-auto" /></p>
<h3 id="heading-step-4-upload-an-object-to-the-source-bucket">Step 4: Verify the Object in the Destination Bucket</h3>
<ol>
<li><p>Log back in to Account B and go to S3</p>
</li>
<li><p>Click on the destination bucket</p>
</li>
<li><p>Verify that the same file you uploaded to the source bucket is present on the destination bucket</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698776927114/52bebfef-3b08-45ef-b761-0a80427c3dce.png" alt="Screenshot of the S3 console with the object in the destination S3 bucket" class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-recommended-tools-and-resources-about-data-replication"><strong>Recommended Tools and Resources about Data Replication</strong></h2>
<p>When data is replicated across several disks, data loss happens when one of those disks fails and, while a new copy is being created to replace it, another disk also fails. The probability of losing data clearly depends on how often a disk fails (the Mean Time Between Failures, or MTBF) and how long it takes to recreate that copy (the Mean Time To Recovery, or MTTR). But if I gave you those numbers, would you know how to calculate the probability of data loss? I didn't, until I read <a target="_blank" href="https://blog.synology.com/data-durability">this article</a>!</p>
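<p>As a simplified sketch of that calculation (the linked article goes deeper), assume two replicas and exponentially distributed failures: after one disk fails, data is lost if the surviving copy also fails before the rebuild completes, i.e. within the MTTR window.</p>

```python
import math

def p_loss_after_one_failure(mtbf_hours: float, mttr_hours: float) -> float:
    """Probability the surviving replica fails within the MTTR window,
    assuming exponentially distributed failures with the given MTBF."""
    return 1 - math.exp(-mttr_hours / mtbf_hours)

# A disk with a 100,000-hour MTBF and a 10-hour rebuild time: roughly a
# 1-in-10,000 chance of losing data each time a disk fails.
p = p_loss_after_one_failure(100_000, 10)
# A slower rebuild (larger MTTR) makes data loss more likely.
p_slow_rebuild = p_loss_after_one_failure(100_000, 20)
```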
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Detecting Failed Sign In Attempts to AWS and Alerting]]></title><description><![CDATA[Note: This content was originally published at the Simple AWS newsletter.
Imagine this scenario: You're careful with security, and you set up Multi-Factor Authentication for your AWS IAM or IAM Identity Center user. At one point, a malicious agent of...]]></description><link>https://blog.guilleojeda.com/detecting-failed-sign-in-attempts-to-aws-and-alerting</link><guid isPermaLink="true">https://blog.guilleojeda.com/detecting-failed-sign-in-attempts-to-aws-and-alerting</guid><category><![CDATA[AWS]]></category><category><![CDATA[Security]]></category><category><![CDATA[best practices]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 26 Oct 2023 15:26:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698333697034/1cf2c11e-ee87-46ed-a495-65d5ecd74f1d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: This content was originally published at the</em> <a target="_blank" href="https://www.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode"><strong><em>Simple AWS newsletter</em></strong></a><em>.</em></p>
<p>Imagine this scenario: You're careful with security, and you set up Multi-Factor Authentication for your AWS IAM or IAM Identity Center user. At one point, a malicious agent of evilness figures out your password, either via phishing, keylogging, or some other technique. The only thing standing between them and $50k in bitcoin mined on your AWS account (and the $500k AWS bill paid with your credit card) is your MFA device. They don't have access to it (yet!), so you're safe (for now!).</p>
<p>The correct response is obvious: change your password, so they're back to square one. The problem is that most of the time you don't know a password has been compromised until someone uses it. This is the reason we should rotate passwords regularly!</p>
<p>Let me propose an extra layer of security: <strong>Get notified every time a login attempt fails because of a failed MFA check.</strong></p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-understanding-cloudtrail-event-logs">Understanding CloudTrail event logs</h2>
<p>AWS CloudTrail is a service that logs all requests performed against AWS in your account. This includes all actions and requests (even unauthorized ones) done through the Console, CLI, AWS SDKs, and APIs.</p>
<p>By default (already enabled when you create your account) and for free, CloudTrail offers you a viewable, searchable, downloadable, and immutable record of all events that happened in the past 90 days, called Event History.</p>
<p>You can also create Trails, which let you export all or a subset of CloudTrail events to S3 or CloudWatch Logs. This way, you can use other AWS services like Athena and OpenSearch, or external tools like Elasticsearch, to analyze CloudTrail events. Event history is per region, while Trails can be created for all regions in an account, and even for all accounts in an Organization.</p>
<p>Events are JSON objects that contain what was attempted, when, by whom, details of the request, the result, and sometimes details of the response, such as the failure reason.</p>
<p>For example, this is what a ConsoleLogin event that fails looks like (I hid a few details replacing them with HIDDEN):</p>
<pre><code class="lang-json">{
    <span class="hljs-attr">"eventVersion"</span>: <span class="hljs-string">"1.08"</span>,
    <span class="hljs-attr">"userIdentity"</span>: {
        <span class="hljs-attr">"type"</span>: <span class="hljs-string">"IAMUser"</span>,
        <span class="hljs-attr">"principalId"</span>: <span class="hljs-string">"HIDDEN"</span>,
        <span class="hljs-attr">"accountId"</span>: <span class="hljs-string">"HIDDEN"</span>,
        <span class="hljs-attr">"accessKeyId"</span>: <span class="hljs-string">"HIDDEN"</span>,
        <span class="hljs-attr">"userName"</span>: <span class="hljs-string">"HIDDEN"</span>
    },
    <span class="hljs-attr">"eventTime"</span>: <span class="hljs-string">"2023-08-24T21:07:08Z"</span>,
    <span class="hljs-attr">"eventSource"</span>: <span class="hljs-string">"signin.amazonaws.com"</span>,
    <span class="hljs-attr">"eventName"</span>: <span class="hljs-string">"ConsoleLogin"</span>,
    <span class="hljs-attr">"awsRegion"</span>: <span class="hljs-string">"us-east-1"</span>,
    <span class="hljs-attr">"sourceIPAddress"</span>: <span class="hljs-string">"HIDDEN"</span>,
    <span class="hljs-attr">"userAgent"</span>: <span class="hljs-string">"HIDDEN"</span>,
    <span class="hljs-attr">"errorMessage"</span>: <span class="hljs-string">"Failed authentication"</span>,
    <span class="hljs-attr">"requestParameters"</span>: <span class="hljs-literal">null</span>,
    <span class="hljs-attr">"responseElements"</span>: {
        <span class="hljs-attr">"ConsoleLogin"</span>: <span class="hljs-string">"Failure"</span>
    },
    <span class="hljs-attr">"additionalEventData"</span>: {
        <span class="hljs-attr">"LoginTo"</span>: <span class="hljs-string">"https://console.aws.amazon.com/console/home?HIDDEN"</span>,
        <span class="hljs-attr">"MobileVersion"</span>: <span class="hljs-string">"No"</span>,
        <span class="hljs-attr">"MFAUsed"</span>: <span class="hljs-string">"Yes"</span>
    },
    <span class="hljs-attr">"eventID"</span>: <span class="hljs-string">"HIDDEN"</span>,
    <span class="hljs-attr">"readOnly"</span>: <span class="hljs-literal">false</span>,
    <span class="hljs-attr">"eventType"</span>: <span class="hljs-string">"AwsConsoleSignIn"</span>,
    <span class="hljs-attr">"managementEvent"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-attr">"recipientAccountId"</span>: <span class="hljs-string">"HIDDEN"</span>,
    <span class="hljs-attr">"eventCategory"</span>: <span class="hljs-string">"Management"</span>,
    <span class="hljs-attr">"tlsDetails"</span>: {
        <span class="hljs-attr">"tlsVersion"</span>: <span class="hljs-string">"TLSv1.3"</span>,
        <span class="hljs-attr">"cipherSuite"</span>: <span class="hljs-string">"TLS_AES_128_GCM_SHA256"</span>,
        <span class="hljs-attr">"clientProvidedHostHeader"</span>: <span class="hljs-string">"signin.aws.amazon.com"</span>
    }
}
</code></pre>
<h2 id="heading-how-are-iam-and-iam-identity-center-events-logged-in-cloudtrail">How are IAM and IAM Identity Center events logged in CloudTrail?</h2>
<p>Naturally, CloudTrail logs events when an IAM or IAM IC user logs in to AWS. However, it's not just one event. Let me show you what happens behind the scenes with IAM IC and IAM.</p>
<h3 id="heading-cloudtrail-events-for-iam-identity-center-users-logging-in">CloudTrail Events for IAM Identity Center users logging in</h3>
<p>These are the events that CloudTrail logs when an IAM Identity Center user logs in:</p>
<ul>
<li><p><strong>CredentialChallenge:</strong> AWS requested some form of credential, such as password or MFA device. Each of these is followed by UserAuthentication and one CredentialVerification event, and this sequence of three is repeated until either all necessary credentials are provided, or CredentialVerification fails.</p>
</li>
<li><p><strong>UserAuthentication:</strong> AWS receives the requested credentials.</p>
</li>
<li><p><strong>CredentialVerification:</strong> AWS checks whether the received credentials are valid. If they are, this event contains: <code>"serviceEventDetails":{ "CredentialChallenge":"Success" }</code>, and the process continues either by requesting the next credential or by authenticating. If the credentials received are invalid, this event contains <code>"serviceEventDetails":{ "CredentialChallenge":"Failure" }</code> and the process stops. The type of credential requested can be found in <code>"additionalEventData":{ "CredentialType":"PASSWORD" }</code>, the value of which can be <code>PASSWORD</code> for the regular password, <code>TOTP</code> for MFA devices that produce temporary codes, <code>WEBAUTHN</code> for web apps using WebAuthn, <code>EXTERNAL_IDP</code> for external identity providers, or <code>RESYNC_TOTP</code> to re-synchronize TOTP devices.</p>
</li>
<li><p><strong>Authenticate:</strong> Once CredentialVerification succeeds and no more credentials are required, this event is logged. This means the user successfully authenticated to IAM Identity Center. If you're authenticating with an external Identity Provider such as Google Workspace, Microsoft AD or Okta, this is the only event you'll see.</p>
</li>
<li><p><strong>Federate:</strong> This event means the IAM IC user assumed an IAM role in an AWS account.</p>
</li>
<li><p><strong>ConsoleLogin:</strong> This event means the IAM IC user logged in to the AWS Console using the assumed IAM role.</p>
</li>
</ul>
<p>The first four events will be logged on the AWS account and region where IAM IC is configured. The last two are logged in the AWS account where the user signs in, in the default region for that user (i.e. the one that's selected when the user signs in).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698333661768/557e2163-f35e-4bfb-80f4-25702026b2e0.png" alt="Screenshot of CloudTrail events" class="image--center mx-auto" /></p>
<h3 id="heading-cloudtrail-events-for-iam-users-logging-in">CloudTrail Events for IAM users logging in</h3>
<p>IAM is much simpler. The only event is ConsoleLogin, and this is how you can identify what happened:</p>
<ul>
<li><p>If the user signed in successfully, the event contains <code>"responseElements": { "ConsoleLogin": "Success" }</code></p>
</li>
<li><p>If sign in failed, the event contains <code>"responseElements": { "ConsoleLogin": "Failure" }</code> and can contain <code>"errorMessage": "Failed authentication"</code></p>
</li>
<li><p>If the user used MFA (regardless of success or failure), the event contains <code>"additionalEventData": { "MFAUsed": "Yes" }</code></p>
</li>
<li><p>If the user is the root user, the event contains <code>"userIdentity": { "type": "Root" }</code></p>
</li>
</ul>
<p>All of these events are always logged in the us-east-1 region.</p>
<h2 id="heading-identifying-when-signing-in-fails-due-to-mfa">Identifying when signing in fails due to MFA</h2>
<p>The last section was a bit detailed, but it contains all the information we need to identify when a login attempt fails the MFA check!</p>
<p><strong>For IAM IC</strong>, we need to find CredentialVerification events which contain either <code>"additionalEventData":{ "CredentialType":"TOTP" }</code> or <code>"additionalEventData":{ "CredentialType":"WEBAUTHN" }</code>, and contain <code>"serviceEventDetails":{ "CredentialChallenge":"Failure" }</code>.</p>
<p><strong>For IAM</strong>, we need to find ConsoleLogin events which contain <code>"responseElements": { "ConsoleLogin": "Failure" }</code> and <code>"additionalEventData": { "MFAUsed": "Yes" }</code>.</p>
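<p>Those two conditions can be expressed as a predicate over parsed CloudTrail events. This mirrors the logic of the CloudWatch Logs metric filters, just in Python; the sample events below are simplified, not full CloudTrail records.</p>

```python
def is_failed_mfa_signin(event: dict) -> bool:
    """True if this CloudTrail event is a sign-in that failed an MFA check."""
    extra = event.get("additionalEventData", {})
    # IAM: a ConsoleLogin that used MFA and failed.
    if event.get("eventName") == "ConsoleLogin":
        return (event.get("responseElements", {}).get("ConsoleLogin") == "Failure"
                and extra.get("MFAUsed") == "Yes")
    # IAM Identity Center: a TOTP/WebAuthn credential check that failed.
    if event.get("eventName") == "CredentialVerification":
        return (extra.get("CredentialType") in ("TOTP", "WEBAUTHN")
                and event.get("serviceEventDetails", {}).get("CredentialChallenge") == "Failure")
    return False

iam_failure = {"eventName": "ConsoleLogin",
               "responseElements": {"ConsoleLogin": "Failure"},
               "additionalEventData": {"MFAUsed": "Yes"}}
ic_failure = {"eventName": "CredentialVerification",
              "additionalEventData": {"CredentialType": "TOTP"},
              "serviceEventDetails": {"CredentialChallenge": "Failure"}}
# A failed password check is not an MFA failure, so it shouldn't match.
password_failure = {"eventName": "CredentialVerification",
                    "additionalEventData": {"CredentialType": "PASSWORD"},
                    "serviceEventDetails": {"CredentialChallenge": "Failure"}}
```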
<p>You can do this manually in the CloudTrail event history, but if we want to automate it and get notifications, we need to send the events to CloudWatch Logs.</p>
<h2 id="heading-sending-cloudtrail-events-to-cloudwatch-logs">Sending CloudTrail events to CloudWatch Logs</h2>
<ol>
<li><p>Sign in to the AWS Console. For IAM IC, sign in to the management account of the Organization</p>
</li>
<li><p>Go to CloudTrail</p>
</li>
<li><p>Click Create trail</p>
</li>
<li><p>For Trail name, enter management-events</p>
</li>
<li><p>If you're configuring this for an organization, check Enable for all accounts in my organization</p>
</li>
<li><p>Select Create new S3 bucket and under Trail log bucket and folder enter a name, or leave it as default</p>
</li>
<li><p>Under AWS KMS alias, enter a name for a KMS key to encrypt the logs</p>
</li>
<li><p>Under CloudWatch Logs, check Enabled</p>
</li>
<li><p>Under Log group name, enter a name for the CloudWatch Logs log group</p>
</li>
<li><p>Under Role name, enter a name for the IAM Role that'll let CloudTrail put logs to CloudWatch Logs</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Leave these options as default and click Next again</p>
</li>
<li><p>Click Create trail</p>
</li>
<li><p>Wait until the Status column changes to Logging (green)</p>
</li>
</ol>
<h2 id="heading-configuring-alerts-and-notifications-for-cloudwatch-logs">Configuring Alerts and Notifications for CloudWatch Logs</h2>
<p>First, let's create a filter to view the failed logins in CloudWatch Logs. Follow the steps, and on Step 5 choose the right filter pattern depending on whether you're using IAM or IAM IC.</p>
<h3 id="heading-filtering-failed-login-attempts-due-to-mfa">Filtering failed login attempts due to MFA</h3>
<ol>
<li><p>Open the CloudWatch console</p>
</li>
<li><p>In the panel on the left, under Logs, click Log groups.</p>
</li>
<li><p>Click on the name of the log group that you created for the trail</p>
</li>
<li><p>Click the Metric filters tab, and click Create metric filter</p>
</li>
<li><p>Under Filter pattern, enter the pattern that corresponds to IAM or IAM IC, depending on which you're using:</p>
<ol>
<li><p>For IAM: <code>{ ($.eventName = ConsoleLogin) &amp;&amp; ($.additionalEventData.MFAUsed = "Yes") &amp;&amp; ($.responseElements.ConsoleLogin = "Failure") }</code></p>
</li>
<li><p>For IAM IC: <code>{ ($.eventName = CredentialVerification) &amp;&amp; (($.additionalEventData.CredentialType = "TOTP") || ($.additionalEventData.CredentialType = "WEBAUTHN")) &amp;&amp; ($.serviceEventDetails.CredentialChallenge = "Failure")}</code></p>
</li>
</ol>
</li>
<li><p>Click Next</p>
</li>
<li><p>For Filter name, enter SignInFailedMFA</p>
</li>
<li><p>Under Metric namespace, enter CloudTrailMetrics</p>
</li>
<li><p>For Metric name, enter SigninMFAFailureCount</p>
</li>
<li><p>For Metric value, enter 1</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Click Create metric filter</p>
</li>
</ol>
<h3 id="heading-configuring-alerts-for-failed-login-attempts-due-to-mfa">Configuring alerts for failed login attempts due to MFA</h3>
<ol>
<li><p>On the Metric filters tab, find the metric filter you just created, select it and click Create alarm</p>
</li>
<li><p>Under Whenever SigninMFAFailureCount is..., select Greater/Equal</p>
</li>
<li><p>Under than…, enter 1</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Under Send a notification to the following SNS topic, select Create new topic</p>
</li>
<li><p>Under Create a new topic…, enter SignInFailedMFAAlarm</p>
</li>
<li><p>Under Email endpoints that will receive the notification…, enter your email address</p>
</li>
<li><p>Click the Create topic button that's right below where you just entered your email address</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Under Alarm name, enter SignInFailedMFAAlarm</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Click Create alarm</p>
</li>
<li><p>Open your email inbox, open the email titled AWS Notification - Subscription Confirmation, and click Confirm subscription</p>
</li>
</ol>
<h2 id="heading-testing-alerts-when-signing-in-to-aws-fails-due-to-incorrect-mfa-code">Testing Alerts When Signing in to AWS Fails Due to Incorrect MFA Code</h2>
<p>To test it, attempt to sign in and enter an incorrect MFA code. There's going to be a delay of a few minutes between when the event actually occurs and when you receive the notification: CloudTrail usually takes around 5 minutes to deliver the event logs to CloudWatch Logs, and the alarm can take a couple more minutes to fire.</p>
<p>You can also monitor the alarm in CloudWatch Alarms:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698333632290/7f86a818-ffc0-4489-9dca-5f8e342c286c.png" alt="Screenshot of the SignInFailedMFAAlarm in alarm state" class="image--center mx-auto" /></p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Server-Side Rendering with AWS Amplify]]></title><description><![CDATA[Note: This content was originally published at the Simple AWS newsletter.
Back in the Paleolithic, which for software means 30 years ago, we had HTML, CSS and JavaScript, and we wrote all the structure of the website in HTML. When we wanted to create...]]></description><link>https://blog.guilleojeda.com/server-side-rendering-with-aws-amplify</link><guid isPermaLink="true">https://blog.guilleojeda.com/server-side-rendering-with-aws-amplify</guid><category><![CDATA[AWS]]></category><category><![CDATA[Frontend Development]]></category><category><![CDATA[Server side rendering]]></category><category><![CDATA[AWS Amplify]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Tue, 24 Oct 2023 18:31:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698172044299/2404a95d-ba34-40e0-8bd7-55f968c7590f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: This content was originally published at the</em> <a target="_blank" href="https://www.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode"><strong><em>Simple AWS newsletter</em></strong></a><em>.</em></p>
<p>Back in the Paleolithic, which for software means 30 years ago, we had HTML, CSS and JavaScript, and we wrote all the structure of the website in HTML. When we wanted to create dynamic content for our website, we added PHP or Java code in the middle of that HTML, and we would run that code on the server and it would output the HTML that we wanted to put there. That process of producing the final HTML for the website by running code is called rendering, and back then it happened Server-Side.</p>
<p>Then came web components and frameworks like Angular, React and Vue, and they changed the paradigm a bit. Now we didn't write plain HTML with snippets of PHP or Java, but instead we wrote JavaScript code that would then output the entire HTML code. Since it was JavaScript, we could run it on the browser, at the user's computer. That's called Client-Side Rendering.</p>
<p>A huge benefit of Client-Side Rendering was that we were using the user's CPU. That was awesome for our pockets, we didn't have to pay for that compute capacity. However, it meant the user had to wait however long their computer took to render that website. Mind you, it wasn't considered a big problem back then. Remember that back in those days users' expectations were pretty different: it was considered normal to wait 3 or 5 seconds for a website to load. Nowadays, more than 2 seconds seems like an eternity.</p>
<p>As compute capacity got cheaper, someone had the idea of bringing the rendering back to the server. The idea was to use our significantly more powerful servers to render the website much faster than the user's computer could, reducing load time significantly. Sure, we were paying extra, but overall the improved user experience was worth the extra money. Cloud computing played a big part there as well: Not only was infrastructure cheaper per CPU-hour, but it also required fewer engineering hours to create and maintain.</p>
<p>We didn't bring back the old languages though! Instead, we started running that same React code on our servers, as if they were the user's browser. A big reason for this is that, while all of this back and forth was happening, frontends got increasingly complex. That led to developers being classified as frontend or backend, and while there are full stack devs, most are just strong on one side and really weak on the other one (I'm one of those cases!). So, the folks who knew how to code UIs didn't know Java or PHP (nowadays not even many backend devs do), and the folks who did know those languages didn't know how to code UIs. The solution? Let's create a framework like Next.js that runs that same React code on our servers. That way, we get the best of both worlds: Frontend frameworks with Server-Side Rendering!</p>
<p>That's how we went from Server-Side Rendering, to Client-Side Rendering, and back to Server-Side Rendering. I've said it often, technology is cyclic.</p>
<h2 id="heading-how-does-amplify-help">How does Amplify help</h2>
<p>AWS Amplify is a set of tools that help frontend devs use AWS infrastructure and even build backends, without knowing a lot about infrastructure or backend. There's two parts to it: <strong>Amplify Hosting</strong> is a managed service that provides hosting and CI/CD for serverless apps. <strong>Amplify Studio</strong> is a visual development environment that lets you build a UI and a backend as with a no-code tool, with reusable components. We're going to use Amplify Hosting, but you might be interested in checking out Amplify Studio.</p>
<h2 id="heading-why-not-an-ec2-instance">Why not an EC2 instance?</h2>
<p>We know that AWS Amplify is going to be using EC2 instances behind the scenes, right? So, since we really know our way around AWS (I mean, you're reading a newsletter about AWS!), why shouldn't we just use EC2 instances for this instead of relying on a managed service? You could! It's a bit of our classic buy vs build decision, right? And you know by now that I always recommend you default to managed services and only build if there's a good reason to do so. I'm tempted to recommend the same here, but Amplify takes the managed in managed service and cranks it up to 11, and the price reflects that.</p>
<p>Amplify works as a self-service platform that people with little to no knowledge of cloud infrastructure can use to develop and deploy their applications. It's an extreme version of a managed service: You're no longer solving just one part of the problem, but the entire problem of hosting an app in the cloud. It does get pretty expensive, so you probably want to consider EC2, or an ECS cluster on Fargate (more expensive than plain EC2, but easier). Let's run some numbers, so you can see how much it can really cost you.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-aws-amplify-pricing">AWS Amplify Pricing</h2>
<p>Here's how Amplify charges you:</p>
<ul>
<li><p>Build and deploy: $0.01 per minute</p>
</li>
<li><p>Data storage: $0.023 per GB per month</p>
</li>
<li><p>Data transfer out: $0.15 per GB served</p>
</li>
<li><p>Requests for SSR: $0.30 per 1 million requests + $0.20 per GB-hour</p>
</li>
</ul>
<p>For example, say you're a startup with the following assumptions:</p>
<ul>
<li><p>10,000 daily active users</p>
</li>
<li><p>5 devs, each doing 2 commits a day</p>
</li>
<li><p>Average build time is 3 minutes</p>
</li>
<li><p>The team works Monday to Friday (20 days/month) <strong>&lt;-- this is the least realistic assumption for startups</strong></p>
</li>
<li><p>The app is 25 MB</p>
</li>
<li><p>Average page size is 1.5 MB</p>
</li>
</ul>
<p>Here are our calculations:</p>
<p><code>Total build time per month = devs * commits/day * days/month * avg. build time = 5 * 2 * 20 * 3 = 600 build minutes per month. 600 * $0.01 = $6</code></p>
<p><code>Monthly GB served = daily active users * average page size * days/month = 10,000 * (1.5/1024) * 30 = 439.45 GB. 439.45 * $0.15 = $65.92</code></p>
<p><code>Monthly GB storage = app size * builds/month = (25/1024) * (5*2*20) = 4.88 GB. 4.88 * $0.023 = $0.11</code></p>
<p><strong>Total charges = $6 + $65.92 + $0.11 = $72.03/month</strong></p>
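<p>Those numbers are easy to double-check with a few lines of Python, using the same assumptions:</p>

```python
# Assumptions from the startup example above
devs, commits_per_day, work_days = 5, 2, 20
build_minutes_each = 3
daily_users, page_mb, days = 10_000, 1.5, 30
app_mb = 25

build_minutes = devs * commits_per_day * work_days * build_minutes_each
build_cost = build_minutes * 0.01                      # $0.01/minute

gb_served = daily_users * (page_mb / 1024) * days
transfer_cost = gb_served * 0.15                       # $0.15/GB served

gb_stored = (app_mb / 1024) * (devs * commits_per_day * work_days)
storage_cost = gb_stored * 0.023                       # $0.023/GB-month

total = build_cost + transfer_cost + storage_cost
print(f"${total:.2f}/month")  # $72.03/month
```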
<p>Ok, that wasn't so bad, right? Well, it's dominated by Monthly GB served, so let's run those numbers with CloudFront:</p>
<ul>
<li><p>For the most expensive regions: <code>439.45 * $0.12 = $52.73</code></p>
</li>
<li><p>For the least expensive regions: <code>439.45 * $0.085 = $37.35</code></p>
</li>
</ul>
<p>Amplify is 25% to 75% more expensive! Is that a lot? It depends. 75% more on $0.10 is less than the electricity I spent typing this sentence (shame on me! especially for adding this parenthesis to make the sentence long enough for that claim to be true). 75% more on $1,000 is $750, and I'll set up CloudFront for you for that money!</p>
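<p>That 25% to 75% range comes straight from the data transfer line. Here's the arithmetic:</p>

```python
gb_served = 439.45
amplify_cost = gb_served * 0.15      # $65.92 (Amplify data transfer out)
cf_high = gb_served * 0.12           # $52.73 (priciest CloudFront regions)
cf_low = gb_served * 0.085           # $37.35 (cheapest CloudFront regions)

# How much more expensive Amplify's transfer pricing is than CloudFront's
print(f"{amplify_cost / cf_high - 1:.0%}")  # 25%
print(f"{amplify_cost / cf_low - 1:.0%}")   # 76%
```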
<h2 id="heading-aws-amplify-for-the-backend">AWS Amplify for the backend</h2>
<p>Amplify also lets you host a backend, which it runs in Lambda functions. You don't have a lot of control over it, but it works well for its intended audience: People who wouldn't know what to do if they had a lot of control over their Lambda functions. Amplify also lets you consume other AWS services easily, through declarative and easy-to-use libraries. That way, you can consume Cognito or S3 from the frontend without knowing a lot about Cognito or S3. Here's the <a target="_blank" href="https://github.com/aws-amplify">complete list of libraries for Amplify</a>, and you can check the Readme of <a target="_blank" href="https://github.com/aws-amplify/amplify-js">the JavaScript one</a> as an example of its features.</p>
<h2 id="heading-scenario">Scenario</h2>
<p>You work 8 hours a day as an engineer, and you want to launch a startup in your free time. You know you want good practices, but you don't have the time to set everything up manually. You want to start with the website, built with React and Next.js. You want Server-Side Rendering, and obviously a CI/CD Pipeline. You want everything to be serverless so you pay as little as possible while you get the hang of running a startup, getting users and all of that (which is actually the most difficult part).</p>
<h2 id="heading-solution">Solution</h2>
<p>Host the application in AWS Amplify, which handles hosting and CI/CD. As you build the backend, either use Amplify to create your Lambda functions, or go with something more traditional like Serverless or SAM, depending on your expertise.</p>
<h2 id="heading-step-by-step-instructions">Step-by-step Instructions</h2>
<h3 id="heading-step-0-setup">Step 0: Setup</h3>
<ol>
<li><p>Download and install Node.js from <a target="_blank" href="https://nodejs.org/en/download">the official website</a>. Alternatively, install nvm and your favorite version of Node.js</p>
</li>
<li><p>Install npm if it didn't come with Node.js.</p>
</li>
<li><p>Install yarn: <code>npm install --global yarn</code></p>
</li>
<li><p>Install git</p>
</li>
</ol>
<h3 id="heading-step-1-create-a-nextjs-app">Step 1: Create a Next.js app</h3>
<ol>
<li><p>Open a terminal</p>
</li>
<li><p>Run the following command: <code>yarn create next-app</code></p>
</li>
<li><p>Follow the prompts:</p>
<p> What is your project named? simpleaws-app</p>
<p> Would you like to use TypeScript? No</p>
<p> Would you like to use ESLint? Yes</p>
<p> Would you like to use Tailwind CSS? No</p>
<p> Would you like to use <code>src/</code> directory? Yes</p>
<p> Would you like to use App Router? (recommended) Yes</p>
<p> Would you like to customize the default import alias? No</p>
</li>
<li><p>Change directories to the app's directory: <code>cd simpleaws-app</code></p>
</li>
<li><p>Start the app locally: <code>yarn dev</code></p>
</li>
<li><p>Open your browser, go to http://localhost:3000 and check that the page loads correctly.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698172057997/8b05d636-d6b0-4f23-a375-4d0bbb1e0745.png" alt="Screenshot of the app working" class="image--center mx-auto" /></p>
<h3 id="heading-step-2-create-a-git-repo-for-the-app">Step 2: Create a git repo for the app</h3>
<ol>
<li><p>Go to <a target="_blank" href="https://github.com/new">https://github.com/new</a> and create a new repository</p>
</li>
<li><p>Init the repo locally: <code>git init</code></p>
</li>
<li><p>Add the GitHub repository as a remote (replace <code>YOUR_USERNAME</code> and <code>PROJECT_NAME</code> with your values): <code>git remote add origin git@github.com:YOUR_USERNAME/PROJECT_NAME.git</code></p>
</li>
<li><p>Add the files, commit and push:<br /> <code>git add .</code></p>
<p> <code>git commit -m 'initial commit'</code></p>
<p> <code>git push origin main</code></p>
</li>
</ol>
<h3 id="heading-step-3-create-an-amplify-project">Step 3: Create an Amplify project</h3>
<ol>
<li><p>Go to the <a target="_blank" href="https://us-east-1.console.aws.amazon.com/amplify/home">Amplify console</a></p>
</li>
<li><p>Scroll down to the <strong>Get started</strong> section, and under Amplify Hosting, click Get started</p>
</li>
<li><p>Select GitHub and click Continue</p>
</li>
<li><p>Click Authorize AWS Amplify (us-east-1) (the green button)</p>
</li>
<li><p>Select your user or organization and click Continue</p>
</li>
<li><p>Select Only select repositories, click the dropdown Select repositories and click on your repo. Click Install &amp; Authorize</p>
</li>
<li><p>Authenticate to GitHub with your security key, or click Use your password (it's in a really small font) and enter your password.</p>
</li>
<li><p>Click the dropdown under Recently updated repositories and select your repo. Leave branch as main, and click Next</p>
</li>
<li><p>Verify the Build and test settings (they should be fine, they were created by Next.js when you initialized the project)</p>
</li>
<li><p>Check Allow AWS Amplify to automatically deploy all files hosted in your project root directory</p>
</li>
<li><p>Click Next</p>
</li>
<li><p>Click Save and deploy</p>
</li>
<li><p>Wait until Provision, Build and Deploy are done</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698172104468/fe3773f3-c21b-4480-895a-d8b2046877a7.png" alt="Provision, Build and Deploy steps" class="image--center mx-auto" /></p>
<h3 id="heading-step-4-test-the-app">Step 4: Test the app</h3>
<ol>
<li>Click on the link under the window icon with the Amazon arrow. Verify that the website loads correctly.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1698172126549/dda6b2cb-6917-4f18-bd66-4efa2b975faf.png" alt="How to open the website" class="image--center mx-auto" /></p>
<h3 id="heading-step-5-delete-the-app">Step 5: Delete the app</h3>
<ol>
<li><p>On the top right corner click Actions</p>
</li>
<li><p>Click Delete app</p>
</li>
<li><p>Enter "delete"</p>
</li>
<li><p>Click Delete</p>
</li>
</ol>
<h2 id="heading-explanation">Explanation</h2>
<h3 id="heading-step-0-setup-1">Step 0: Setup</h3>
<p>Just installing some tools and dependencies.</p>
<h3 id="heading-step-1-create-a-nextjs-app-1">Step 1: Create a Next.js app</h3>
<p>Next.js has this awesome project initializer, that creates all the basic scaffolding you need. It is a bit opinionated, but I've never met anyone who doesn't like it (if you don't like it, let me know, you'll be the first!). Besides, it fits the scenario: You just want to get things done.</p>
<h3 id="heading-step-2-create-a-git-repo-for-the-app-1">Step 2: Create a git repo for the app</h3>
<p>We need the app in a git repo so Amplify can track changes to branches and pull the code from there. You can use GitHub, GitLab, Bitbucket or CodeCommit. You can also not set up a git repo and upload the code manually (or reference an S3 bucket), but in that case Amplify can't do the CI/CD for you.</p>
<h3 id="heading-step-3-create-an-amplify-project-1">Step 3: Create an Amplify project</h3>
<p>The only odd thing here would be the build steps. Amplify auto-detects them from your package.json file, and creates the configuration file that you saw in that step. Of course, you can edit it. I'd recommend you keep it in line with your package.json file though.</p>
<h3 id="heading-step-4-test-the-app-1">Step 4: Test the app</h3>
<p>After everything is deployed, Amplify will be serving your app in a URL that looks something like <code>branch-name.d1m7bkiki6tdw1.amplifyapp.com</code>. You can set up a custom domain through the Amplify console. Don't just go to Route 53 and point a domain to your Amplify URL, that'll fail because of the SSL certificates.</p>
<h3 id="heading-step-5-delete-the-app-1">Step 5: Delete the app</h3>
<p>Let's not pretend like we always remember to delete the things we deploy. This is a friendly reminder to delete the app!</p>
<hr />
<h2 id="heading-best-practices">Best Practices</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><p><strong>Set up a branch for each environment:</strong> Create a dev env where Amplify deploys from the dev branch, and a prod env where Amplify deploys from the main branch. Only commit to those branches through pull requests. That way, merging a PR means a release in that environment.</p>
</li>
<li><p><strong>Monitor Performance:</strong> Amplify has a monitoring service that allows you to view logs, build events, and other metrics.</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><strong>Use Amplify's Built-in Authentication:</strong> Amplify integrates with Cognito for user authentication. This lets you use a Cognito user pool really easily.</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><strong>Data Backup and Versioning:</strong> If you're using Amplify DataStore, regularly backup your data.</li>
</ul>
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><strong>API Caching:</strong> If you're using GraphQL with AppSync, enable caching to improve API response times and reduce the load on your backend.</li>
</ul>
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><strong>Consider not using Amplify:</strong> I don't want to position myself for or against Amplify, because the decision depends on your scenario and your problems (remember there is no single best solution!). Just consider the costs, the tradeoffs, and analyze what's the best solution for you. It may be Amplify, it may be doing things manually! Overall, remember that the cost of maintaining software is much higher than the cost of building the initial version.</li>
</ul>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Understanding How DynamoDB Scales]]></title><description><![CDATA[Note: This content was originally published at the Simple AWS newsletter.
As you probably know, DynamoDB is a NoSQL database. It's a managed, serverless service, meaning you just create a Table (that's the equivalent of a Database in Postgres), and A...]]></description><link>https://blog.guilleojeda.com/understanding-how-dynamodb-scales</link><guid isPermaLink="true">https://blog.guilleojeda.com/understanding-how-dynamodb-scales</guid><category><![CDATA[AWS]]></category><category><![CDATA[DynamoDB]]></category><category><![CDATA[Databases]]></category><category><![CDATA[architecture]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 19 Oct 2023 14:40:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1697726322878/f8784b70-447b-4fd5-bcb7-975452e80e57.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: This content was originally published at the</em> <a target="_blank" href="https://www.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode"><strong><em>Simple AWS newsletter</em></strong></a><em>.</em></p>
<p>As you probably know, DynamoDB is a NoSQL database. It's a managed, serverless service, meaning you just create a Table (that's the equivalent of a Database in Postgres), and AWS manages the underlying nodes. It's highly available, meaning nodes are distributed across 3 AZs, so loss of an AZ doesn't bring down the service. The nodes aren't like an RDS Failover Replica though, instead data is partitioned (that's why Dynamo has a Partition Key!) and split across nodes, plus replicated on other nodes for availability and resilience. That means DynamoDB can scale horizontally!</p>
<p>There are two modes for DynamoDB, which affect how it scales and how you're billed:</p>
<h2 id="heading-dynamodb-provisioned-mode">DynamoDB Provisioned Mode</h2>
<p>You define some capacity, and DynamoDB provisions that capacity for you. This is pretty similar to provisioning an Auto Scaling Group of EC2 instances, but imagine the size of the instance is fixed, and it's one group for reads and another one for writes. Here's how that capacity translates into actual read and write operations.</p>
<h3 id="heading-capacity-in-provisioned-mode">Capacity in Provisioned Mode</h3>
<p>Capacity is provisioned separately for reads and writes, and it's measured in Capacity Units.</p>
<p><strong>1 Read Capacity Unit (RCU) is equivalent to 1 strongly consistent read of up to 4 KB, per second</strong>. Eventually consistent reads consume half that capacity. Reads over 4 KB consume 1 RCU (1/2 for eventually consistent) per 4 KB, rounded up. That means if you have 5 RCUs, you can perform 10 eventually consistent reads every second, or 2 strongly consistent reads for 7 KB of data each (remember it's rounded up) plus 1 strongly consistent read for 1 KB of data (again, it's rounded up).</p>
<p>Write Capacity Units (WCU) work the same, but for writes. <strong>1 WCU = 1 write per second, of up to 1 KB</strong>. So, with 5 WCUs, you can perform 1 write operation per second of 4.5 KB, or 5 writes of less than 1 KB.</p>
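<p>The rounding rules are easy to get wrong, so here's a quick Python sketch of the capacity math (the helper names are mine, not an AWS API):</p>

```python
import math

def rcus_for_read(item_kb, strongly_consistent=True):
    """RCUs consumed by one read: 1 RCU per 4 KB (rounded up),
    halved for eventually consistent reads."""
    units = math.ceil(item_kb / 4)
    return units if strongly_consistent else units / 2

def wcus_for_write(item_kb):
    """WCUs consumed by one write: 1 WCU per 1 KB, rounded up."""
    return math.ceil(item_kb)

# The examples from the text:
print(rcus_for_read(7))                             # 2 (7 KB rounds up to 8 KB)
print(rcus_for_read(4, strongly_consistent=False))  # 0.5
print(wcus_for_write(4.5))                          # 5
```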
<p>Remember that all operations inside a transaction consume twice the capacity, because DynamoDB uses two-phase commit for transactions. Every node has to simulate the operation and then actually perform it, so it's twice the work.</p>
<p>Also remember that local secondary indexes (LSIs) make each write consume additional capacity: 1 extra write operation for puts or deletes, 2 extra operations per update, with actual WCUs depending on how much data is written on the index (not the base table). Reads on an LSI that query for attributes that aren't projected on the LSI also consume additional read capacity: 1 additional read operation (RCUs for it depend on consistency and size of the data read) for each item that is read from the base table.</p>
<p>If you exceed the capacity (e.g. you have 5 RCUs and in one second you try to do 6 strongly consistent reads), you receive a <code>ProvisionedThroughputExceededException</code>. Your code should catch this and retry. DynamoDB doesn't overload from this, it'll just keep accepting operations up to your capacity and reject the rest (this is called load shedding btw). The AWS SDK already implements retries with exponential backoff, and you can tune the parameters.</p>
<h3 id="heading-tokens-burst-capacity-and-adaptive-capacity">Tokens, Burst Capacity and Adaptive Capacity</h3>
<p>Under the hood, DynamoDB splits the data across several partitions, and capacity is split evenly across those partitions. So, if you set 30 RCUs for a table and it has 3 partitions, each partition gets 10 RCUs. Each partition has a "<strong>token bucket</strong>", which refills at a rate of 1 token per second per RCU (so 10 tokens per second in this case). Each read (strongly consistent, up to 4 KB) consumes 1 token, and if there are no more tokens, you get a <code>ProvisionedThroughputExceededException</code>.</p>
<p>There's two separate buckets, one for reads and one for writes. They both work exactly the same, the only difference is the operations that consume those tokens, and the size of the data (4 KB for reads, 1 KB for writes). I'll talk about RCUs, but the same is true for WCUs.</p>
<p>The tokens bucket has a maximum capacity of 300 * RCUs. For our example of 10 RCUs per partition (remember that each partition has its own bucket), it has a maximum capacity of 3000 tokens, refilling at 10 tokens per second. That means, with no operations going on, it takes 5 minutes to fill up to capacity.</p>
<p>If there's a sudden spike in traffic, these extra tokens that have been piling up will be used to execute those operations, effectively increasing the partition's capacity temporarily. For example, if every user performs 1 strongly consistent read per second on this partition, your RCUs of 10 per partition would serve 10 users. Suppose you don't have users for 5 minutes, the bucket fills up. Then, 20 users come in all of a sudden, making 20 reads per second on this partition. Thanks to those stored tokens, the partition can sustain those 20 reads per second for 5 minutes, even though the partition's RCUs are 10. AWS doesn't handle huge spikes instantaneously (e.g. it won't serve 3000 reads in a second, even if you do have 3000 tokens), but it scales this over a few seconds. This is called <strong>Burst Capacity</strong>, and it's completely separate from Auto Scaling (and will happen even with Auto Scaling disabled).</p>
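<p>The token bucket behavior is simple enough to simulate. This toy model (my own illustration of the mechanism described above, not DynamoDB's actual implementation) shows a 10-RCU partition with a full bucket absorbing 20 reads per second:</p>

```python
class PartitionBucket:
    """Toy model of one partition's read token bucket: refills at `rcus`
    tokens/second, holds at most 300 * rcus tokens."""
    def __init__(self, rcus):
        self.rcus = rcus
        self.max_tokens = 300 * rcus
        self.tokens = self.max_tokens  # start full: 5+ idle minutes

    def tick(self, reads):
        """Advance one second: refill, then serve reads (1 token each, i.e.
        strongly consistent reads of up to 4 KB). Returns throttled reads."""
        self.tokens = min(self.max_tokens, self.tokens + self.rcus)
        served = min(reads, self.tokens)
        self.tokens -= served
        return reads - served

# 20 reads/second against a 10-RCU partition, sustained for almost 5 minutes:
bucket = PartitionBucket(rcus=10)
throttled = sum(bucket.tick(20) for _ in range(299))
print(throttled)  # 0: the stored tokens absorb the whole burst
```

Once the bucket runs dry, the partition is back to serving only its refill rate of 10 reads per second, and the rest get throttled.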
<p>Another thing that happens with uneven load on partitions is <strong>Adaptive Capacity</strong>. RCUs are split evenly across partitions, so each of our 3 partitions will have 10 RCUs. If partition 1 is the one getting these 20 users, and the others are getting 0, then AWS can assign part of those 20 spare RCUs you have (remember that we set 30 RCUs on the table) to the partition that's handling that load. The maximum RCUs a partition can get is 1.5x the RCUs it normally gets, so in this case it would get 15 RCUs. That means 5 of our 20 spare RCUs are assigned to that partition, and the other 15 are unused capacity. That would let our partition handle those 20 users for 10 minutes instead of 5 (assuming a full token bucket). This also happens separately from Auto Scaling.</p>
<p>Burst Capacity doesn't effectively change RCUs, but it can make our table temporarily behave as if it had more RCUs than it really does, thanks to those stored tokens (very similar to how CPU credits work for EC2 burstable instances). It's great for performance, but I wouldn't count it as scaling (in case you forgot, we're talking about scaling DynamoDB).</p>
<p>Adaptive Capacity can actually increase RCUs beyond what's set for the table. If all partitions are getting requests throttled, Adaptive Capacity will increase their RCUs up to the 1.5 multiplier, even if this puts the total RCUs of the table above the value you set. This will only last for a few seconds, after which it goes back to the normal RCUs, and to throttling requests. I guess that technically counts as scaling the table's RCUs? Yeah, I'll count that as a win for me. Let's get to the real scaling though.</p>
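<p>Putting numbers on the example above (plain arithmetic, assuming the same full 3,000-token bucket as before):</p>

```python
table_rcus = 30
partitions = 3
base = table_rcus / partitions               # 10 RCUs per partition

# Adaptive Capacity can boost a hot partition up to 1.5x its base share
hot_max = base * 1.5                         # 15 RCUs
borrowed = hot_max - base                    # 5 of the 20 spare RCUs get used
idle_spare = (table_rcus - base) - borrowed  # the other 15 stay unused

# With 20 reads/s against 15 RCUs, a full 3,000-token bucket drains at
# 5 tokens/s instead of 10, so the burst lasts 10 minutes instead of 5
burst_seconds = 3000 / (20 - hot_max)        # 600 seconds
print(hot_max, borrowed, idle_spare, burst_seconds)
```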
<h3 id="heading-scaling-in-provisioned-mode">Scaling in Provisioned Mode</h3>
<p>This is the real scaling. DynamoDB tables continuously send metrics to CloudWatch, CloudWatch triggers alarms when those metrics cross a certain threshold, DynamoDB gets notified about that and modifies Capacity Units accordingly.</p>
<p>On DynamoDB you enable Auto Scaling, set a minimum and maximum capacity units, and set a target utilization (%). You can enable scaling separately for Reads and Writes.</p>
<p>In the table metrics (handled by CloudWatch) you can view provisioned and consumed capacity, and throttled request count.</p>
<p>Here's the problem though: Auto Scaling is based on CloudWatch Alarms that trigger when the metric is above/below the threshold in at least 3 data points over <strong>5 minutes</strong>. So not only does Auto Scaling not respond fast enough for sudden spikes, it doesn't respond at all if the spikes last less than 5 minutes. That's why the default target is 70%, and allowed values are between 20% and 90%: <strong>You need to leave some margin for traffic to continue growing while Auto Scaling takes its sweet time to figure out it should scale.</strong></p>
<p>Luckily, we have Burst Capacity and Adaptive Capacity to deal with those infrequent spikes, and retries can help you eventually serve the requests that were initially throttled. You probably can't retry your way through an Auto Scaling event (imagine waiting 5 minutes for a request…), but retries can give Burst Capacity the few seconds it needs to kick in. Adaptive Capacity adjusts slower, and it's intended to fix uneven traffic across partitions, so don't count on it.</p>
<p>Now, this is all looking a lot like EC2 instances in an Auto Scaling Group, right? And we're seeing the same problems: <strong>We need to keep some extra capacity provisioned, as a buffer for traffic spikes. Even then, if a spike is big and fast enough, we can't respond to it!</strong> (except for some bursting). Why do we have these problems, if DynamoDB is supposed to be serverless? Well, it is serverless: you don't manage servers, but they're still there. What did you expect, magic? Well, sufficiently advanced science is indistinguishable from magic. Let's see if DynamoDB's other mode is close enough to serverless magic.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h2 id="heading-dynamodb-on-demand-mode">DynamoDB On-Demand Mode</h2>
<p><strong>Welcome to the real serverless mode of DynamoDB!</strong> With On-Demand mode, you <em>don't need to worry about scaling your DynamoDB table, it happens automatically</em>. Wait, you really believed that? Of course you need to worry! But it's much simpler to understand and manage, and it results in far fewer throttled requests.</p>
<h3 id="heading-capacity-in-on-demand-mode">Capacity in On-Demand Mode</h3>
<p>The cost of reads and writes stays the same: a read operation consumes 1 Read Request Unit (RRU) for every 4 KB read (half that if it's eventually consistent), and a write operation consumes 1 Write Request Unit (WRU) for every 1 KB written. Twice for transactions, LSIs increase it, yadda yadda. Same as for Provisioned Mode, we just swapped Capacity Units for Request Units.</p>
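<p>As a sketch of that billing math (a simplified calculator; it ignores secondary indexes):</p>

```python
import math

def read_request_units(item_kb, strongly_consistent=True, transactional=False):
    units = math.ceil(item_kb / 4)             # 1 RRU per 4 KB read, rounded up
    if transactional:
        return units * 2                       # transactional reads cost double
    if not strongly_consistent:
        return units * 0.5                     # eventually consistent reads cost half
    return units

def write_request_units(item_kb, transactional=False):
    units = math.ceil(item_kb / 1)             # 1 WRU per 1 KB written, rounded up
    return units * 2 if transactional else units

print(read_request_units(4))                             # 1
print(read_request_units(4, strongly_consistent=False))  # 0.5
print(write_request_units(3, transactional=True))        # 6
```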
<p>Here's the difference: There is no capacity for you to set. You're billed for every actual operation, and DynamoDB manages capacity automatically and transparently. However, the table still has a set capacity and it still scales, and understanding how is important.</p>
<h3 id="heading-scaling-in-on-demand-mode">Scaling in On-Demand Mode</h3>
<p>Every newly created table in On-Demand mode starts with 4,000 WCUs and 12,000 RCUs (yeah, that's a lot). You're not billed for those capacity units though; you'll only be billed for actual operations.</p>
<p>Every time your peak usage goes over 50% of the currently assigned capacity, DynamoDB increases the capacity of that table to double your peak. So, suppose you used 5,000 WRUs: now your table's WCUs are 10,000. This growth has a cooldown period of 30 minutes, meaning it won't happen again until 30 minutes after the last increase.</p>
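<p>That behavior can be modeled roughly like this (my own simplified sketch of what's described above; none of this is an official algorithm):</p>

```python
def on_demand_capacity(peaks, initial=4000, cooldown_minutes=30):
    """Walk through (minute, peak_wru) observations and track the table's WCUs.

    Whenever a peak exceeds 50% of current capacity, capacity jumps to double
    that peak, but not again until the cooldown has elapsed.
    """
    capacity = initial
    last_increase = None
    for minute, peak in peaks:
        off_cooldown = last_increase is None or minute - last_increase >= cooldown_minutes
        if peak > capacity * 0.5 and off_cooldown:
            capacity = peak * 2
            last_increase = minute
    return capacity

# A 5,000 WRU peak against the initial 4,000 WCUs doubles capacity to 10,000.
print(on_demand_capacity([(0, 5000)]))               # 10000
# A second spike 10 minutes later is still inside the cooldown: no change.
print(on_demand_capacity([(0, 5000), (10, 6000)]))   # 10000
```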
<p>This isn't documented anywhere, and I haven't managed to get official confirmation, but apparently capacity for On-Demand tables never decreases. This seems consistent with how DynamoDB works under the hood: partitions are split in two and assigned to new nodes, with each node having a maximum capacity of 3,000 RCUs and 1,000 WCUs. Apparently partitions are never re-combined, so there's no reason to think capacity for On-Demand tables would decrease. Again, this is a common assumption, not something AWS has published.</p>
<h3 id="heading-switching-from-provisioned-mode-to-on-demand-mode">Switching from Provisioned Mode to On-Demand Mode</h3>
<p>You can switch modes in either direction, but you can only do so once every 24 hours. If you switch from Provisioned Mode to On-Demand Mode, the table's initial RCUs are the maximum of 12,000, your current RCUs, and double the units of your highest previous peak. Same for WCUs: the maximum of 4,000, your current WCUs, and double the highest peak.</p>
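<p>That rule is short enough to write down (a sketch of the rule as stated above):</p>

```python
def initial_on_demand_rcus(current_rcus, highest_peak_rcus):
    # Max of the 12,000 default, the current provisioned RCUs,
    # and twice the highest previous peak.
    return max(12_000, current_rcus, 2 * highest_peak_rcus)

print(initial_on_demand_rcus(3_000, 2_500))    # 12000: the default dominates
print(initial_on_demand_rcus(20_000, 9_000))   # 20000: current provisioned capacity wins
print(initial_on_demand_rcus(5_000, 15_000))   # 30000: double the highest peak wins
```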
<p>If you switch from On-Demand mode to Provisioned Mode, you need to set up your capacity or auto scaling manually.</p>
<p>In either case the switch takes up to 30 minutes, during which the table continues to function like before the switch.</p>
<h2 id="heading-provisioned-vs-on-demand-pricing-comparison">Provisioned vs On-Demand - Pricing Comparison</h2>
<p>In Provisioned Mode, like with anything provisioned, you're billed per provisioned capacity, regardless of how much you actually consume. The price is $0.00065/hour per WCU, and $0.00013/hour per RCU.</p>
<p>In On-Demand Mode you're only billed for Request Units (which is basically Capacity Units that were actually consumed). The price is $1.25 per million WRUs and $0.25 per million RRUs.</p>
<p>Let's consider some scenarios. Assume all reads are strongly consistent and read 4 KB of data, and all writes are outside transactions and for 1 KB of data. Also, there are no secondary indexes. Suppose you have the following traffic pattern:</p>
<ul>
<li><p>Between 2000 and 3000 (average 2500) reads per second during 8 hours of the day (business hours).</p>
</li>
<li><p>Between 400 and 600 (average 500) reads per second during 16 hours of the day (off hours).</p>
</li>
<li><p>Between 200 and 300 (average 250) writes per second during 8 hours of the day (business hours).</p>
</li>
<li><p>Between 50 and 150 (average 100) writes per second during 16 hours of the day (off hours).</p>
</li>
</ul>
<p>With <strong>Provisioned Mode, no Auto Scaling</strong>, we'll need to set 3000 RCUs and 300 WCUs. The price would be $0.39 per hour for reads and $0.195 per hour for writes, for a total of $280.80 + $140.40 = <strong>$421.20 per month</strong>.</p>
<p>With <strong>Provisioned Mode, Auto Scaling</strong> set for a minimum of 400 and a maximum of 3000 RCUs, and a minimum of 50 and a maximum of 300 WCUs, we'll get the following (simplifying by assuming provisioned capacity tracks average consumption exactly; with a 70% utilization target the real bill would be somewhat higher):<br />For business hours: We'll use our average of 2500 reads, so we get $0.325 per hour for reads, and $78/month for reads during business hours. For writes, using our average of 250, we get $0.1625/hour and $39/month.<br />For off hours: Using the average values, $0.065/hour and $31.20/month for reads, and $0.065/hour and $31.20/month for writes.<br />In total, we get <strong>$179.40 per month</strong>.</p>
<p>With <strong>On-Demand Mode</strong>, we'll just use the averages. With 2500 reads per second we have 9,000,000 reads per hour during business hours, which costs us $2.25/hour, or $540/month. We have an average of 250 writes per second, so 900,000 writes per hour, which costs $1.125/hour or $270/month.<br />On off hours we have 1,800,000 reads per hour, for $0.45/hour and $216/month. Writes are 360,000/hour, at $0.45/hour and $216/month.<br />Our grand total is $540 + $270 + $216 + $216 = <strong>$1,242/month</strong>.</p>
<p><em>Note: These prices are only for reads and writes. Storage is priced separately, and so are other features like backups.</em></p>
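<p>The three scenarios can be reproduced with straightforward arithmetic (a 30-day month, with the Auto Scaling bill simplified to exact average consumption):</p>

```python
WCU_HOUR, RCU_HOUR = 0.00065, 0.00013          # provisioned, per capacity unit per hour
WRU_PRICE, RRU_PRICE = 1.25 / 1e6, 0.25 / 1e6  # on-demand, per request unit

BUSINESS_HOURS, OFF_HOURS = 8 * 30, 16 * 30    # hours per month

# Provisioned, no Auto Scaling: pay for peak capacity (3000 RCUs / 300 WCUs) all month.
no_as = (3000 * RCU_HOUR + 300 * WCU_HOUR) * (BUSINESS_HOURS + OFF_HOURS)

# Provisioned with Auto Scaling: approximate the bill with average consumption.
with_as = ((2500 * RCU_HOUR + 250 * WCU_HOUR) * BUSINESS_HOURS
           + (500 * RCU_HOUR + 100 * WCU_HOUR) * OFF_HOURS)

# On-Demand: pay per request; requests per hour = average per second * 3600.
on_demand = ((2500 * 3600 * RRU_PRICE + 250 * 3600 * WRU_PRICE) * BUSINESS_HOURS
             + (500 * 3600 * RRU_PRICE + 100 * 3600 * WRU_PRICE) * OFF_HOURS)

print(round(no_as, 2), round(with_as, 2), round(on_demand, 2))  # 421.2 179.4 1242.0
```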
<h2 id="heading-best-practices-for-scaling-dynamodb">Best Practices for Scaling DynamoDB</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><p><strong>Monitor Throttling Metrics:</strong> Keep an eye on the ReadThrottleEvents and WriteThrottleEvents metrics in CloudWatch. Compare them with your app's latency metrics, to determine how much this is impacting your app.</p>
</li>
<li><p><strong>Audit Tables Regularly:</strong> Review your DynamoDB tables to make sure that they're performing and scaling well. This includes reviewing capacity settings, and reviewing indices and keys.</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><strong>Enable Point-In-Time Recovery:</strong> Data corruption doesn't happen, until it happens. Enabling Point-In-Time Recovery allows you to restore a table to a specific state if needed.</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><strong>Pre-Warm On-Demand Tables:</strong> When expecting a big increase in traffic (like a product launch), if you're using an On-Demand table, make sure to pre-warm it. You can do this by switching it to Provisioned Mode, setting its capacity to a large number and keeping it there for a few minutes, and then switching it back to On-Demand so the On-Demand capacity matches the capacity the table had in Provisioned Mode. Remember that you can only switch once every 24 hours.</li>
</ul>
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><p><strong>Use Auto-Scaling:</strong> This one's quite obvious, I hope. But the point is that there's no reason not to use this. Sure, sometimes it isn't fast enough, but in those cases Provisioned Mode without Auto Scaling won't work well either.</p>
</li>
<li><p><strong>Choose the Right Partition Key:</strong> Remember what I said about capacity units being split across partitions? Well, if you pick a PK that doesn't distribute traffic uniformly (or as uniformly as possible), you're going to have a problem called hot partition. This is part of <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-database-design">DynamoDB Database Design</a>, but as you saw, it affects performance and scaling.</p>
</li>
</ul>
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><p><strong>Pick The Right Mode:</strong> You saw the numbers in the example. On-Demand will very rarely result in throttling, but it is expensive. Only use it for traffic that spikes faster than the ~5 minutes Provisioned Mode with Auto Scaling needs to react.</p>
</li>
<li><p><strong>Monitor and Adjust Provisioned Capacity:</strong> Regularly review your capacity settings and adjust them. Traffic patterns change over time!</p>
</li>
<li><p><strong>Use Reserved Capacity:</strong> If you have a consistent and predictable workload (like in the pricing example), consider purchasing reserved capacity for DynamoDB. It works similar to Reserved Instances: You reserve it and commit to a year or 3, for a lower price.</p>
</li>
</ul>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Securing the Connection to S3 from EC2]]></title><description><![CDATA[You've deployed your app on an EC2 instance, and there's a file in an S3 bucket that you need to access from the app. You created a public S3 bucket and uploaded the file, and it works! But then you read somewhere that keeping your private files in a...]]></description><link>https://blog.guilleojeda.com/ec2-s3-vpc-endpoint-security</link><guid isPermaLink="true">https://blog.guilleojeda.com/ec2-s3-vpc-endpoint-security</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 07 Sep 2023 15:02:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1694047210968/862e5fed-0cdb-405a-a03d-90d79fc443ee.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You've deployed your app on an EC2 instance, and there's a file in an S3 bucket that you need to access from the app. You created a public S3 bucket and uploaded the file, and it works! But then you read somewhere that keeping your private files in a public S3 bucket is a bad idea, so you set out to fix it.</p>
<h2 id="heading-set-up-a-restrictive-bucket-policy-and-add-a-vpc-endpoint-with-an-endpoint-policy">Set up a restrictive bucket policy and add a VPC endpoint with an Endpoint Policy</h2>
<p><a target="_blank" href="https://github.com/guilleojeda/simpleaws/tree/main/Issue%2337-VPCEndpoints"><strong>Here's the initial setup</strong></a>, and you can deploy it here:</p>
<p><a target="_blank" href="https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://simpleaws-public-cfn-templates.s3.amazonaws.com/Issue37/initial-setup.yml&amp;stackName=SimpleAWS37">Deploy initial setup</a></p>
<p>This is what it looks like before the solution:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694047259763/6c26e547-f713-4a9c-986a-58141b1eff15.png" alt="what it looks like before the solution" class="image--center mx-auto" /></p>
<p>This is what it looks like with the solution:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694047244269/f3ca8315-02e8-4ecc-93ad-0fd36e2f99cb.png" alt="what it looks like with the solution" class="image--center mx-auto" /></p>
<h2 id="heading-step-by-step-instructions-to-secure-the-connection-from-ec2-to-s3">Step by step instructions to secure the connection from EC2 to S3</h2>
<h3 id="heading-step-0-test-that-the-connection-is-working">Step 0: Test that the connection is working</h3>
<ul>
<li><p>Open the CloudFormation console</p>
</li>
<li><p>Select the initial state stack</p>
</li>
<li><p>Click the Outputs tab</p>
</li>
<li><p>Copy the value for EC2InstancePublicIp</p>
</li>
<li><p>Paste it in the browser, append :3000 and hit Enter/Return</p>
</li>
</ul>
<h3 id="heading-step-1-create-a-vpc-endpoint">Step 1: Create a VPC Endpoint</h3>
<ul>
<li><p>Go to the VPC console</p>
</li>
<li><p>In the panel on the left, click Endpoints</p>
</li>
<li><p>Click Create Endpoint</p>
</li>
<li><p>Enter a name</p>
</li>
<li><p>In the Services section, enter S3 in the search box, and select the one that says 'com.amazonaws.your_region.s3' (replace 'your_region' with the region where you deployed the initial setup, which is where the S3 bucket is). Then select the one that says Interface in the Type column.</p>
</li>
</ul>
<ul>
<li><p>For VPC, select SimpleAWSVPC from the dropdown list</p>
</li>
<li><p>Under Subnets, select two Availability Zones (us-east-1a and us-east-1b if you deployed in us-east-1), and for each click the dropdown and select the only available subnet</p>
</li>
<li><p>Under Security groups, select the one called VPCEndpointSecurityGroup</p>
</li>
<li><p>Under Policy, pick Full Access for now (we'll change that in Step 2).</p>
</li>
<li><p>Open Additional settings</p>
</li>
<li><p>Check Enable DNS name</p>
</li>
<li><p>Uncheck Enable private DNS only for inbound endpoint</p>
</li>
<li><p>Click Create endpoint</p>
</li>
</ul>
<h3 id="heading-step-2-configure-the-vpc-endpoint-policy">Step 2: Configure the VPC Endpoint Policy</h3>
<ul>
<li><p>In the Amazon VPC console, go to Endpoints</p>
</li>
<li><p>Select the Endpoint you just created</p>
</li>
<li><p>Click the Policy tab</p>
</li>
<li><p>Click Edit Policy</p>
</li>
<li><p>Modify the following JSON by replacing the placeholder values REPLACE_BUCKET_NAME and REPLACE_VPC_ID with the name of your S3 bucket and the ID of SimpleAWSVPC. Then paste it into the Edit Policy page, and click Save.</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Sid"</span>: <span class="hljs-string">"AllowAccessToSpecificBucket"</span>,
            <span class="hljs-attr">"Principal"</span>: <span class="hljs-string">"*"</span>,
            <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"s3:*"</span>,
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Resource"</span>: [
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME"</span>,
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME/*"</span>
            ],
            <span class="hljs-attr">"Condition"</span>: {
                <span class="hljs-attr">"StringEquals"</span>: {
                    <span class="hljs-attr">"aws:sourceVpc"</span>: <span class="hljs-string">"REPLACE_VPC_ID"</span>
                }
            }
        }
    ]
}
</code></pre>
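<p>If you'd rather script this than edit the policy by hand, the placeholders can be filled in programmatically (a sketch; the bucket name and VPC ID below are made-up examples):</p>

```python
import json

POLICY_TEMPLATE = """{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAccessToSpecificBucket",
        "Principal": "*",
        "Action": "s3:*",
        "Effect": "Allow",
        "Resource": [
            "arn:aws:s3:::REPLACE_BUCKET_NAME",
            "arn:aws:s3:::REPLACE_BUCKET_NAME/*"
        ],
        "Condition": {"StringEquals": {"aws:sourceVpc": "REPLACE_VPC_ID"}}
    }]
}"""

def render_policy(bucket_name, vpc_id):
    raw = (POLICY_TEMPLATE
           .replace("REPLACE_BUCKET_NAME", bucket_name)
           .replace("REPLACE_VPC_ID", vpc_id))
    return json.loads(raw)  # fails loudly if the result isn't valid JSON

policy = render_policy("my-example-bucket", "vpc-0123456789abcdef0")
print(policy["Statement"][0]["Resource"][0])  # arn:aws:s3:::my-example-bucket
```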
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h3 id="heading-step-3-set-up-a-more-restrictive-bucket-policy">Step 3: Set up a more restrictive bucket policy</h3>
<ul>
<li><p>Open the S3 console</p>
</li>
<li><p>Click on the bucket that you created with the initial setup</p>
</li>
<li><p>Click on the Permissions tab</p>
</li>
<li><p>Scroll down to Bucket Policy and click Edit</p>
</li>
<li><p>Paste the following policy, replacing the placeholders REPLACE_BUCKET_NAME and REPLACE_VPC_ENDPOINT_ID with their values (REPLACE_VPC_ENDPOINT_ID is not the same as REPLACE_VPC_ID from the previous step). Then click Save changes</p>
</li>
</ul>
<pre><code class="lang-json">{
    <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
    <span class="hljs-attr">"Id"</span>: <span class="hljs-string">"Policy1415115909153"</span>,
    <span class="hljs-attr">"Statement"</span>: [
        {
            <span class="hljs-attr">"Sid"</span>: <span class="hljs-string">"Access-only-from-SimpleAWSVPC"</span>,
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Deny"</span>,
            <span class="hljs-attr">"Principal"</span>: <span class="hljs-string">"*"</span>,
            <span class="hljs-attr">"Action"</span>: [
                <span class="hljs-string">"s3:PutObject"</span>,
                <span class="hljs-string">"s3:GetObject"</span>
            ],
            <span class="hljs-attr">"Resource"</span>: [
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME"</span>,
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME/*"</span>
            ],
            <span class="hljs-attr">"Condition"</span>: {
                <span class="hljs-attr">"StringNotEquals"</span>: {
                    <span class="hljs-attr">"aws:SourceVpce"</span>: <span class="hljs-string">"REPLACE_VPC_ENDPOINT_ID"</span>
                }
            }
        },
        {
            <span class="hljs-attr">"Sid"</span>: <span class="hljs-string">"Access-from-everywhere"</span>,
            <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
            <span class="hljs-attr">"Principal"</span>: <span class="hljs-string">"*"</span>,
            <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"s3:*"</span>,
            <span class="hljs-attr">"Resource"</span>: [
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME"</span>,
                <span class="hljs-string">"arn:aws:s3:::REPLACE_BUCKET_NAME/*"</span>
            ]
        }
    ]
}
</code></pre>
<h3 id="heading-step-4-test-that-the-connection-is-still-working">Step 4: Test that the connection is still working</h3>
<ul>
<li>Go back to the browser tab where you pasted the public IP address of the instance and refresh the page</li>
</ul>
<h3 id="heading-step-5-empty-the-s3-bucket">Step 5: Empty the S3 bucket</h3>
<p>Before deleting the CloudFormation stack, you'll need to empty the S3 bucket! The Node.js app puts a file in there.</p>
<h2 id="heading-how-does-this-solution-make-the-connection-from-ec2-to-s3-more-secure">How does this solution make the connection from EC2 to S3 more secure?</h2>
<h3 id="heading-vpc-endpoints">VPC Endpoints</h3>
<p>First of all, you'll notice that a VPC Endpoint is for one specific service, S3 in this case. If you wanted to connect to other services you'd need to create a separate VPC Endpoint for each different service.</p>
<p>The second thing you'll notice is that there are 2 types of endpoints: Interface and Gateway. Gateway endpoints are only for S3 and DynamoDB, while Interface endpoints are for nearly everything. Gateway endpoints are simpler, so use them when you can (except if you're writing a newsletter and want to show a few things about Interface endpoints).</p>
<p>Interface endpoints work by creating an Elastic Network Interface (ENI) in every subnet where you deploy them, and automatically routing traffic addressed to the service's public endpoint to those ENIs. That way, you don't need to make any changes to your code. This only works if you check Enable DNS name.</p>
<h3 id="heading-vpc-endpoint-policies">VPC Endpoint Policies</h3>
<p>The existing policy is a Full Access policy, which is the default policy when a VPC endpoint is created. It allows all actions on the S3 service from anyone.</p>
<p>Instead of that, we're setting up a more restrictive policy, which only allows access to our specific bucket, and denies access to all other buckets.</p>
<p>VPC Endpoint policies are IAM resource policies, and as such, anything that's not explicitly allowed is implicitly denied.</p>
<h3 id="heading-restrictive-s3-bucket-policies">Restrictive S3 bucket policies</h3>
<p>Bucket policies are another type of IAM resource policies. Obviously, this bucket policy will only apply to our S3 bucket. It's important to add it because, while we've restricted what the VPC Endpoint can be used for, the S3 bucket can still be accessed from outside the VPC (e.g. from the public internet). This bucket policy is the one that's going to prevent that, restricting access to only from the VPC Endpoint.</p>
<h2 id="heading-discussing-connection-security-to-s3">Discussing Connection Security to S3</h2>
<p>In this case I kept internet access for the VPC and for the EC2 instance itself, just to make it easier to trigger the code with an HTTP request. This solution is a good idea in these cases because traffic to S3 doesn't go over the public internet, but admittedly, the public internet is a viable alternative.</p>
<p>Where this solution matters more is when you don't have access to the internet. Sure, adding it is rather simple, but you're either exposing yourself unnecessarily by giving your instances a public IP address they don't need, or you're paying for a NAT Gateway. In those cases, VPC Endpoints are a much simpler, safer and cheaper solution.</p>
<p>Conceptually, you can think of this as giving the S3 service a private IP address inside your VPC. In reality, what you're doing is creating a private IP address in your VPC that leads to the S3 service, so that conception is pretty accurate! Behind the scenes (and you can see this easily), the VPC service creates an Elastic Network Interface (ENI) in every subnet where you deploy the VPC Endpoint. Those ENIs will forward the traffic to the S3 service endpoints that are private to the AWS network.</p>
<p>Also behind the scenes there's a <a target="_blank" href="https://newsletter.simpleaws.dev/p/route-53-private-hosted-zone-dns-endpoint?utm_source=blog&amp;utm_medium=hashnode">Route 53 Private Hosted Zone</a>, managed by AWS and hidden from users, which resolves the S3 address to the private IPs of those ENIs instead of to the public IPs of the public endpoints. That's why you don't need to change the code: your code depends on the address of the S3 service, and that private hosted zone takes care of resolving it to a different address.</p>
<h2 id="heading-best-practices-for-s3-security">Best Practices for S3 Security</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><strong>Monitor and Alert Endpoint Health:</strong> Monitor the health of your VPC endpoints using CloudWatch metrics. Any unusual activity or degradation in performance should trigger alerts. This could also help you detect a security incident!</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p><strong>Least Privilege Access to Bucket:</strong> This is basically what we did in Step 3: We disabled public access, and implemented a policy that only allows reads from the VPC. Try reading from that S3 bucket from your own computer: <code>aws s3api get-object --bucket 12ewqaewr2qqq --key thankyou.txt thankyou.txt --region us-east-1</code></p>
</li>
<li><p><strong>Regularly Audit IAM Policies:</strong> Regularly review and tighten your IAM policies. Not only for the VPC Endpoint and S3 bucket, but also for the EC2 instance!</p>
</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><strong>Use Multiple Subnets in Different AZs:</strong> Each subnet gets one ENI, so if you distribute your subnets in several AZs, your VPC Endpoint is highly available within the region (i.e., it can continue functioning if an Availability Zone fails).</li>
</ul>
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><strong>Choose the Right VPC Endpoint Type:</strong> Choose the right type of VPC Endpoint based on your workload. For S3, a Gateway Endpoint works best. I'll leave it to you to figure out how to create it (=.</li>
</ul>
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><strong>Delete Unused VPC Endpoints:</strong> Regularly delete any unused VPC endpoints to avoid paying for stuff you don't use.</li>
</ul>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Using SQS to Throttle Writes to DynamoDB]]></title><description><![CDATA[We're running an e-commerce platform, where people publish products and other people purchase those products. Our backend has some highly scalable microservices running on well-designed Lambdas, and there's a lot of caching involved. Our order proces...]]></description><link>https://blog.guilleojeda.com/sqs-throttle-database-writes-dynamodb</link><guid isPermaLink="true">https://blog.guilleojeda.com/sqs-throttle-database-writes-dynamodb</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[DynamoDB]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Thu, 07 Sep 2023 00:12:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1694045404800/89fb9540-0651-4a0a-8515-70d68d1eaacd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We're running an e-commerce platform, where people publish products and other people purchase those products. Our backend has some highly scalable <a target="_blank" href="https://newsletter.simpleaws.dev/p/microservices-design?utm_source=blog&amp;utm_medium=hashnode">microservices</a> running on <a target="_blank" href="https://newsletter.simpleaws.dev/p/simple-aws-20-advanced-tips-lambda?utm_source=blog&amp;utm_medium=hashnode">well-designed Lambdas</a>, and there's a lot of caching involved. Our order processing microservice writes to a DynamoDB table we set up following <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-database-design?utm_source=blog&amp;utm_medium=hashnode">How to Design a DynamoDB Database</a>. We're using <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-scaling-provisioned-on-demand?utm_source=blog&amp;utm_medium=hashnode">DynamoDB provisioned capacity mode with auto scaling</a>. 
We did a great job and everything runs smoothly.</p>
<p>Suddenly, someone's product goes viral, and a lot of people rush in to buy it at the same time. Our cache and CDN don't even blink at the traffic, our well-designed Lambdas scale amazingly fast, but our DynamoDB table is suddenly bombarded with writes and the auto scaling can't keep up. Our order processing Lambda receives ProvisionedThroughputExceededException, and when it retries it just makes everything worse. Things crash. Sales are lost. We eventually recover, but those customers are gone. How do we make sure it doesn't happen again?</p>
<p>Option 1 is to change the DynamoDB table to On-demand, which can keep up with Lambda when scaling, but it's over 5x more expensive. Option 2 is to make sure the table's write capacity isn't exceeded. Let's explore option 2.</p>
<p><strong>AWS Services involved:</strong></p>
<ul>
<li><p><strong>DynamoDB:</strong> Our database. All you need to know for this post is <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-scaling-provisioned-on-demand?utm_source=blog&amp;utm_medium=hashnode">how DynamoDB scales</a>.</p>
</li>
<li><p><strong>SQS:</strong> A fully managed message queuing service that lets you decouple components. Producers like our order processing microservice post messages to the queue, the queue stores them until they're read, and consumers read from the queue at their own pace.</p>
</li>
<li><p><strong>SES:</strong> An email platform, more similar to services like MailChimp than to an AWS service. If you're already on AWS and you just need to send emails programmatically, it's easy to set up. If you're not on AWS, need more control, or need to send so many emails that price is a factor, you'll need to do some research. For this post, SES is good enough.</p>
</li>
</ul>
<h2 id="heading-what-is-amazon-sqs">What is Amazon SQS</h2>
<p>SQS is a fully managed message queuing service. A message queue is a data structure where items are read in the same order they were written: First-In, First-Out (FIFO).</p>
<p>Queues allow us to decouple components by making the consumer (the component that reads from the queue) unaware of who wrote the item (the writer is called the producer). In software architecture we usually care about another characteristic of queues: the read can happen some time after the write. This decouples producers and consumers in time: consumers don't need to be available when producers write to the queue. The queue stores the messages for a certain amount of time, and when consumers are ready, they <em>poll</em> the queue for messages and receive the oldest one.</p>
<p>For our solution, we're going to use a queue so that our order processing microservice can send a message with the order, the queue stores the message, and a consumer can read it at its own rhythm (i.e. at our DynamoDB table's rhythm).</p>
<h2 id="heading-types-of-sqs-queues">Types of SQS Queues</h2>
<p>There are two types of queues in SQS:</p>
<ul>
<li><p><strong>Standard</strong> queues are the default type of queue. They're cheaper than FIFO queues and nearly-infinitely scalable. The tradeoff is that they only guarantee <strong>at-least-once delivery</strong> (meaning you might get duplicates), and order of the messages is mostly respected but not guaranteed.</p>
</li>
<li><p><strong>FIFO</strong> queues are more expensive than Standard queues, and they don't scale infinitely, but they guarantee ordered, <strong>exactly-once delivery</strong>. You need to set the <code>MessageGroupId</code> property in the message, since FIFO queues only deliver the next message in a MessageGroup after the previous message has been successfully processed. For example, if you set the value of <code>MessageGroupId</code> to the customer ID and a customer makes two orders at the same time, the second one to come in won't be processed until the first one is finished processing. It's also important to set <code>MessageDeduplicationId</code>, to ensure that if the message gets duplicated upstream, it will be deduplicated at the queue. A FIFO queue will only keep one message per unique value of MessageDeduplicationId.</p>
</li>
</ul>
<p>When most people think of queues, they picture guaranteed FIFO order and exactly-once delivery. The only way to actually get those guarantees in SQS is with FIFO queues.</p>
<h2 id="heading-how-to-implement-an-sqs-queue-for-dynamodb">How to Implement an SQS Queue for DynamoDB</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694045866106/e529a08d-c4ac-4e0a-8aa1-2106c92e3104.png" alt="Diagram of a request before and after implementing the solution" class="image--center mx-auto" /></p>
<p>Follow these step by step instructions to implement an SQS Queue to throttle writes to a DynamoDB table. Replace <code>YOUR_ACCOUNT_ID</code> and <code>YOUR_REGION</code> with the appropriate values for your account and region.</p>
<h3 id="heading-create-the-orders-queue">Create the Orders Queue</h3>
<ol>
<li><p>Go to the SQS console.</p>
</li>
<li><p>Click "Create queue"</p>
</li>
<li><p>Choose the "FIFO" queue type (not the default Standard)</p>
</li>
<li><p>In the "Queue name" field enter "OrdersQueue.fifo" (FIFO queue names must end in ".fifo")</p>
</li>
<li><p>Leave the rest as default</p>
</li>
<li><p>Click on "Create queue"</p>
</li>
</ol>
<h3 id="heading-update-the-orders-service-to-write-to-the-sqs-queue">Update the Orders service to write to the SQS queue</h3>
<p>We need to update the code of the Orders service so that it sends the new Order to the Orders Queue, instead of writing to the Orders table. This is what the code looks like:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> AWS = <span class="hljs-built_in">require</span>(<span class="hljs-string">'aws-sdk'</span>);
<span class="hljs-keyword">const</span> sqs = <span class="hljs-keyword">new</span> AWS.SQS();
<span class="hljs-keyword">const</span> queueUrl = <span class="hljs-string">'https://sqs.YOUR_REGION.amazonaws.com/YOUR_ACCOUNT_ID/OrdersQueue.fifo'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">processOrder</span>(<span class="hljs-params">order</span>) </span>{
  <span class="hljs-keyword">const</span> params = {
    <span class="hljs-attr">MessageBody</span>: <span class="hljs-built_in">JSON</span>.stringify(order),
    <span class="hljs-attr">QueueUrl</span>: queueUrl,
    <span class="hljs-attr">MessageGroupId</span>: order.customerId,
    <span class="hljs-attr">MessageDeduplicationId</span>: order.orderId
  };

  <span class="hljs-keyword">try</span> {
    <span class="hljs-keyword">const</span> result = <span class="hljs-keyword">await</span> sqs.sendMessage(params).promise();
    <span class="hljs-built_in">console</span>.log(<span class="hljs-string">'Order sent to SQS:'</span>, result.MessageId);
  } <span class="hljs-keyword">catch</span> (error) {
    <span class="hljs-built_in">console</span>.error(<span class="hljs-string">'Error sending order to SQS:'</span>, error);
    <span class="hljs-keyword">throw</span> error; <span class="hljs-comment">// rethrow so the caller can surface the failure instead of silently dropping the order</span>
  }
}
</code></pre>
<p>Also, add this policy to the IAM Role of the function, so it can access SQS. Don't forget to delete the permissions to access DynamoDB!</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"sqs:SendMessage"</span>,
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue.fifo"</span>
    }
  ]
}
</code></pre>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h3 id="heading-set-up-ses-to-notify-the-customer-via-email">Set up SES to notify the customer via email</h3>
<ol>
<li><p>Open the SES console</p>
</li>
<li><p>Click on "Domains" in the left navigation pane</p>
</li>
<li><p>Click "Verify a new domain"</p>
</li>
<li><p>Follow the on-screen instructions to add the required DNS records for your domain.</p>
</li>
<li><p>Alternatively, click on "Email Addresses" and then click the "Verify a new email address" button. Enter the email address you want to verify and click "Verify This Email Address". Check your inbox and click the link.</p>
</li>
</ol>
<h3 id="heading-set-up-the-order-processing-service">Set up the Order Processing service</h3>
<p>Go to the Lambda console and create a new Lambda function. Add the following code:</p>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> AWS = <span class="hljs-built_in">require</span>(<span class="hljs-string">'aws-sdk'</span>);
<span class="hljs-keyword">const</span> dynamoDB = <span class="hljs-keyword">new</span> AWS.DynamoDB.DocumentClient();
<span class="hljs-keyword">const</span> ses = <span class="hljs-keyword">new</span> AWS.SES();

<span class="hljs-built_in">exports</span>.handler = <span class="hljs-keyword">async</span> (event) =&gt; {
    <span class="hljs-keyword">for</span> (<span class="hljs-keyword">const</span> record <span class="hljs-keyword">of</span> event.Records) {
        <span class="hljs-keyword">const</span> order = <span class="hljs-built_in">JSON</span>.parse(record.body);
        <span class="hljs-keyword">await</span> saveOrderToDynamoDB(order);
        <span class="hljs-keyword">await</span> sendEmailNotification(order);
    }
};

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">saveOrderToDynamoDB</span>(<span class="hljs-params">order</span>) </span>{
    <span class="hljs-keyword">const</span> params = {
        <span class="hljs-attr">TableName</span>: <span class="hljs-string">'Orders'</span>,
        <span class="hljs-attr">Item</span>: order
    };

    <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">await</span> dynamoDB.put(params).promise();
        <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Order saved: <span class="hljs-subst">${order.orderId}</span>`</span>);
    } <span class="hljs-keyword">catch</span> (error) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error saving order: <span class="hljs-subst">${order.orderId}</span>`</span>, error);
        <span class="hljs-keyword">throw</span> error; <span class="hljs-comment">// rethrow so SQS keeps the message visible and retries it</span>
    }
}

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">sendEmailNotification</span>(<span class="hljs-params">order</span>) </span>{
    <span class="hljs-keyword">const</span> emailParams = {
        <span class="hljs-attr">Source</span>: <span class="hljs-string">'you@simpleaws.dev'</span>,
        <span class="hljs-attr">Destination</span>: {
            <span class="hljs-attr">ToAddresses</span>: [order.customerEmail]
        },
        <span class="hljs-attr">Message</span>: {
            <span class="hljs-attr">Subject</span>: {
                <span class="hljs-attr">Data</span>: <span class="hljs-string">'Your order is ready'</span>
            },
            <span class="hljs-attr">Body</span>: {
                <span class="hljs-attr">Text</span>: {
                    <span class="hljs-attr">Data</span>: <span class="hljs-string">`Thank you for your order, <span class="hljs-subst">${order.customerName}</span>! Your order #<span class="hljs-subst">${order.orderId}</span> is now ready.`</span>
                }
            }
        }
    };

    <span class="hljs-keyword">try</span> {
        <span class="hljs-keyword">await</span> ses.sendEmail(emailParams).promise();
        <span class="hljs-built_in">console</span>.log(<span class="hljs-string">`Email sent: <span class="hljs-subst">${order.orderId}</span>`</span>);
    } <span class="hljs-keyword">catch</span> (error) {
        <span class="hljs-built_in">console</span>.error(<span class="hljs-string">`Error sending email for order: <span class="hljs-subst">${order.orderId}</span>`</span>, error);
    }
}
</code></pre>
<p>Also, add the following IAM Policy to the IAM Role of the function, so it can be triggered by SQS and access DynamoDB and SES:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"Version"</span>: <span class="hljs-string">"2012-10-17"</span>,
  <span class="hljs-attr">"Statement"</span>: [
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: [
        <span class="hljs-string">"sqs:ReceiveMessage"</span>,
        <span class="hljs-string">"sqs:DeleteMessage"</span>,
        <span class="hljs-string">"sqs:GetQueueAttributes"</span>
      ],
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersQueue.fifo"</span>
    },
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: [
        <span class="hljs-string">"dynamodb:PutItem"</span>,
        <span class="hljs-string">"dynamodb:UpdateItem"</span>,
        <span class="hljs-string">"dynamodb:DeleteItem"</span>
      ],
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"arn:aws:dynamodb:YOUR_REGION:YOUR_ACCOUNT_ID:table/Orders"</span>
    },
    {
      <span class="hljs-attr">"Effect"</span>: <span class="hljs-string">"Allow"</span>,
      <span class="hljs-attr">"Action"</span>: <span class="hljs-string">"ses:SendEmail"</span>,
      <span class="hljs-attr">"Resource"</span>: <span class="hljs-string">"*"</span>
    }
  ]
}
</code></pre>
<h3 id="heading-make-the-orders-queue-trigger-the-order-processing-service">Make the Orders Queue trigger the Order Processing service</h3>
<ol>
<li><p>In the Lambda console, go to the Order Processing lambda</p>
</li>
<li><p>In the "Function overview" section, click "Add trigger"</p>
</li>
<li><p>Click "Select a trigger" and choose "SQS"</p>
</li>
<li><p>Select the Orders Queue</p>
</li>
<li><p>Set Batch size to 1</p>
</li>
<li><p>Make sure that the "Enable trigger" checkbox is checked</p>
</li>
<li><p>Click "Add"</p>
</li>
</ol>
<h3 id="heading-limit-concurrent-executions-of-the-order-processing-lambda">Limit concurrent executions of the Order Processing Lambda</h3>
<ol>
<li><p>In the Lambda console, go to the Order Processing lambda</p>
</li>
<li><p>Scroll down to the "Concurrency" section</p>
</li>
<li><p>Click "Edit"</p>
</li>
<li><p>Select "Reserve concurrency" and set it to 10 (don't confuse it with "Provisioned concurrency", which pre-warms execution environments instead of capping them)</p>
</li>
<li><p>Click "Save"</p>
</li>
</ol>
<h2 id="heading-synchronous-and-asynchronous-workflows-with-sqs">Synchronous and Asynchronous Workflows with SQS</h2>
<p>Architecture-wise, there's one big change in our solution: We've made our workflow <strong>async</strong>! Let me bring the diagram here.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1694045884535/8f5cf7fd-f573-4636-8a23-675eddee4da0.png" alt="Diagram of a request before and after implementing the solution" class="image--center mx-auto" /></p>
<p>Before, our Orders service would return the result of the order. From the user's perspective, they <strong>wait until the order is processed</strong>, and they see the result on the website. From the system's perspective, we're constrained to either succeed or fail processing the order within API Gateway's 29-second timeout. In more practical terms, we're limited by what the user expects: we can't just show a "loading" icon for 29 seconds!</p>
<p>After the change, the website just shows something like "We're processing your order, we'll email you when it's ready". That sets a <strong>different expectation</strong> for the user. That matters for the system, because now our Lambda function could take 15 minutes without hitting the 29-second limit of API Gateway, and without the user getting angry. There's more: if the Order Processing lambda crashes mid-execution, the SQS queue makes the order available again as a message once the visibility timeout expires, and the Lambda service invokes our function again with the same order. When the maxReceiveCount limit is reached, the order can be sent to another queue called a dead-letter queue (DLQ), where we can store failed orders for future reference. We didn't set up a DLQ here, but it's easy enough, and for small and medium-sized systems you can set up SNS to email you and resolve the issue manually, since the volume shouldn't be particularly large.</p>
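<p>Setting up that DLQ takes a single redrive policy on the source queue. A sketch of the policy (the DLQ name and ARN are placeholders; note that the DLQ of a FIFO queue must itself be a FIFO queue):</p>

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:YOUR_REGION:YOUR_ACCOUNT_ID:OrdersDLQ.fifo",
  "maxReceiveCount": 5
}
```

<p>In the console this lives in the "Dead-letter queue" section of the queue settings; via the API it's the RedrivePolicy queue attribute.</p>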
<p>Once the order has gone through all the steps (maybe failed a few, retried, and eventually succeeded), we notify the user that their order is "ready". This can look different for different systems: some just say "we got the money", some ship physical products, some onboard the user to a complex SaaS. For this solution I chose email because it's easy and common enough, but you could use a webhook and still keep the process async.</p>
<h2 id="heading-best-practices-for-sqs-and-dynamodb">Best Practices for SQS and DynamoDB</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><p><strong>Monitor and set alarms:</strong> You know how to monitor Lambdas. You can monitor SQS queues as well! An interesting alarm to set here would be on the number of messages in the queue (the ApproximateNumberOfMessagesVisible metric), so our customers don't wait too long for their orders to be processed.</p>
</li>
<li><p><strong>Handle errors and retries:</strong> Be ready for anything to fail, and architect accordingly. Set up a DLQ, set up notifications (to you and to the user) for when things fail, and above all don't lose/corrupt data.</p>
</li>
<li><p><strong>Set up tracing:</strong> We're complicating things a bit (hopefully for a good reason). We can gain better visibility into that complexity by <a target="_blank" href="https://newsletter.simpleaws.dev/p/using-aws-xray-observability-eventdriven-architectures?utm_source=blog&amp;utm_medium=hashnode">setting up X-Ray</a>.</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p><strong>Check "Enable server-side encryption":</strong> That's all you need to do for an SQS queue to be encrypted at rest: check that box, and pick a KMS key. SQS communicates over HTTPS, so you already have encryption in transit.</p>
</li>
<li><p><strong>Tighten permissions:</strong> The IAM policies in this issue are pretty restrictive. But there's always a nut to tighten, so keep your eyes open.</p>
</li>
</ul>
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><p><strong>Set up maxReceiveCount and a DLQ:</strong> With a FIFO queue, the next message in a group won't be available for processing until the previous one is either processed successfully or dropped (to the DLQ, if you set one) after maxReceiveCount attempts. If you don't set these, one corrupted order will block your whole system.</p>
</li>
<li><p><strong>Set visibility timeout:</strong> This is how long SQS waits for a "success" response before assuming the message wasn't processed and making it available again to the next consumer. Set a reasonable value: for a Lambda consumer, AWS recommends a visibility timeout of at least 6 times your function's timeout, so in-flight messages don't reappear while a retry or batch is still running.</p>
</li>
</ul>
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><p><strong>Optimize Lambda function memory:</strong> More memory means more money. But it also means faster processing. Going from 30 to 25 seconds won't matter much for a successfully processed order, but if orders are retried 5 times, now it's 25 seconds we're gaining instead of 5. Could be worth it, depending on your customers' expectations.</p>
</li>
<li><p><strong>Use batch processing:</strong> We set the batch size to 1 to keep things simple, but processing messages in batches reduces the number of Lambda invocations, and with it cost and overhead.</p>
</li>
<li><p><strong>Remember</strong> <a target="_blank" href="https://newsletter.simpleaws.dev/p/simple-aws-20-advanced-tips-lambda?utm_source=blog&amp;utm_medium=hashnode"><strong>the 20 advanced tips for Lambda</strong></a><strong>.</strong></p>
</li>
</ul>
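<p>If you do raise the batch size above 1, enable "Report batch item failures" on the event source mapping so one bad record doesn't force a retry of the whole batch. A sketch of a batch-aware handler, with the per-order processing function injected as a parameter (processOrder here is a placeholder for something like saveOrderToDynamoDB):</p>

```javascript
// Batch-aware SQS handler: processes each record and reports only the
// failed message IDs back to Lambda, so successful ones aren't retried.
// Requires "ReportBatchItemFailures" enabled on the event source mapping.
async function handleBatch(event, processOrder) {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      await processOrder(JSON.parse(record.body));
    } catch (err) {
      console.error(`Failed to process message ${record.messageId}`, err);
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  // Lambda re-drives only these messages; an empty array means full success
  return { batchItemFailures };
}

// In the real function you'd export something like:
// exports.handler = (event) => handleBatch(event, saveOrderToDynamoDB);
```

<p>One FIFO-specific caveat: once a record fails, you should also report every later record from the same message group as failed, otherwise you'd process them out of order.</p>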
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><p><strong>Provisioned vs. On-demand for DynamoDB:</strong> Remember that this could be fixed by using our DynamoDB table in On-demand mode. It's 5x more expensive though. Same goes for relational databases (if we use Aurora, then Aurora Serverless is an option).</p>
</li>
<li><p><strong>Consider something other than Lambda:</strong> In this case, we're trying to get all orders processed relatively fast. If the processing can wait a bit more, an auto scaling group that scales based on the number of messages in the SQS queue can work wonders, for a lot less money.</p>
</li>
</ul>
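<p>The queue-based scaling mentioned above is usually built on "backlog per instance": divide the visible messages in the queue by how many one instance can work through within your latency target. A sketch with made-up numbers:</p>

```javascript
// Desired capacity for an auto scaling group consuming from SQS.
// acceptableBacklogPerInstance = messages one instance can clear within
// your latency target, e.g. 10 msg/s * 60-second target = 600.
function desiredInstances(visibleMessages, acceptableBacklogPerInstance, min = 1, max = 20) {
  const desired = Math.ceil(visibleMessages / acceptableBacklogPerInstance);
  // Clamp to the group's min/max so a huge spike doesn't over-scale
  return Math.min(max, Math.max(min, desired));
}

console.log(desiredInstances(3000, 600)); // 5
```

<p>In practice you'd publish visibleMessages / runningInstances as a custom CloudWatch metric and use a target tracking policy against it, which is the approach AWS documents for scaling on SQS.</p>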
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Amazon EBS Basics and Best Practices]]></title><description><![CDATA[Elastic Block Store (EBS for short) is a block-level storage service for EC2 instances. Essentially it's a virtual SSD or HDD that you attach to EC2 instances, so they can have persistent storage. Honestly, EBS is pretty boring to talk about, but if ...]]></description><link>https://blog.guilleojeda.com/ebs-basics-best-practices</link><guid isPermaLink="true">https://blog.guilleojeda.com/ebs-basics-best-practices</guid><category><![CDATA[AWS]]></category><category><![CDATA[storage]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Beginner Developers]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Wed, 30 Aug 2023 16:15:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1693412064654/97353276-25ba-49b4-a93f-348e4299e5db.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html">Elastic Block Store</a> (EBS for short) is a block-level storage service for EC2 instances. Essentially it's a virtual SSD or HDD that you attach to EC2 instances, so they can have persistent storage. Honestly, EBS is pretty boring to talk about, but if you're storing a ton of data, knowing the fine details can save you a lot of money. Let's start with the basics.</p>
<h2 id="heading-ebs-basic-concepts">EBS Basic Concepts</h2>
<h3 id="heading-what-is-an-ebs-volume">What is an EBS Volume</h3>
<p>An EBS volume is a virtual block-level storage device that can be used by EC2 instances to store persistent data. An EBS volume acts like a HDD or SSD, but behind the scenes they're actually an array of physical discs in a RAID configuration, in the same datacenter but physically distanced from each other to minimize the probability of simultaneous failures (e.g. due to fires).</p>
<p>When you create an EBS volume you define the size, the performance (only for some volume types), and the volume type, which is explained in the next section. All three can usually be modified later on current-generation volumes through the Elastic Volumes feature, without detaching the volume.</p>
<p>EBS volumes exist separately from EC2 instances, and can be detached from one instance and attached to another one. When you create an EC2 instance, an EBS volume is created and attached as a root volume, and the default behavior is to delete it when the instance is terminated (this can be changed). But it's important to understand that they are actually a separate service from EC2.</p>
<h2 id="heading-types-of-ebs-volumes">Types of EBS Volumes</h2>
<p>As is often the case in AWS, there are different types of resources to serve different use cases and needs. These are the types of volumes you can create in EBS. You pick the type on creation, but for current-generation volumes you can usually change it later with Elastic Volumes.</p>
<h3 id="heading-ebs-gp3-volumes">EBS GP3 Volumes</h3>
<p>This is the general-purpose volume type, which you should use for most stuff, and default to when in doubt. Size and performance (IOPS) can be configured separately (unlike the previous generation, GP2). Here are some details:</p>
<ul>
<li><p>Volume Size: 1 GB to 16 TB</p>
</li>
<li><p>Durability: 99.8% to 99.9%</p>
</li>
<li><p>Max IOPS/Volume: 16,000 (operations of 16K)</p>
</li>
<li><p>Max Throughput/Volume: 1000 MB/s</p>
</li>
<li><p>Latency: single digit milliseconds</p>
</li>
<li><p><strong>Price:</strong></p>
<ul>
<li><p><strong>$0.08/GB-month</strong></p>
</li>
<li><p><strong>$0.005/provisioned IOPS-month</strong> over 3,000 (the first 3,000 are free)</p>
</li>
</ul>
</li>
</ul>
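<p>With those list prices (which vary a bit by region), a hypothetical monthly bill is easy to sketch. Throughput above the 125 MB/s baseline is billed separately (around $0.04 per MB/s-month) and omitted here for simplicity:</p>

```javascript
// Monthly cost of a GP3 volume at the list prices above:
// $0.08 per GB-month, plus $0.005 per provisioned IOPS-month
// over the 3,000 free IOPS.
function gp3MonthlyCost(sizeGb, provisionedIops = 3000) {
  const storage = sizeGb * 0.08;
  const iops = Math.max(0, provisionedIops - 3000) * 0.005;
  return storage + iops;
}

console.log(gp3MonthlyCost(500, 6000)); // 55  ($40 storage + $15 for 3,000 extra IOPS)
```

<p>Being able to buy IOPS separately from size is the big win over GP2, where the only way to get more IOPS was to provision a bigger volume.</p>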
<h3 id="heading-ebs-io2-volumes">EBS IO2 Volumes</h3>
<p>GP3 is great for most use cases, but if you need more performance, you go with IO2. If you're running a database on an EC2 instance, this is the one to pick. Size and performance (IOPS) can be configured separately, and the limits are higher than for GP3. Details:</p>
<ul>
<li><p>Volume Size: 4 GB to 16 TB</p>
</li>
<li><p>Durability: 99.999%</p>
</li>
<li><p>Max IOPS/Volume: 64,000 (operations of 16K). 256,000 IOPS with <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisioned-iops.html#io2-block-express">io2 Block Express</a>.</p>
</li>
<li><p>Max Throughput/Volume: 1,000 MB/s. 4,000 MB/s with <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/provisioned-iops.html#io2-block-express">io2 Block Express</a>.</p>
</li>
<li><p>Latency: single digit millisecond</p>
</li>
<li><p><strong>Price:</strong></p>
<ul>
<li><p><strong>$0.125/GB-month</strong></p>
</li>
<li><p><strong>$0.065/provisioned IOPS-month up to 32,000 IOPS (no free IOPS)</strong></p>
</li>
<li><p><strong>$0.046/provisioned IOPS-month from 32,001 to 64,000 IOPS</strong></p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-ebs-st1-volumes">EBS ST1 Volumes</h3>
<p>This one is actually an HDD (virtual, but backed by real HDDs). Spinning disks! The use case is sequential access to data that sits contiguously in the physical disks. In general, that means data that is written as a long stream, and then read as a long stream, instead of having different parts accessed at random. It offers pretty good performance for that compared to GP3, at almost half the price. Here are the specs:</p>
<ul>
<li><p>Volume Size: 125 GB to 16 TB</p>
</li>
<li><p>Durability: 99.8% to 99.9% durability</p>
</li>
<li><p>Max IOPS/Volume: 500 (operations of 1 MB, not 16K)</p>
</li>
<li><p>Max Throughput/Volume: 500 MB/s</p>
</li>
<li><p><strong>Price: $0.045/GB-month</strong></p>
</li>
</ul>
<p>Maximum performance varies with size, at 40 MB/s per TB. Additionally, ST1 volumes use a burst credit system: they accumulate credits while usage is below the baseline throughput, and spend them to exceed that baseline for a period of time. In short, a 12.5 TB volume always performs at up to 500 MB/s, and any smaller volume has a lower baseline but can still reach 500 MB/s for short bursts. This is better explained <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/hdd-vols.html#EBSVolumeTypes_st1">here</a>.</p>
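<p>Since the baseline scales linearly with size, you can compute what a given volume is entitled to before burst credits, using the numbers above:</p>

```javascript
// ST1 baseline throughput: 40 MB/s per provisioned TB, capped at the
// 500 MB/s per-volume maximum (reached at 12.5 TB).
function st1BaselineMbPerSec(sizeTb) {
  return Math.min(500, 40 * sizeTb);
}

console.log(st1BaselineMbPerSec(1));    // 40
console.log(st1BaselineMbPerSec(12.5)); // 500
console.log(st1BaselineMbPerSec(16));   // still 500: the cap, not the linear formula
```

<p>So if your streaming workload needs a sustained 200 MB/s, you'd size the volume to at least 5 TB even if you need less space, or the volume will throttle back to baseline once burst credits run out.</p>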
<p>Certification Exam tip: ST1 volumes can't be root volumes.</p>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<hr />
<h3 id="heading-ebs-sc1-volumes">EBS SC1 Volumes</h3>
<p>SC1 volumes are also HDDs, aimed at offering the lowest price per GB of all block storage options. They're recommended for infrequently accessed data that needs to be on block storage (the C in SC1 stands for Cold). If you don't strictly need block storage access, S3 Infrequent Access or Glacier are also viable options.</p>
<ul>
<li><p>Volume Size: 125 GB to 16 TB</p>
</li>
<li><p>Durability: 99.8% to 99.9% durability ← Much less than S3!</p>
</li>
<li><p>Max IOPS/Volume: 250 (operations of 1 MB, not 16K)</p>
</li>
<li><p>Max Throughput/Volume: 250 MB/s</p>
</li>
<li><p><strong>Price: $0.015/GB-month</strong> ← Slightly higher than S3 Infrequent Access, but with EBS SC1 you're not billed for reads and writes</p>
</li>
</ul>
<p>SC1 volumes use the same burst credit system as ST1, though their performance is lower.</p>
<p>Certification Exam tip: SC1 volumes can't be root volumes.</p>
<h3 id="heading-older-volume-types">Older Volume Types</h3>
<p>IO1 (predecessor of IO2) and GP2 (predecessor of GP3) were the norm a few years ago, and you'll most likely find some still in production. IO1 works just like IO2, but with lower durability and a higher price at high IOPS. GP2 has the same use cases as GP3, but its performance is tied to volume size and it uses a burst credit system, like ST1. It's also 25% more expensive than GP3.</p>
<p>Cost-savings tip: Migrate GP2 volumes to GP3.</p>
<h2 id="heading-characteristics-of-ebs-volumes">Characteristics of EBS Volumes</h2>
<h3 id="heading-ec2-root-volume">EC2 Root Volume</h3>
<p>An EC2 instance comes with an EBS volume associated with it, called the root volume. This volume contains the OS, some libraries and programs, and some configurations. This is where the EC2 instance boots from when starting up.</p>
<h3 id="heading-multiple-ebs-volumes">Multiple EBS Volumes</h3>
<p>You can attach up to 128 EBS volumes to an instance (the exact limit depends on the instance type), so long as they're in the same Availability Zone as the instance. Once a volume is attached, the OS sees it just as if you had physically attached a disk, and you can format it with a file system and mount it.</p>
<p>You can detach an EBS volume from an instance and attach it to another instance as many times as you want, so long as both instances are in the same Availability Zone as the volume.</p>
<p>Volumes of any type other than IO1 or IO2 can only be attached to one instance at a time. IO1 and IO2 volumes support Multi-Attach: they can be attached to multiple instances at the same time, in read-write mode (coordinating writes is up to your application).</p>
<h3 id="heading-availability-of-ebs-volumes">Availability of EBS Volumes</h3>
<p>EBS volumes are zonal resources. They exist in a single Availability Zone, so they are <strong>not highly available</strong>. This is also true for IO2 volumes, which offer durability of 99.999%.</p>
<p>EBS volumes are redundant within that Availability Zone, so data loss is significantly less likely than with a single disk. They're backed by an array of physical disks in a RAID configuration.</p>
<h2 id="heading-lifecycle-of-an-ebs-volume">Lifecycle of an EBS Volume</h2>
<p>It's important to understand that the lifecycle of an EBS volume is separate from that of the EC2 instance. You can create them, attach them, detach them and delete them on their own. You can also configure them to be deleted when the EC2 instance they're attached to is terminated, which is the default for the root volume.</p>
<h3 id="heading-encryption-of-ebs-volumes">Encryption of EBS Volumes</h3>
<p>EBS volumes can be <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html">encrypted using KMS</a>, in a way that's entirely transparent to you. You don't need to manage or use any encryption keys; the EBS service automatically fetches them and decrypts the data when you initiate a read operation.</p>
<p>You can enable <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html#encryption-by-default">Encryption by Default</a> on your AWS account, which means every new EBS volume you create will be encrypted unless you explicitly configure it not to. This is a highly recommended best practice for AWS security.</p>
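<p>For reference, here's roughly how you'd flip that switch from the AWS CLI (the region and key alias below are placeholders; note this is a per-region setting, so repeat it in every region you use):</p>

```bash
# Enable EBS encryption by default for the current region
aws ec2 enable-ebs-encryption-by-default --region us-east-1

# Verify the setting took effect
aws ec2 get-ebs-encryption-by-default --region us-east-1

# Optionally, use a specific KMS key instead of the AWS-managed default
# (the key alias is a placeholder)
aws ec2 modify-ebs-default-kms-key-id --kms-key-id alias/my-ebs-key --region us-east-1
```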
<h3 id="heading-backing-up-ebs-volumes-ebs-snapshots">Backing up EBS Volumes: EBS Snapshots</h3>
<p>EBS snapshots are point-in-time copies of EBS volumes, used to back up and restore data. They're incremental, which means they only capture the data that has changed since the last snapshot. This makes EBS snapshots more efficient and cost-effective than full-volume backups. The first snapshot of a volume captures all of its data; each subsequent snapshot only stores the blocks that changed since the previous one.</p>
<p>Snapshots are regional resources, meaning you can use them to hold a copy of an EBS volume and, if the volume's Availability Zone fails, restore it from the snapshot in a different Availability Zone. They can also be shared across regions and AWS accounts. Here's where you can read more about <a target="_blank" href="https://blog.guilleojeda.com/automating-ebs-snapshots-for-disaster-recovery-guide?utm_source=blog&amp;utm_medium=hashnode">automating EBS snapshots for Disaster Recovery</a>.</p>
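<p>As a sketch, copying a snapshot to another region and sharing it with another account looks like this with the AWS CLI (all IDs, regions, and the account number are placeholders):</p>

```bash
# Copy a snapshot to another region for cross-region DR
aws ec2 copy-snapshot \
  --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --region us-west-2 \
  --description "DR copy of app data volume"

# Share a snapshot with another AWS account
aws ec2 modify-snapshot-attribute \
  --snapshot-id snap-0123456789abcdef0 \
  --attribute createVolumePermission \
  --operation-type add \
  --user-ids 111122223333
```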
<h2 id="heading-ebs-best-practices">EBS Best Practices</h2>
<h3 id="heading-default-to-ebs-gp3-volumes">Default to EBS GP3 Volumes</h3>
<p>You should default to GP3 volumes, and only use the other volume types if you have a specific use case, or you know you need more performance. Here's a <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_procedures.html">guide to benchmark EBS volumes</a>.</p>
<h3 id="heading-use-ebs-optimized-instances-for-higher-performance">Use EBS-optimized instances for higher performance</h3>
<p>EC2 Instance families have a limit on performance with EBS volumes, which is independent of the EBS volume itself. If you need high performance, it may not be enough to just use a better EBS volume such as IO2. You'll also need to look into whether your EC2 instances support that level of performance, and possibly use <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html#current-storage-optimized">EC2 instances that are Storage Optimized</a>.</p>
<p>EBS performance is also limited by instance size. You can use the <code>EBSIOBalance%</code> and <code>EBSByteBalance%</code> metrics in CloudWatch to help you determine whether your instances are sized correctly. Instances with a consistently low balance percentage are probably undersized and should be increased in size, while instances whose balance percentage stays pinned at 100% are probably oversized and can be reduced in size.</p>
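<p>As an illustration, here's a toy sizing hint based on the minimum balance observed over a period (the thresholds are my own, not an AWS recommendation; the commented-out CloudWatch query sketches how you'd fetch the real metric, with a placeholder instance ID):</p>

```bash
# Fetch the worst (minimum) EBSByteBalance% over the last day
# (commented out because it needs AWS credentials; instance ID is a placeholder):
# aws cloudwatch get-metric-statistics \
#   --namespace AWS/EC2 --metric-name EBSByteBalance% \
#   --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
#   --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
#   --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#   --period 3600 --statistics Minimum

# Toy rule of thumb: interpret the minimum balance percentage
ebs_sizing_hint() {
  min_balance=$1
  if [ "$min_balance" -lt 20 ]; then
    echo "consider a larger instance"
  elif [ "$min_balance" -ge 100 ]; then
    echo "consider a smaller instance"
  else
    echo "size looks reasonable"
  fi
}

ebs_sizing_hint 10   # prints "consider a larger instance"
```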
<h3 id="heading-use-ec2-instance-store-for-extreme-performance">Use EC2 Instance Store for extreme performance</h3>
<p>If you need extreme performance, you'll need to use <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html">EC2 Instance Store</a>. It's ephemeral (that means non-permanent) block storage with a much higher performance than EBS. The main disadvantages are the pricing (you need an EC2 instance of a special family, which isn't cheap) and the fact that data is lost if the instance is stopped or terminated.</p>
<h3 id="heading-encrypt-your-ebs-volumes">Encrypt your EBS Volumes</h3>
<p>This comes at no cost and no performance hit to you, so it should be a no brainer. First, you should enable <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html#encryption-by-default">Encryption by Default</a>, so all future EBS volumes are created with encryption. Then you should encrypt existing EBS volumes by creating a snapshot of them, encrypting that snapshot and creating a new volume from the encrypted snapshot.</p>
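<p>A rough sketch of that snapshot-copy-restore sequence with the AWS CLI (all IDs and the Availability Zone are placeholders):</p>

```bash
# 1. Snapshot the unencrypted volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "pre-encryption snapshot"

# 2. Copy the snapshot, encrypting the copy
#    (add --kms-key-id to use a specific key)
aws ec2 copy-snapshot --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --encrypted --region us-east-1

# 3. Create a new, encrypted volume from the encrypted copy
aws ec2 create-volume --snapshot-id snap-0fedcba9876543210 \
  --availability-zone us-east-1a --volume-type gp3

# 4. Finally, detach the old volume and attach the new one in its place
```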
<h3 id="heading-migrate-gp2-volumes-to-gp3">Migrate GP2 volumes to GP3</h3>
<p>GP3 volumes can do anything that GP2 volumes can, and they're 20% cheaper. Here's a guide to <a target="_blank" href="https://aws.amazon.com/blogs/storage/migrate-your-amazon-ebs-volumes-from-gp2-to-gp3-and-save-up-to-20-on-costs/">migrate your existing GP2 volumes to GP3</a>.</p>
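<p>The migration itself is a single in-place call, with no detaching or downtime required (the volume ID is a placeholder; you can also raise IOPS and throughput in the same call):</p>

```bash
# Convert a gp2 volume to gp3 in place
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3

# Track the modification until it reaches the "optimizing" or "completed" state
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
```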
<h3 id="heading-back-up-important-ebs-volumes">Back up important EBS volumes</h3>
<p>Remember that EBS volumes are zonal resources, meaning that if an Availability Zone goes offline you won't be able to access them. Furthermore, most EBS volume types offer between 99.8% and 99.9% durability, making data loss or corruption entirely possible. To guard against that, create Snapshots of your EBS volumes.</p>
<p>Snapshots are regional, so you're good if an AZ goes down. But if you want to do cross-region disaster recovery, you'll need to export the snapshots to another AWS region. If you're exporting encrypted snapshots, use a multi-region KMS key.</p>
<p>You can <a target="_blank" href="https://blog.guilleojeda.com/automating-ebs-snapshots-for-disaster-recovery-guide?utm_source=blog&amp;utm_medium=hashnode">use Data Lifecycle Manager to automate creating and exporting EBS snapshots</a>.</p>
<p>Accessing a block for the first time on an EBS volume created from a snapshot has much higher latency than normal, because the data is lazy-loaded from S3. To avoid this, you can initialize (pre-warm) the volume before putting it in production, by reading each block once. You can do this on Linux by attaching the volume to an EC2 instance, installing the <code>fio</code> utility, and running the following command (example for a volume exposed as <code>xvdf</code>):</p>
<pre><code class="lang-bash">sudo fio --filename=/dev/xvdf --rw=<span class="hljs-built_in">read</span> --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
</code></pre>
<p>Another option is to enable <a target="_blank" href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-fast-snapshot-restore.html">EBS fast snapshot restore</a> on the snapshot. This way AWS does the initialization for you, and it's much faster. However, it costs around <strong>$540/month</strong> per snapshot, per Availability Zone it's enabled in (regardless of snapshot size), so I prefer the manual option.</p>
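<p>If you do want fast snapshot restore anyway, enabling it is one call per snapshot (the snapshot ID and AZ are placeholders; remember it's billed per snapshot, per Availability Zone):</p>

```bash
# Enable fast snapshot restore for a snapshot in a specific AZ
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0123456789abcdef0

# Check the state of all fast snapshot restores in the region
aws ec2 describe-fast-snapshot-restores
```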
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item><item><title><![CDATA[Microservices in AWS: Migrating from a Monolith]]></title><description><![CDATA[The first rule about microservices is that you don't need microservices (for 99% of applications). They were invented as a REST-based implementation of Service-Oriented Architectures, which is an XML-based Enterprise Architecture pattern so complex t...]]></description><link>https://blog.guilleojeda.com/microservices-in-aws-migrating-from-a-monolith</link><guid isPermaLink="true">https://blog.guilleojeda.com/microservices-in-aws-migrating-from-a-monolith</guid><category><![CDATA[AWS]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[architecture]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Guillermo Ojeda]]></dc:creator><pubDate>Tue, 29 Aug 2023 01:36:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1693267409159/8a96dc1f-19d0-47dc-9a5a-d4705258d438.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The first rule about microservices is that <strong>you don't need microservices</strong> (for 99% of applications). They were invented as a REST-based implementation of Service-Oriented Architectures, which is an XML-based Enterprise Architecture pattern so complex that XML is the easiest part.</p>
<p>At some point, microservices became this really cool thing that all the cool kids were doing for street cred. "Netflix does it, so if I do it I'll be as cool as Netflix!" Folks, it doesn't work like that.</p>
<h2 id="heading-what-are-microservices">What Are Microservices?</h2>
<p>A microservice is a service in a software application which encapsulates a bounded context (including the data), can be built, deployed and scaled independently, and exposes functionality through a clearly defined API.</p>
<p>Let's expand a bit on each characteristic:</p>
<ul>
<li><p><strong>Bounded context:</strong> The concept stems from Domain-Driven Design's <a target="_blank" href="https://martinfowler.com/bliki/BoundedContext.html">bounded contexts</a>, and essentially advocates for dividing the entire domain into several smaller domains. Microservices takes it a step further: each microservice is part of only one bounded context, and owns that context, including all domain entities, all data, and all operations and functionality. Anything that needs to access that bounded context needs to do so through the microservice's API. Conceptually, it's similar to encapsulation in Object-Oriented Programming, but on a higher level.</p>
</li>
<li><p><strong>Built, deployed and scaled independently:</strong> Each microservice is independent in every sense. It can be built by a separate team using different technologies, it has its own deployment pipeline and process, and it can be scaled independently of the rest of the system. This provides a clear separation between what the microservice does and how it does it, and gives you a lot of flexibility on that how.</p>
</li>
<li><p><strong>Clearly defined API:</strong> One microservice can't solve everything, or it would be just a monolith. You need several, and you need to combine them to realize the entire system's functionality. That means, you need a clear and unambiguous way to communicate with a microservice. The API is the interface of the microservice, and it's the only thing that other components of the system can access. This minimizes dependencies between microservices, letting them be implemented and evolved separately, so long as they adhere to their API. Keep in mind that, since each microservice owns its data, the only way to access the data of a microservice is through that microservice's API. No reading directly from another microservice's database!</p>
</li>
</ul>
<h2 id="heading-why-use-microservices">Why Use Microservices?</h2>
<p>Microservices exist to solve a specific problem: problems in complex domains require complex solutions, which become unmanageable due to the size and complexity of the domain itself. Microservices (when done right) split that complex domain into simpler domains, encapsulating the complexity and reducing the scope of changes.</p>
<p>Microservices also add complexity to the solution, because now you need to figure out where to draw the boundaries of the domains, and how the microservices interact with each other, both at the domain level (complex actions that span several microservices) and at the technical level (service discovery, networking, permissions).</p>
<p><strong>So, when do you need microservices? When the reduction in complexity of the domain outweighs the increase in complexity of the solution.</strong></p>
<p><strong>When do you not need microservices? When the domain is not that complex.</strong> In that case, use regular services, where the only split is in the behavior (i.e. backend code). Or stick with a monolith, Facebook does that and it works pretty well, at a size we can only dream of.</p>
<h2 id="heading-types-of-microservices">Types of Microservices</h2>
<p>There are two ways in which you can split your application into microservices:</p>
<ul>
<li><p><strong>Vertical slices:</strong> Each microservice solves a particular use case or a set of tightly-related use cases. You add services as you add features, and each user interaction goes through the minimum possible number of services (ideally only 1). This means features are an aspect of decomposition. Code reuse is achieved through shared libraries, and cross-service responsibilities are implemented on support microservices. This results in architectures very similar to SOA.</p>
</li>
<li><p><strong>Functional services:</strong> Each service handles one particular step, integration, state, or thing. System behavior is an emergent property, resulting from combining different services in different ways. Each user interaction invokes multiple services. New features don't need entirely new services, just new combinations of services. Features are an aspect of integration, not decomposition. Code reuse is often achieved through invoking another service. This is often much harder to do, both because of the difficulty in translating use cases into reusable steps, and because you need a lot of complex distributed transactions.</p>
</li>
</ul>
<p>Overall, vertical slices are easier to understand, and easier to implement for smaller systems. The drawback is that if your system does 200 different things, you'll need 200 services, plus support services and libraries. Functional services are harder to conceptualize, and it's not uncommon to end up with a ton of microservices that have 50 lines of code and don't own any data. If that's your case, you're doing it wrong. Remember that the split should be at the domain level, not at the code level. It's perfectly ok for a microservice to be implemented with several services!</p>
<p>Don't combine these two types of microservices! If you're doing vertical slices, support microservices should be only for non-business behavior, such as logging. If you're doing functional microservices, don't create a service that just orchestrates calls between other microservices; either use an orchestrator for all transactions, or choreograph them. And don't even think about migrating from one type of microservices to the other one. It's much, much easier to just drop the whole system and start from scratch.</p>
<h2 id="heading-splitting-a-monolith-into-microservices">Splitting a Monolith into Microservices</h2>
<p>Let's see microservices in a real example. Picture the following scenario: We have an online learning platform built as a monolithic application, which enables users to browse and enroll in a variety of courses, access course materials such as videos, quizzes, and assignments, and track their progress throughout the courses. The application is <a target="_blank" href="https://newsletter.simpleaws.dev/p/migrate-nodejs-app-from-ec2-to-scalable-ecs-guide?utm_source=blog&amp;utm_medium=hashnode">deployed on Amazon ECS</a> as a single service that's scalable and highly available.</p>
<p>As the app grew, we've noticed that content delivery becomes a bottleneck during normal operations. Additionally, changes in the course directory resulted in some bugs in progress tracking. To deal with these issues, we decided to split the app into three microservices: Course Catalog, Content Delivery, and Progress Tracking.</p>
<p><strong>Out of scope (so we don't lose focus):</strong></p>
<ul>
<li><p><strong>Authentication/authorization:</strong> When I say “users” I mean authenticated users. We could <a target="_blank" href="https://newsletter.simpleaws.dev/p/securing-microservices-aws-cognito?utm_source=blog&amp;utm_medium=hashnode">use Cognito to secure access to microservices</a>, but let's focus on designing the microservices first.</p>
</li>
<li><p><strong>User registration and management:</strong> Same as above.</p>
</li>
<li><p><strong>Payments:</strong> Since our courses are so awesome, we should charge for them. We could use a separate microservice that integrates with a payment processor such as Stripe.</p>
</li>
<li><p><strong>Caching and CDN:</strong> We should use CloudFront to cache the content, to reduce latency and costs. We'll do that in a future issue, let's focus on the microservices right now.</p>
</li>
<li><p><strong>Frontend:</strong> Obviously, we need a frontend for our app. Let's keep the focus on the microservices, but if you're interested in frontend you might want to check out <a target="_blank" href="https://newsletter.simpleaws.dev/p/aws-amplify-server-side-rendering?utm_source=blog&amp;utm_medium=hashnode">AWS Amplify</a>.</p>
</li>
<li><p><strong>Database design:</strong> Let's assume our database is properly designed. If you're interested in this topic, you should read <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-database-design?utm_source=blog&amp;utm_medium=hashnode">DynamoDB Database Design</a>.</p>
</li>
<li><p><strong>Admin:</strong> Someone has to create the courses, upload the content, course metadata, etc. The operations to do that fall under the scope of our microservices, but I feared it would grow too complex, so I cut those features out.</p>
</li>
</ul>
<p><strong>AWS Services involved:</strong></p>
<ul>
<li><p><strong>ECS:</strong> Our app is already deployed in ECS as a single ECS Service, we're going to split it into 3 microservices and deploy each as an ECS Service. We won't dive deep into ECS, but if you're interested you can learn about <a target="_blank" href="https://newsletter.simpleaws.dev/p/migrate-nodejs-app-from-ec2-to-scalable-ecs-guide?utm_source=blog&amp;utm_medium=hashnode">how to deploy a Node.js application on ECS</a>.</p>
</li>
<li><p><strong>DynamoDB:</strong> Our database for this example.</p>
</li>
<li><p><strong>API Gateway:</strong> Used to expose each microservice.</p>
</li>
<li><p><strong>Elastic Load Balancer:</strong> To balance traffic across all the tasks.</p>
</li>
<li><p><strong>S3:</strong> Storage for the content (video files) of the courses.</p>
</li>
<li><p><strong>ECR:</strong> A Docker registry managed by AWS.</p>
</li>
</ul>
<p>Final design of the app split into microservices</p>
<h2 id="heading-how-to-split-a-monolith-into-microservices">How to Split a Monolith Into Microservices</h2>
<h3 id="heading-step-0-make-the-monolith-modular"><strong>Step 0: Make the Monolith Modular</strong></h3>
<p>The first step should always be to make sure your monolith is already separated into modules with clearly defined responsibilities. Modules should be well scoped, both in terms of functionality and in the code that implements that functionality. They should be cohesive, and loosely coupled to other modules. The level of granularity doesn't matter much, though ideally you'd be splitting modules according to the concept of domains from Domain-Driven Design (you don't need to apply the entirety of Domain-Driven Design). However, you can refine the scope and granularity when you start with microservices. For now, what's important is that you have clearly defined modules with clearly defined responsibilities, instead of a bowl of spaghetti code.</p>
<p>For this example we're going to assume this is already the case, but if you're dealing with a monolith that's not well modularized, that should be the first thing you do. If you commit all the way to microservices, you won't really use the modular monolith. However, I still recommend you first work on separating it into modules, to make the overall process easier by tackling one thing at a time.</p>
<h3 id="heading-step-1-identify-the-microservices"><strong>Step 1: Identify the Microservices</strong></h3>
<p>Start by analyzing the monolithic application, focusing on the course catalog, content delivery, and progress tracking functionalities. Based on these functionalities, outline the responsibilities for each microservice:</p>
<ul>
<li><p><strong>Course Catalog:</strong> manage courses and their metadata.</p>
</li>
<li><p><strong>Content Delivery:</strong> handle storage and distribution of course content.</p>
</li>
<li><p><strong>Progress Tracking:</strong> manage user progress through courses.</p>
</li>
</ul>
<h3 id="heading-step-2-define-the-apis-for-each-microservice"><strong>Step 2: Define the APIs for each microservice</strong></h3>
<p>Once you understand what each microservice needs to do, you need to design the API endpoints for each microservice:</p>
<ul>
<li><p>Course Catalog:</p>
<ul>
<li><p><code>GET /courses</code> → list all courses</p>
</li>
<li><p><code>GET /courses/:id</code> → get a specific course</p>
</li>
</ul>
</li>
<li><p>Content Delivery:</p>
<ul>
<li><code>GET /content/:id</code> → get a pre-signed URL for a specific course content</li>
</ul>
</li>
<li><p>Progress Tracking:</p>
<ul>
<li><p><code>GET /progress/:userId</code> → get a user's progress</p>
</li>
<li><p><code>PUT /progress/:userId/:courseId</code> → update a user's progress for a specific course</p>
</li>
</ul>
</li>
</ul>
<p>API endpoints are how microservices define and expose their functionality to external components. Essentially, the API is what a microservice can do for the user or for other microservices. We already knew the responsibilities of each microservice from Step 1, with this step we're expressing them in technical terms that other components can understand. We're also documenting them in a clear and unambiguous way.</p>
<p>If you're starting from a well-designed modular monolith, these APIs already exist as the APIs for services and interfaces for components, and you're just re-expressing them in a different, unified way. If the starting monolith isn't well modularized, you may find some of these APIs as functions, and you may need to add a few. In those cases it's easier to first modularize the monolith, then split it into microservices.</p>
<p>API design is really important, and hard to do. We're not just splitting the entire app's responsibilities into groups that we call microservices. We're actually creating several apps, that we're then going to interconnect to produce the expected system behavior. We need to not only define those apps' responsibilities well, but also design them in a maintainable way. Check out <a target="_blank" href="https://www.martinfowler.com/articles/consumerDrivenContracts.html"><strong>Fowler's post on consumer-driven contracts</strong></a> for some deeper insights.</p>
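<p>To make the endpoints above concrete, here's what calling them might look like once they're behind API Gateway (the base URLs and the request body are placeholders of my own, not from the actual app):</p>

```bash
# Course Catalog: list courses, then fetch one
curl https://catalog.example.com/courses
curl https://catalog.example.com/courses/42

# Content Delivery: returns a pre-signed S3 URL for the content
curl https://content.example.com/content/42

# Progress Tracking: read and update a user's progress
curl https://progress.example.com/progress/user-1
curl -X PUT https://progress.example.com/progress/user-1/42 \
  -H "Content-Type: application/json" \
  -d '{"completedLessons": 3}'
```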
<h3 id="heading-step-3-configure-api-gateway-for-each-microservice"><strong>Step 3: Configure API Gateway for each microservice</strong></h3>
<p>Create an API in API Gateway for each microservice (Course Catalog, Content Delivery, and Progress Tracking). Point the different routes to your monolith's APIs for now, since we don't have any microservices yet. Update any frontend code or DNS records to resolve to the API Gateways.</p>
<p>This isn't a strict requirement, but I added it as part of the solution because it makes the switchover much easier: All we need to do is update the API Gateway of each microservice to point to the newly deployed microservice. Since everything else already depends on that API Gateway for that functionality, we're just changing who's resolving those requests. This way, we've effectively decoupled the API from its implementation. API Gateway also makes other things much easier, such as <a target="_blank" href="https://newsletter.simpleaws.dev/p/securing-microservices-aws-cognito?utm_source=blog&amp;utm_medium=hashnode">managing authentication for microservices</a>.</p>
<h3 id="heading-step-4-create-separate-repositories-and-projects-for-each-microservice"><strong>Step 4: Create separate repositories and projects for each microservice</strong></h3>
<p>Set up individual repositories and Node.js projects for Course Catalog, Content Delivery, and Progress Tracking microservices. Structure the projects using best practices, with separate folders for routes, controllers, and database access code. You know the drill.</p>
<p>This is just the scaffolding, moving the actual code comes in the next step. The key takeaway is that you treat each microservice as a separate project. You could also use a monorepo, where the whole codebase is in a single git repository, each service has its own folder, and it's still deployed separately. This works well when you have a lot of shared dependencies, but in my experience it's harder to pull off.</p>
<h3 id="heading-step-5-separate-the-code"><strong>Step 5: Separate the code</strong></h3>
<p>Refactor the monolithic application code, moving the relevant functionality for each microservice into its respective project:</p>
<ul>
<li><p>Move the code related to managing courses and their metadata into the Course Catalog microservice project.</p>
</li>
<li><p>Move the code related to handling storage and distribution of course content into the Content Delivery microservice project.</p>
</li>
<li><p>Move the code related to managing user progress through courses into the Progress Tracking microservice project.</p>
</li>
</ul>
<p>The code in the monolith may not be as clearly separated as you might want. In that case, first refactor as needed until you can copy-paste the implementation code from your monolith to your services (but don't copy it just yet). Then test the refactor. Finally, do the copy-pasting.</p>
<h3 id="heading-step-6-separate-the-data"><strong>Step 6: Separate the data</strong></h3>
<p>First, create separate Amazon DynamoDB tables for each microservice:</p>
<ul>
<li><p><strong>CourseCatalog:</strong> stores course metadata, such as title, description, and content ID.</p>
</li>
<li><p><strong>Content:</strong> stores content metadata, including content ID, content type, and S3 object key.</p>
</li>
<li><p><strong>Progress:</strong> stores user progress, with fields for user ID, course ID, and progress details.</p>
</li>
</ul>
<p>Then update the database access code and configurations in each microservice, so each one interacts with its own table.</p>
<p>Remember that the difference between a service and a <em>micro</em>service is the <a target="_blank" href="https://martinfowler.com/bliki/BoundedContext.html">bounded context</a>. Each microservice owns its domain model, including the data, and the only way to access that model (and the database that stores it) is through that microservice's API.</p>
<p>We could implement this separation of data at the conceptual level, without enforcing it through separate tables. We could even enforce it while keeping all data in a single table, using IAM's fine-grained access control for DynamoDB (policy conditions that limit which keys and attributes each service can read). The problem with that idea (aside from the permissions nightmare) is that we wouldn't be able to scale the services independently, since DynamoDB capacity is managed per table.</p>
<p>If you're doing this for a database which already has data, but you can tolerate the system being offline during the migration, you can <a target="_blank" href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.HowItWorks.html">export the data to S3</a>, use <a target="_blank" href="https://aws.amazon.com/glue/">Glue</a> to filter the data, and then <a target="_blank" href="https://aws.amazon.com/blogs/database/amazon-dynamodb-can-now-import-amazon-s3-data-into-a-new-table/">import it back to DynamoDB</a>.</p>
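<p>As a sketch, the export-to-S3 path starts like this with the AWS CLI (the export requires point-in-time recovery enabled on the table; the table name, ARN, account ID, and bucket are placeholders):</p>

```bash
# Enable point-in-time recovery, a prerequisite for table exports
aws dynamodb update-continuous-backups \
  --table-name Monolith \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Export the table to S3 without consuming read capacity
aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-east-1:111122223333:table/Monolith \
  --s3-bucket my-export-bucket
```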
<p>If the system is live, this step gets trickier. Here's how you can split a DynamoDB table with minimal downtime:</p>
<ul>
<li><p>First, add a timestamp to your data if you don't have one already.</p>
</li>
<li><p>Next, create the new tables.</p>
</li>
<li><p>Then set up <a target="_blank" href="https://newsletter.simpleaws.dev/p/dynamodb-streams-reacting-to-changes?utm_source=blog&amp;utm_medium=hashnode">DynamoDB Streams</a> to replicate all future writes to the new table. You'll need to set one stream per microservice. It's easier if you set it up to copy all the data and after the switchover you delete the irrelevant data. But if you're performing a lot of writes, it will be cheaper to selectively copy only the data that belongs to the microservice.</p>
</li>
<li><p>Then copy the old data, either with a script or with an S3 export + Glue (don't use the DynamoDB import, it only works for new tables, write the data manually instead). Make sure this can handle duplicates.</p>
</li>
<li><p>Finally, switch over to the new tables.</p>
</li>
</ul>
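<p>The stream consumer's routing logic can stay tiny if the monolith's table prefixes its partition keys by entity type. That's an assumption on my part (the article doesn't describe the key schema), but under it the routing rule is just:</p>

```bash
# Decide which microservice table a replicated item belongs to,
# based on a hypothetical partition-key prefix convention (COURSE#, CONTENT#, PROGRESS#)
route_table() {
  case "$1" in
    COURSE#*)   echo "CourseCatalog" ;;
    CONTENT#*)  echo "Content" ;;
    PROGRESS#*) echo "Progress" ;;
    *)          echo "unknown" ;;
  esac
}

route_table "COURSE#42"   # prints "CourseCatalog"
```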
<p>I picked DynamoDB for this example because DynamoDB tables are easy to create and manage (other than designing the data model). In a relational database we would need to consider the tradeoff between having to manage (and pay for) one DB cluster per microservice, or having different databases in the same cluster. The latter is definitely cheaper, but it can get harder to manage permissions, and we lose the ability to scale the data stores independently. Aurora Serverless is a viable alternative; it scales very similarly to DynamoDB in Provisioned Mode. However, it's 4x more expensive than serverful Aurora.</p>
<h3 id="heading-step-7-build-and-deploy-the-microservices"><strong>Step 7: Build and Deploy the Microservices</strong></h3>
<p>We're using ECS for this example, just so we can focus on the microservices part, instead of debating over how to deploy an app. These are the steps to deploy in ECS, which you'll need to do separately for each microservice:</p>
<ul>
<li><p>Write a Dockerfile specifying the base image, copying the source code, installing packages, and setting the appropriate entry point. Test this, obviously.</p>
</li>
<li><p>Build and push the Docker image to an Amazon Elastic Container Registry (ECR) registry. You'll use a separate registry for each microservice (remember they're separate apps).</p>
</li>
<li><p>Create a Task Definition in Amazon ECS, specifying the required CPU, memory, environment variables, and the ECR image URL.</p>
</li>
<li><p>Create an ECS service, associating it with the corresponding Task Definition. Make sure this is working properly.</p>
</li>
</ul>
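<p>For each microservice, the build-and-push loop looks roughly like this (the account ID, region, and repository name are placeholders):</p>

```bash
# Create the microservice's own ECR repository
aws ecr create-repository --repository-name course-catalog

# Authenticate Docker against the registry
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS \
      --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com

# Build, tag, and push the image
docker build -t course-catalog .
docker tag course-catalog:latest \
  111122223333.dkr.ecr.us-east-1.amazonaws.com/course-catalog:latest
docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/course-catalog:latest
```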
<p>I don't want to dive too deep into how to deploy an app to ECS. If you're not sure how to do it, <a target="_blank" href="https://newsletter.simpleaws.dev/p/migrate-nodejs-app-from-ec2-to-scalable-ecs-guide?utm_source=blog&amp;utm_medium=hashnode">here's an article I wrote about it</a>.</p>
<h3 id="heading-step-8-update-api-gateway"><strong>Step 8: Update API Gateway</strong></h3>
<p>For each API in API Gateway, you'll need to update the routes to point to the newly deployed microservice, instead of to the monolith. First do it on a testing stage, even if you already ran everything in a separate dev environment. Then configure a <a target="_blank" href="https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html">canary release</a>, and let the microservice gradually take traffic.</p>
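<p>A sketch of that canary rollout with the AWS CLI (the REST API ID and stage name are placeholders):</p>

```bash
# Deploy the microservice-backed routes as a canary taking 10% of traffic
aws apigateway create-deployment \
  --rest-api-id a1b2c3d4e5 \
  --stage-name prod \
  --canary-settings percentTraffic=10

# Dial the canary up as confidence grows
aws apigateway update-stage \
  --rest-api-id a1b2c3d4e5 \
  --stage-name prod \
  --patch-operations op=replace,path=/canarySettings/percentTraffic,value=50
```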
<p>You might want to preemptively scale the microservice way beyond the expected capacity requirement. One hour of overprovisioning will cost you a lot less than angry customers.</p>
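<p>The canary configuration itself is tiny. As a sketch with boto3 (the API ID and stage name are made-up placeholders, and the call is commented out because it needs a real API Gateway REST API):</p>

```python
# Canary release sketch: deploy to the prod stage, but route only
# 5% of traffic to the new deployment at first, then ramp up.
canary_settings = {
    "percentTraffic": 5.0,   # start small, watch your metrics, increase gradually
    "useStageCache": False,
}

# With a real API, the deployment call looks like:
# import boto3
# apigw = boto3.client("apigateway")
# apigw.create_deployment(
#     restApiId="a1b2c3d4e5",   # placeholder
#     stageName="prod",
#     canarySettings=canary_settings,
# )

print(canary_settings)
```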
<h2 id="heading-user-interaction-in-the-monolith-vs-in-microservices">User Interaction in the Monolith vs in Microservices</h2>
<p>Here's the journey for a user viewing a course in our monolith:</p>
<ol>
<li><p>The user sends a login request with their credentials to the monolithic application.</p>
</li>
<li><p>The application validates the credentials and, if valid, generates an authentication token for the user.</p>
</li>
<li><p>The user sends a request to view a course, including the authentication token in the request header.</p>
</li>
<li><p>The application checks the authentication token and retrieves the course details from the Courses table in DynamoDB.</p>
</li>
<li><p>The application retrieves the course content metadata from the Content table in DynamoDB, including the S3 object key.</p>
</li>
<li><p>Using the S3 object key, the application generates a pre-signed URL for the course content from Amazon S3.</p>
</li>
<li><p>The application responds with the course details and the pre-signed URL for the course content.</p>
</li>
<li><p>The user's browser displays the course details and loads the course content using the pre-signed URL.</p>
</li>
</ol>
<p>And here's the same functionality in our microservices:</p>
<ol>
<li><p>The user sends a login request with their credentials to the authentication service (not covered in the previous microservices example).</p>
</li>
<li><p>The authentication service validates the credentials and, if valid, generates an authentication token for the user.</p>
</li>
<li><p>The user sends a request to view a course, including the authentication token in the request header, to the Course Catalog microservice through API Gateway.</p>
</li>
<li><p>The Course Catalog microservice checks the authentication token and retrieves the course details from its Course Catalog table in DynamoDB.</p>
</li>
<li><p>The Course Catalog microservice responds with the course details.</p>
</li>
<li><p>The user's browser sends a request to access the course content, including the authentication token in the request header, to the Content Delivery microservice through API Gateway.</p>
</li>
<li><p>The Content Delivery microservice checks the authentication token and retrieves the course content metadata from its Content table in DynamoDB, including the S3 object key.</p>
</li>
<li><p>Using the S3 object key, the Content Delivery microservice generates a pre-signed URL for the course content from Amazon S3.</p>
</li>
<li><p>The Content Delivery microservice responds with the pre-signed URL for the course content.</p>
</li>
<li><p>The user's browser displays the course details and loads the course content using the pre-signed URL.</p>
</li>
</ol>
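<p>Putting steps 7 through 9 together, the Content Delivery microservice's handler boils down to something like this sketch. Everything here is hypothetical: the token check, the metadata lookup, and the pre-signing are stubs standing in for real auth, DynamoDB, and S3 calls, just to show the shape of the flow:</p>

```python
# Sketch of the Content Delivery handler (steps 7-9 above).
# All helpers are stubs standing in for real auth, DynamoDB, and S3 calls.

VALID_TOKENS = {"token-123": "user-1"}   # stand-in for real token validation
CONTENT_TABLE = {                        # stand-in for the Content DynamoDB table
    "course-101": {"s3_key": "courses/101/video1.mp4"},
}

def check_token(token):
    return VALID_TOKENS.get(token)

def get_content_metadata(course_id):
    return CONTENT_TABLE.get(course_id)

def presign(s3_key):
    # stand-in for s3.generate_presigned_url(...)
    return f"https://course-content.s3.amazonaws.com/{s3_key}?X-Amz-Signature=stub"

def get_course_content(token, course_id):
    if check_token(token) is None:
        return {"status": 401, "error": "invalid token"}
    metadata = get_content_metadata(course_id)
    if metadata is None:
        return {"status": 404, "error": "course not found"}
    return {"status": 200, "url": presign(metadata["s3_key"])}

print(get_course_content("token-123", "course-101"))
```

<p>Note that the service only touches its own table; course details live with the Course Catalog microservice, and the browser stitches the two responses together.</p>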
<h2 id="heading-best-practices-for-microservices-on-aws">Best Practices for Microservices on AWS</h2>
<h3 id="heading-operational-excellence">Operational Excellence</h3>
<ul>
<li><p><strong>Centralized logging:</strong> You're basically running 3 apps. Store the logs in the same place, such as CloudWatch Logs (which ECS automatically configures for you).</p>
</li>
<li><p><strong>Distributed tracing:</strong> These three services don't call each other, but in a real microservices app it's a lot more common for that to happen. In those cases, following the trail of calls becomes rather difficult. <a target="_blank" href="https://newsletter.simpleaws.dev/p/using-aws-xray-observability-eventdriven-architectures?utm_source=blog&amp;utm_medium=hashnode">Use X-Ray</a> to make it a lot simpler.</p>
</li>
</ul>
<h3 id="heading-security">Security</h3>
<ul>
<li><p><strong>Least privilege:</strong> It's not enough to simply not write code that accesses another service's data; you should also enforce that boundary via IAM permissions. Each microservice should use its own IAM role that grants access only to its own DynamoDB table, not <code>*</code>.</p>
</li>
<li><p><strong>Networking:</strong> If a service doesn't need network visibility, it shouldn't have it. Enforce it with security groups.</p>
</li>
<li><p><strong>Zero trust:</strong> The idea is to not trust agents just because they're inside your network, but to authenticate every request at every stage. Exposing your services through API Gateway gives you an easy way to do this. Yes, you should do this even when exposing them to other services.</p>
</li>
</ul>
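<p>To make the least-privilege point concrete, here's roughly what a policy for the Course Catalog service's role could look like. The account ID, region, and table name are placeholders; the important part is that the resource is one specific table, not <code>*</code>:</p>

```python
import json

# Hypothetical least-privilege policy for the Course Catalog service:
# it can read and write its own table, and nothing else.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "dynamodb:GetItem",
            "dynamodb:PutItem",
            "dynamodb:UpdateItem",
            "dynamodb:Query",
        ],
        # One specific table ARN, not "*" (account ID is a placeholder)
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/CourseCatalog",
    }],
}
print(json.dumps(policy, indent=2))
```

<p>If the Content Delivery service's code ever tries to read this table, it fails with an AccessDenied error instead of silently coupling the two services.</p>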
<h3 id="heading-reliability">Reliability</h3>
<ul>
<li><p><strong>Circuit breakers:</strong> User calls Service A, Service A calls Service B, Service B fails, the failure cascades, everything fails, your car is suddenly on fire (just go with it), your boss is suddenly on fire (is that a bad thing?), everything is on fire. Circuit breakers act exactly like the electric versions: They prevent a failure in one component from affecting the whole system. <a target="_blank" href="https://www.martinfowler.com/bliki/CircuitBreaker.html">I'll let Fowler explain</a>.</p>
</li>
<li><p><strong>Consider different scaling speeds:</strong> If Service A depends on Service B, consider that Service B scales independently, which could mean that instances of Service B are not started as soon as Service A gets a request. Service B could be implemented in a different platform (EC2 Auto Scaling vs Lambda), which scales at a different speed. Keep that in mind for service dependencies, and <a target="_blank" href="https://newsletter.simpleaws.dev/p/using-sns-decouple-components?utm_source=blog&amp;utm_medium=hashnode">decouple the services</a> when you can.</p>
</li>
</ul>
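<p>A minimal circuit breaker can be sketched in a few lines of Python. This is a toy version (real implementations like the one Fowler describes also add a half-open state and a reset timeout); the failure threshold here is arbitrary:</p>

```python
# Toy circuit breaker: after `threshold` consecutive failures, stop
# calling the downstream service and fail fast instead.
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the count
        return result

# Simulate Service B failing repeatedly:
breaker = CircuitBreaker(threshold=3)

def flaky():
    raise ConnectionError("Service B is down")

for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass

print(breaker.open)  # the breaker has tripped; callers now fail fast
```

<p>Once the breaker is open, Service A returns an error (or a fallback) immediately instead of piling up requests against a dead Service B, which is what stops the cascade.</p>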
<h3 id="heading-performance-efficiency">Performance Efficiency</h3>
<ul>
<li><p><strong>Scale services independently:</strong> Your microservices are so independent that even their databases are independent! You know what that means? You can scale them at will!</p>
</li>
<li><p><strong>Rightsize ECS tasks:</strong> Now that you split your monolith, it's time to check the resource usage of each microservice, and fine-tune them independently.</p>
</li>
<li><p><strong>Rightsize DynamoDB tables:</strong> Same as above, for the database tables.</p>
</li>
</ul>
<h3 id="heading-cost-optimization">Cost Optimization</h3>
<ul>
<li><p><strong>Optimize capacity:</strong> Determine how much capacity each service needs, and optimize for it. Get a savings plan for the baseline capacity.</p>
</li>
<li><p><strong>Consider different platforms:</strong> Different microservices have different needs. A user-facing microservice might need to scale really fast, at the speed of Fargate or Lambda. A service that only processes asynchronous transactions, such as a payments-processing service, probably doesn't need to scale as fast, and can get away with an Auto Scaling Group (which is cheaper per compute time). A batch processing service could even use Spot Instances! Every service is independent, so don't limit yourself.</p>
</li>
<li><p><strong>Consider the increased management efforts:</strong> It's easier (thus cheaper) to manage 10 Lambda functions than to manage 5 Lambda functions, 1 ECS cluster and 2 Auto Scaling Groups.</p>
</li>
</ul>
<hr />
<p>Stop copying cloud solutions, start <strong>understanding</strong> them. Join over 45,000 devs, tech leads, and experts learning how to architect cloud solutions, not pass exams, with the <a target="_blank" href="https://newsletter.simpleaws.dev?utm_source=blog&amp;utm_medium=hashnode">Simple AWS newsletter</a>.</p>
<ul>
<li><p><strong>Real</strong> scenarios and solutions</p>
</li>
<li><p>The <strong>why</strong> behind the solutions</p>
</li>
<li><p><strong>Best practices</strong> to improve them</p>
</li>
</ul>
<p><a target="_blank" href="https://newsletter.simpleaws.dev/subscribe?utm_source=blog&amp;utm_medium=hashnode">Subscribe for free</a></p>
<iframe src="https://embeds.beehiiv.com/1c90a8a9-57b7-4a3f-ac56-5f05d0121f72?slim=true" style="margin:0;border-radius:0px;background-color:transparent;display:block;margin-left:auto;margin-right:auto" height="55px"></iframe>

<p>If you'd like to know more about me, you can find me <a target="_blank" href="https://www.linkedin.com/in/guilleojeda/">on LinkedIn</a> or at <a target="_blank" href="https://www.guilleojeda.com?utm_source=blog&amp;utm_medium=hashnode">www.guilleojeda.com</a></p>
]]></content:encoded></item></channel></rss>