SAA-C03 Cheat Sheet — AWS Certified Solutions Architect – Associate Study Guide

Design Secure Architectures

Secure Access to AWS Resources

Cheat sheet

Sharp facts the exam loves — scan these before test day.

Give workloads an IAM role, never embedded static keys

An IAM role attached to compute hands out short-lived credentials that AWS rotates automatically, so there is nothing long-lived to leak: attach one to anything that calls AWS. Each service has its own role flavor: EC2 gets an instance profile, Lambda an execution role, ECS a task role, EKS uses IRSA, and GitHub Actions federates via OIDC AssumeRoleWithWebIdentity. Static access keys are the fallback only when something genuinely can't assume a role, like legacy SaaS that can't do OIDC.

Trap Baking long-lived access keys into an instance or your code: a permanent secret leaks once and stays compromised forever, whereas role credentials expire and rotate on their own.

3 questions test this

5+ humans → federate through IAM Identity Center, not IAM users

When workforce access reaches real scale, keep the identities in your existing IdP (AD, Okta, Entra, or Google) and bind them through Identity Center permission sets, assigned as (group × account × permission set) so access stays centralized and group-managed instead of being copied into every account. AWS renamed AWS SSO to IAM Identity Center^[7] in 2022, so treat both names as one service. Stand-alone IAM users and self-managed federation fragment identity across accounts instead of driving it from one source.

Trap Reaching for local IAM users or Cognito to handle staff access: local IAM users only make sense for a 2-3-person startup, and Cognito User Pools are for your end-customers rather than staff, so neither will scale or centralize.

2 questions test this

A permission boundary caps maximum permissions, it never grants any

A permission boundary sets a ceiling on what an IAM principal can do: effective permissions are the intersection identity Allow ∩ boundary Allow ∩ no Deny ∩ no SCP Deny, so it can only ever subtract, never add. That makes it the right tool for delegated administration: developers create their own roles while the boundary guarantees those roles can never exceed it, stopping iam:* privilege escalation without you reviewing every role by hand.

Trap Expecting a permission boundary to grant access: it only ever intersects, so an action missing from the identity policy stays denied no matter what the boundary allows.
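
The intersection rule can be sketched with plain sets. This is a simplified model, not how IAM is implemented: actions are exact strings with no wildcard matching, and the function and variable names are made up for illustration.

```python
def effective_permissions(identity_allow, boundary_allow,
                          explicit_deny=frozenset(), scp_deny=frozenset()):
    """Return the actions a principal can actually perform.

    Simplification: actions are exact strings (no wildcards), and the SCP
    is modeled only through its Deny statements, as in the formula above.
    """
    allowed = set(identity_allow) & set(boundary_allow)   # the boundary can only subtract
    return allowed - set(explicit_deny) - set(scp_deny)   # any Deny removes the action


# Hypothetical developer role: the identity policy asks for more than the boundary permits.
identity = {"s3:GetObject", "s3:PutObject", "iam:CreateRole"}
boundary = {"s3:GetObject", "s3:PutObject", "dynamodb:GetItem"}

allowed = effective_permissions(identity, boundary)
print(sorted(allowed))  # ['s3:GetObject', 's3:PutObject']
```

Note that dynamodb:GetItem is in the boundary but not in the result: the boundary never grants anything the identity policy did not already allow.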

5 questions test this

Cross-account access = role with a trust policy + sts:AssumeRole

Cross-account access works by account B creating a role whose trust policy names principals in account A; A then calls sts:AssumeRole and receives temporary credentials, so access flows through short-lived sessions rather than shared secrets and every assume can be audited and conditioned. Harden the trust policy for the situation: add sts:ExternalId when the caller is a third-party SaaS to defeat the confused deputy, and require aws:MultiFactorAuthPresent when a human assumes the role.

Trap Sharing access keys across accounts just to skip setting up a role: that hands out permanent credentials and loses the per-assume conditions like ExternalId and MFA that a trust policy gives you.

8 questions test this

Third-party assuming your role → pin sts:ExternalId, not just their ARN

When a SaaS vendor assumes a role in your account, pin sts:ExternalId^[12] in the trust policy's Condition: the vendor issues a unique ID for your account and your trust policy checks it on every AssumeRole, rejecting any request that doesn't carry your specific ID. This closes the confused-deputy gap. The vendor's own role is the principal, but the ExternalId proves the request was meant for your account and not another customer's.

Trap Pinning only the vendor's principal ARN: it feels like enough but isn't, because another customer of that same vendor could be tricked into supplying your role ARN, the classic confused-deputy attack.
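
A trust policy that pins both the vendor principal and the ExternalId might look like the sketch below. The account ID, role name, and ExternalId value are placeholders, and the helper name is hypothetical; only the policy shape (Principal, sts:AssumeRole, and the sts:ExternalId condition) comes from the text above.

```python
import json


def third_party_trust_policy(vendor_principal_arn: str, external_id: str) -> dict:
    """Build a role trust policy that admits the vendor only with our ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": vendor_principal_arn},
            "Action": "sts:AssumeRole",
            # Without this condition, any customer of the vendor could be steered at our role.
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


# Placeholder values, for illustration only.
trust_policy = third_party_trust_policy(
    "arn:aws:iam::111122223333:role/vendor-role",
    "example-external-id",
)
print(json.dumps(trust_policy, indent=2))
```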

4 questions test this

SCPs only restrict member accounts; they never grant access

Service Control Policies^[10] attach at an AWS Organizations root or OU and set the outer guardrail on what member accounts can do, capping every principal in those accounts. But an SCP only restricts, never permits, so it shapes the ceiling rather than opening a door. Granting actual access is still the job of identity and resource policies inside the accounts; the SCP just bounds what those policies are allowed to reach.

Trap Reaching for an SCP to let account B read a bucket in account A: that's the wrong layer, because access comes from a bucket policy in A paired with an identity policy in B, or from a role in A that a principal in B assumes.

5 questions test this

Service needs to act on your behalf → it's a service-linked role

A service-linked role^[11] is one AWS auto-creates, predefines, and manages when you enable a feature like Auto Scaling, Organizations, ECS, or Lambda@Edge, so the service has exactly the permissions it needs to act for you with no policy authoring from you. Because AWS owns the role's permissions, you can't reshape it the way you'd scope a normal role. The answer to "how does Service X act on my behalf" is usually that the SLR already exists.

Trap Hand-authoring a custom role for a feature that uses a service-linked role. AWS owns the SLR's permissions, so you can't reshape it the way you'd scope an ordinary role.

Block IMDS SSRF by enforcing IMDSv2 (HttpTokens=required)

IMDSv2^[2] requires a caller to first fetch a session token with a PUT before any metadata GET, breaking the SSRF chain where an app flaw tricks the server into reading role credentials from 169.254.169.254. A forged GET can't complete the PUT handshake. Enforce it on the launch template with HttpTokens=required, or org-wide with an SCP that denies any launch not requiring it, so no instance can quietly fall back to token-less IMDSv1.

Trap Leaving IMDSv1 on just because the instance "is behind a load balancer": IMDSv1 needs no token, so any server-side request forgery in your app can reach the metadata endpoint and exfiltrate the role credentials.
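
The org-wide option can be sketched as an SCP that denies ec2:RunInstances unless the launch requires tokens. The helper name and the Sid are illustrative choices; the statement relies on the ec2:MetadataHttpTokens condition key.

```python
import json


def require_imdsv2_scp() -> dict:
    """Build an SCP that blocks instance launches which still allow IMDSv1."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RequireImdsV2",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # Deny whenever the launch does not set HttpTokens=required.
            "Condition": {"StringNotEquals": {"ec2:MetadataHttpTokens": "required"}},
        }],
    }


scp = require_imdsv2_scp()
print(json.dumps(scp, indent=2))
```

Because an SCP only restricts, this statement does not let anyone launch instances; it just removes the token-less path from whatever the account's own policies allow.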

An explicit Deny anywhere wins over every Allow

An explicit Deny in any policy layer short-circuits the whole evaluation^[13], so no Allow anywhere can override it. IAM evaluates SCP, permission boundary, identity policy, and resource policy together, and a single Deny among them ends the decision. A deny is therefore the reliable way to carve out a hard exception, and troubleshooting "why is this blocked" means hunting for the explicit Deny rather than piling on more Allows.

Trap Assuming an admin with *:* can do anything. When "Admin can't access the bucket" the cause is almost always an explicit Deny on their principal in the resource policy.
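
A toy evaluator makes the precedence concrete. It is a deliberate simplification: it ignores the per-layer requirements of real IAM evaluation and only shows that one explicit Deny beats any number of Allows. The function name and the tuple shape are assumptions for illustration.

```python
def decide(matching_statements):
    """Decide a request from (policy_layer, effect) pairs that matched it.

    Simplified model: any explicit Deny wins; otherwise any Allow passes;
    with neither, the request falls through to the implicit deny.
    """
    effects = [effect for _layer, effect in matching_statements]
    if "Deny" in effects:
        return "Deny (explicit)"
    if "Allow" in effects:
        return "Allow"
    return "Deny (implicit)"


# The "Admin can't access the bucket" case: *:* on the identity, Deny on the bucket policy.
admin_request = [("identity policy", "Allow"), ("bucket policy", "Deny")]
print(decide(admin_request))  # Deny (explicit)
```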

1 question tests this

Role sessions are 1h by default, up to 12h via MaxSessionDuration

A role session defaults to 1 hour and extends through the role's MaxSessionDuration^[14], settable between 1 and 12 hours, while IAM Identity Center permission sets carry their own 1–12h duration. Short sessions are the security default because the credentials expire on their own, so you raise the cap only as far as a genuinely long-running task needs.

Trap Expecting a federated session to run the full MaxSessionDuration: it's capped by both the IdP token lifetime and the role's MaxSessionDuration, and the shorter of the two wins.
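
The "shorter of the two wins" rule is simple arithmetic. The sketch below assumes durations are expressed in seconds; the function and constant names are illustrative.

```python
DEFAULT_SESSION_SECONDS = 3600        # 1 hour, the default role session
MAX_ALLOWED_SECONDS = 12 * 3600       # 12 hours, the ceiling for MaxSessionDuration


def effective_session_seconds(idp_token_seconds: int, role_max_session_seconds: int) -> int:
    """Return how long a federated session can actually last."""
    if not DEFAULT_SESSION_SECONDS <= role_max_session_seconds <= MAX_ALLOWED_SECONDS:
        raise ValueError("MaxSessionDuration must be between 1 and 12 hours")
    # Both limits apply, so the shorter one decides.
    return min(idp_token_seconds, role_max_session_seconds)


# An 8-hour IdP token against a 12-hour role cap yields an 8-hour session.
print(effective_session_seconds(8 * 3600, 12 * 3600))  # 28800
```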

Central admin into Org accounts → OrganizationAccountAccessRole

The OrganizationAccountAccessRole is created automatically with full admin in any account you create from within AWS Organizations^[15], and it trusts the management account, so the cross-account admin path is pre-wired and an authorized principal in the management account simply assumes it. That removes the per-account bootstrap of hand-building a trust role each time you create a new account. An account invited into the org does not get the role automatically, so create it there by hand if you want the same path.

Trap Hand-building a cross-account trust role in every account you create in the org: OrganizationAccountAccessRole is created automatically there and already trusts the management account.

Find accidental public/cross-account grants with IAM Access Analyzer

IAM Access Analyzer^[16] reasons over resource policies on S3 buckets, IAM roles, KMS keys, Lambda functions, SQS queues, and Secrets Manager secrets and flags any grant to a principal outside your account or organization, catching the "this bucket policy accidentally allows the internet" problem before an attacker does. It's free to enable, Region by Region, and it's the systematic alternative to manual policy review, which can't keep up as policies multiply.

Trap Reaching for GuardDuty or Macie to catch an over-permissive resource policy: those watch threats and data, whereas IAM Access Analyzer reasons over the policies themselves for external access.

Grant a resource to your whole org with aws:PrincipalOrgID

The aws:PrincipalOrgID global condition key, placed in a resource policy such as a bucket policy, admits only principals whose accounts belong to the named AWS Organization, so you express "anyone in my org" once instead of enumerating account IDs. Because it matches on org membership rather than a fixed list, any account you add later is covered automatically with no policy edit.

Trap Listing every account ID in the Principal element instead: that doesn't scale and silently leaves out new accounts, whereas aws:PrincipalOrgID grows with the org.
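
A bucket policy using this key might be built as in the sketch below. The bucket name and organization ID are placeholders and the helper name is hypothetical; the wildcard Principal is safe only because the condition narrows it to the org.

```python
import json


def org_wide_read_policy(bucket_name: str, org_id: str) -> dict:
    """Build a bucket policy that lets any principal in one organization read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowReadFromMyOrg",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
            # Membership check: new accounts in the org match without a policy edit.
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": org_id}},
        }],
    }


# Placeholder values, for illustration only.
org_policy = org_wide_read_policy("example-bucket", "o-exampleorgid")
print(json.dumps(org_policy, indent=2))
```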

4 questions test this

Cross-account S3 needs BOTH a bucket policy and an IAM policy

Cross-account S3 is authorized only by the intersection of both sides when the requesting identity and the bucket sit in different accounts: a bucket policy in the resource account granting the action to the cross-account principal, plus an identity-based policy in the requesting account allowing that principal to reach the bucket ARN. Either side alone leaves a gap, because each account independently has to consent to the access.

Trap Setting only the bucket policy (or only the IAM policy) for cross-account S3: either side alone leaves a gap, because both accounts must independently allow the access.

8 questions test this

Require recent MFA on a role with aws:MultiFactorAuthAge

The aws:MultiFactorAuthAge condition on a trust policy gates role assumption on recently re-authenticated MFA: set a NumericLessThan in seconds, where 3600 requires the MFA to be under an hour old. This puts sensitive actions behind a fresh step-up, because the condition measures how long ago the MFA happened rather than merely that it ever happened.

Trap Using aws:MultiFactorAuthPresent instead: it only confirms MFA happened at some point in the session, not how recently, so a session authenticated hours ago still passes.
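
As a sketch, a trust policy that requires MFA less than an hour old could be assembled like this. The account ID is a placeholder and the helper name is an assumption; the NumericLessThan operator and the 3600-second threshold come from the text above.

```python
import json


def recent_mfa_trust_policy(trusted_account_id: str, max_mfa_age_seconds: int = 3600) -> dict:
    """Build a trust policy that only admits callers whose MFA is recent enough."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
            "Action": "sts:AssumeRole",
            # Age of the MFA authentication in seconds, not merely its presence.
            "Condition": {
                "NumericLessThan": {"aws:MultiFactorAuthAge": str(max_mfa_age_seconds)}
            },
        }],
    }


mfa_policy = recent_mfa_trust_policy("111122223333")  # placeholder account ID
print(json.dumps(mfa_policy, indent=2))
```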

Secure Workloads and Applications

Layer your security controls so one misconfig isn't game over
Keep in-VPC traffic to AWS services off the internet with VPC endpoints
Security groups are stateful; NACLs are stateless
Need credentials that rotate? Use Secrets Manager, not static secrets
WAF attaches to CloudFront, ALB, API Gateway, and AppSync (plus a few others such as Cognito user pools), not to NLB or EC2
Shield Advanced costs $3,000/month per payer account, not per linked account
Secrets Manager rotation runs on a configurable cron/rate schedule
GuardDuty detects and Security Hub aggregates, but neither remediates
Static config goes in Parameter Store; auto-rotating credentials go in Secrets Manager
Auto-remediate GuardDuty findings with EventBridge → Lambda
Inspector continuously scans EC2, ECR, and Lambda for you automatically
For VPC-level traffic inspection across subnets, use AWS Network Firewall
Need zero auth failures during rotation? Use the alternating-users strategy
For OWASP Top 10 coverage with no maintenance, use AWS Managed Rules
Throttle HTTP floods per source IP with WAF rate-based rules
For tier-to-tier traffic, reference a security group ID as the source instead of a CIDR
WAF evaluates the lowest priority number first, and Allow and Block are terminating
To exclude specific traffic from a managed rule group, use a scope-down statement
Custom NACLs deny everything until you add allows, and rule number sets precedence
Lock S3 access to one VPC endpoint with a bucket policy using aws:SourceVpce

Data Security Controls

Encrypt compliance data both at rest and in transit

Compliance data needs protection on both legs of its life (audits like HIPAA, PCI, and GDPR require it), so encrypt at rest with SSE/KMS and in transit with TLS, because neither covers the other. S3 already applies SSE-S3 by default since Jan 2023, so the at-rest leg is mostly handled, but in transit is on you: enforce it with a bucket policy that denies any request where aws:SecureTransport is false, forcing every connection onto TLS.

Trap Assuming S3's default SSE-S3 also protects data in transit: at-rest encryption says nothing about TLS; enforce the transit leg separately by denying aws:SecureTransport=false.
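
The transit-leg enforcement can be sketched as a bucket policy with a single Deny statement. The bucket name is a placeholder and the helper name is hypothetical.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects every request not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both bucket-level and object-level operations.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


tls_policy = deny_insecure_transport_policy("example-bucket")  # placeholder bucket name
print(json.dumps(tls_policy, indent=2))
```

Because this is an explicit Deny, it holds no matter what other statements or identity policies allow.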

3 questions test this

Pick a KMS key type by control and cost, lowest first

Pick the cheapest KMS key tier that still meets your control requirement, climbing only as far as you need. AWS-owned keys are free but give no visibility; AWS-managed keys are free and add a CloudTrail audit trail; customer-managed keys cost ~$1/key/month plus per-request but unlock your own key policy, cross-account sharing, and rotation control; and CloudHSM is the most expensive, giving a dedicated single-tenant FIPS 140-2 Level 3 module. Move up a tier only when a control requirement forces it, never by reflex.

Trap Reaching for CloudHSM for routine encryption: it's only worth it when you genuinely need single-tenant FIPS 140-2 Level 3, otherwise it just costs more than a customer-managed key.

5 questions test this

Control S3 access at the bucket level, not per object

Govern an S3 bucket with account- and bucket-level controls (Block Public Access plus default encryption) so protection applies uniformly and can't be undone object by object. Bucket-wide controls are the right altitude because they're auditable and don't drift, whereas per-object ACLs and policies multiply with object count, so they don't scale and are painful to verify across a large bucket.

Trap Securing a bucket with per-object ACLs: they're easy to misconfigure and impossible to audit at scale, which is exactly why Block Public Access exists to override them.

3 questions test this

Use Macie to find and classify sensitive data in S3

Macie is the managed service for discovering and classifying sensitive data (PII, PHI, financial data, credentials) in S3, using AWS-managed identifiers plus your own custom regex and routing findings through EventBridge for auto-remediation. It works on data already stored, scanning asynchronously, and runs ~$1/GB scanned, which is why you point it at the buckets that matter rather than sweeping the whole org blindly.

Trap Expecting Macie to be a real-time inline guard on uploads: it's an asynchronous discovery and classification job over data at rest, not an upload-path filter.

12 questions test this

New S3 buckets have been SSE-S3 encrypted by default since Jan 2023

Every new S3 bucket gets SSE-S3 automatically^[1] since January 2023 with no action from you, and you can upgrade a bucket to SSE-KMS whenever you want tighter key control. The change is forward-looking, so objects written before 2023 may still be unencrypted: the AWS Config rule s3-bucket-server-side-encryption-enabled checks bucket-level settings, so audit the objects themselves (for example with an S3 Inventory report, which lists each object's encryption status) rather than assuming the default reached back in time.

Trap Assuming the Jan-2023 default went back and encrypted old objects: it only applies to objects written after the change, so legacy data stays unencrypted until you audit and re-encrypt it.

KMS automatic rotation is yearly and opt-in for customer-managed keys

Customer-managed keys rotate the underlying key material automatically once a year^[18] (the 365-day default, configurable from 90 to 2560 days via RotationPeriodInDays) but only after you opt in with the EnableKeyRotation API. It's off until you ask for it. AWS-managed keys already rotate yearly with nothing to configure. The exception is imported key material: KMS can't regenerate material it didn't create, so it never auto-rotates and you rotate it by re-importing.

Trap Turning on EnableKeyRotation and assuming imported key material rotates too. It's excluded, so it stays static until you manually re-import.

2 questions test this

KMS deletion is never instant: a 7–30 day window you can cancel

Deleting a KMS key is never instant: ScheduleKeyDeletion^[19] starts a mandatory 7–30 day waiting period, and you can CancelKeyDeletion anytime inside it. The delay is deliberate: destroying a key makes everything it encrypted permanently unreadable, so the window is your chance to catch a mistake before it's irreversible. CloudHSM is the contrast: it lets you destroy a key immediately because you own the HSM.

Trap Counting on a deleted KMS key being recoverable like a soft-deleted resource: once the window elapses it's gone for good, so CancelKeyDeletion during the window is your only safety net.

Macie cost scales with data, so scan once then target critical buckets

Macie charges ~$1/GB scanned plus per-object analysis, so the bill tracks how much data you point it at: steep on petabyte-scale buckets, and rescanning unchanged data adds cost without adding signal. The cost-aware pattern is one full discovery pass after a migration, then scheduled monthly scans limited to critical buckets, using excludes^[15] filters to skip large prefixes you already know are non-sensitive.

Trap Running continuous org-wide Macie discovery to be thorough: rescanning the same data over and over just multiplies the per-GB cost for no new signal, so scope and schedule it instead.

3 questions test this

Give a short-lived workload a key with a KMS grant, not a policy edit

When a short-lived workload needs key access, issue a grant^[20], a temporary, programmatic permission you create with kms:CreateGrant and retire when done, because a grant is scoped to specific operations and goes away cleanly. Editing the key policy is permanent and shared: it changes who can use the key until someone edits it back, so it's wrong for ephemeral consumers. This is exactly the mechanism AWS services like RDS encryption use, creating grants for you behind the scenes.

Trap Editing the key policy for every ephemeral consumer: it permanently bloats the policy and has to be revoked by hand, whereas a grant is scoped and retirable for exactly this case.

3 questions test this

Lock a KMS key to one service with the kms:ViaService condition

The kms:ViaService^[21] condition in a key policy (e.g. Condition: { StringEquals: { kms:ViaService: "s3.us-east-1.amazonaws.com" } }) restricts a key so it can be used only when the request comes through that one service in that one Region. It checks which service is making the call on a principal's behalf, defending against stolen credentials being replayed against the key through some other service: a containment control that complements, not replaces, the permissions on the principal itself.

Trap Assuming kms:ViaService restricts which principal can use the key: it constrains the calling service and Region, not the identity, so it pairs with principal-level permissions rather than replacing them.
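
Written out as a full key-policy statement, the condition might look like the sketch below. The role ARN is a placeholder and the helper name is hypothetical; note that the Principal element still names who may use the key, while the condition only constrains the path.

```python
import json


def via_service_statement(principal_arn: str, service_endpoint: str) -> dict:
    """Build a key-policy statement usable only through one service in one Region."""
    return {
        "Sid": "AllowUseOnlyViaOneService",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},              # the identity is still set here
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": "*",                                   # in a key policy, "*" means this key
        "Condition": {"StringEquals": {"kms:ViaService": service_endpoint}},
    }


# Placeholder role ARN; the endpoint value is the one used in the text above.
statement = via_service_statement(
    "arn:aws:iam::111122223333:role/app-role",
    "s3.us-east-1.amazonaws.com",
)
print(json.dumps(statement, indent=2))
```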

4 questions test this

Use S3 Object Lock for immutable retention. Compliance mode blocks even root

S3 Object Lock provides WORM (write-once-read-many) immutability in two modes: Governance, where a privileged admin can still override the lock, and Compliance, where nobody (not even the root user) can shorten retention or delete during the window. Compliance satisfies records rules like SEC 17a-4(f), FINRA, and CFTC^[22] precisely because no one can tamper with it. You apply it per object or as a bucket default retention policy, and versioning must be enabled first.

Trap Using Governance mode for a regulatory WORM requirement: a privileged admin can override it, so only Compliance mode satisfies SEC 17a-4(f)-style immutability.

2 questions test this

An ACM cert for CloudFront has to live in us-east-1

An ACM certificate attached to CloudFront must be requested or imported in US East (N. Virginia) / us-east-1, regardless of where your origin or users sit, because CloudFront is a global service that distributes that single us-east-1 certificate out to every edge location. A cert created in any other Region won't appear in the CloudFront console for you to select.

Trap Requesting the ACM cert in the origin's Region where your ALB or S3 bucket lives: CloudFront only sees certificates in us-east-1, so it'll never appear for you to attach.

4 questions test this

For cross-account S3 uploads, set Object Ownership to Bucket owner enforced

Setting S3 Object Ownership to Bucket owner enforced makes the bucket owner automatically own every object (even ones uploaded from another account) while ACLs are disabled entirely, so access is governed purely by bucket and IAM policies. That's the clean fix for cross-account uploads, where, when ACLs are enabled, the uploading account retains ownership of its objects and leaves the bucket owner unable to read them. It's now the recommended default for new buckets for exactly this reason.

Trap Granting the bucket owner an ACL on each cross-account object instead: that's brittle per-object work, whereas Bucket owner enforced transfers ownership automatically and removes ACLs entirely.

10 questions test this

Org-level S3 Block Public Access overrides account and bucket settings

S3 Block Public Access enforced at the AWS Organizations root or OU propagates to every member account, including ones that join later, and overrides account- and bucket-level settings so a local admin can't switch it off. That central enforcement removes accidental public exposure as an option across the whole org. If one account legitimately needs a public bucket, the carve-out is made at the org by excluding that account, not by a local toggle.

Trap Telling a member-account admin to just toggle Block Public Access off locally: the org-level setting wins, so the carve-out has to happen at the organization, not the account.

7 questions test this

On high-traffic SSE-KMS buckets, enable S3 Bucket Keys to cut KMS costs ~99%

On a high-traffic SSE-KMS bucket, enable S3 Bucket Keys: KMS hands S3 a short-lived bucket-level key from which S3 derives the per-object data keys itself, collapsing many per-object KMS calls into far fewer, cutting KMS request charges by up to ~99%. There's no security trade-off, because every object is still encrypted under your customer managed key; you've only changed how often S3 has to call KMS.

Trap Switching from SSE-KMS to SSE-S3 just to dodge per-request KMS charges: that throws away the customer-managed-key control, whereas Bucket Keys cut the cost while keeping the CMK.

6 questions test this

Cross-account KMS access needs both a key policy and an IAM policy

Cross-account use of a customer managed KMS key takes consent from both sides: the key policy in the owning account must grant the external account or principal, and an IAM policy in the external account must allow those principals to use that specific key ARN. The key policy decides who may have access and the IAM policy decides who actually does. Both have to line up, so neither alone authorizes the call.

Trap Setting only the key policy as you would for same-account access. Cross-account principals also need an explicit IAM allow on the key ARN, or the call still fails.

8 questions test this

Changing default encryption doesn't touch existing objects: re-encrypt with S3 Batch Operations

Updating a bucket's default encryption is forward-looking only: it applies to newly uploaded objects while everything already stored keeps its original encryption, so flipping the setting alone never re-keys old data. To re-encrypt objects already in the bucket (even billions) under a new SSE-KMS customer managed key, run S3 Batch Operations with the Copy operation, copying objects back into the same bucket with the new encryption, which rewrites each object under the new key.

Trap Assuming flipping the bucket's default encryption re-encrypts what's already stored: it's forward-looking only, so old objects stay on the old key until you copy them.

4 questions test this

KMS keys are Regional, so cross-Region replicas need a key in the target Region

A KMS key is a Regional resource and can't be invoked from another Region, so any cross-Region encrypted copy needs its own key in the destination Region. An encrypted cross-Region RDS read replica must be given a customer managed (or AWS managed) key that exists in the target Region, and an S3 bucket can only use a same-Region KMS key for SSE-KMS. The encryption key always has to be local to the data.

Trap Pointing the destination Region at the source key's ARN: KMS won't resolve it across Regions, so the replica creation fails until a key exists locally.

4 questions test this

Run org-wide Macie from a delegated security account, not the management account

In AWS Organizations the management account designates a dedicated security account as the Macie delegated administrator, and that account is where you enable Macie across members, run org-wide discovery jobs, and aggregate findings centrally. Delegating this out of the management account keeps day-to-day security operations (and their blast radius) away from the most privileged account in the org. Automated sensitive data discovery then samples objects to score each bucket's sensitivity cost-efficiently.

Trap Operating Macie straight from the Organizations management account: AWS guidance is to delegate to a separate security account to limit management-account exposure and blast radius.

6 questions test this

For CloudFront HTTPS with a custom cert, use free SNI SSL, not Dedicated IP

For CloudFront HTTPS with your custom certificate, use SNI (Server Name Indication) SSL, which lets one IP present the right cert per hostname and carries no extra monthly charge. Every browser released after 2010 supports it, so it's the right default for essentially all modern traffic. Dedicated IP SSL exists only for the rare legacy client that can't do SNI, and it adds a per-distribution monthly fee, so you pick it for compatibility, never for robustness.

Trap Picking Dedicated IP SSL to look more robust: it just adds per-distribution monthly cost, whereas SNI is the default and works for every modern browser.

6 questions test this

To encrypt CloudFront-to-origin, set Origin Protocol Policy to HTTPS Only with min TLS 1.2

To encrypt the leg between CloudFront and a custom origin, set the Origin Protocol Policy to HTTPS Only with a minimum origin SSL protocol of TLSv1.2, forcing every origin fetch onto modern TLS. Once HTTPS Only is on, CloudFront validates the origin's certificate and returns HTTP 502 (Bad Gateway) if it's self-signed or otherwise untrusted, so the origin must present a certificate signed by a trusted CA for the connection to succeed.

Trap Leaving a self-signed cert on the origin under HTTPS Only and then blaming the 502 on CloudFront. CloudFront refuses untrusted origin certs, so it needs a CA-signed certificate.

3 questions test this

SSE-KMS needs kms:GenerateDataKey to upload and kms:Decrypt to download

SSE-KMS splits the two directions across two permissions: uploading makes S3 ask KMS to generate a data key, needing kms:GenerateDataKey, while downloading makes S3 decrypt that data key, needing kms:Decrypt. An app that both reads and writes must hold both, because a policy missing either one fails only the matching S3 operation with AccessDenied while the other keeps working, which is what makes the gap confusing to diagnose.

Trap Granting only kms:Decrypt just because the app reads objects: uploads will then fail with AccessDenied, since writing SSE-KMS objects requires kms:GenerateDataKey.
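
A small lookup makes the split easy to check when diagnosing an AccessDenied. This is an illustrative helper, not an AWS API: it only covers the two S3 operations and two KMS actions named above.

```python
# Which KMS action each S3 operation needs on an SSE-KMS bucket (per the text above).
REQUIRED_KMS_ACTION = {
    "PutObject": "kms:GenerateDataKey",   # upload: S3 asks KMS for a data key
    "GetObject": "kms:Decrypt",           # download: S3 asks KMS to decrypt the data key
}


def missing_kms_actions(s3_operations, granted_actions):
    """Return the KMS actions the app still lacks for the S3 operations it performs."""
    needed = {REQUIRED_KMS_ACTION[op] for op in s3_operations}
    return sorted(needed - set(granted_actions))


# An app that reads and writes but was only granted kms:Decrypt.
print(missing_kms_actions(["GetObject", "PutObject"], ["kms:Decrypt"]))
# ['kms:GenerateDataKey']
```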

5 questions test this

Cross-account secret access needs a secret resource policy and the KMS key policy

Cross-account reads of a Secrets Manager secret require two grants together: a resource-based policy on the secret allowing secretsmanager:GetSecretValue, and a policy on the encrypting KMS key allowing kms:Decrypt, because retrieving the secret means decrypting it. This is why the default aws/secretsmanager key doesn't work cross-account: its key policy is immutable and can't be edited to admit the other account, so cross-account access forces you onto a customer managed key.

Trap Encrypting the secret with the default aws/secretsmanager key and expecting cross-account access: its key policy can't be edited to grant the other account, so you have to use a customer managed key.

4 questions test this

Scope a rotation Lambda's KMS decrypt to one secret with kms:EncryptionContext:SecretARN

A rotation Lambda for a secret encrypted with a customer managed key needs kms:Decrypt on that key, but granting it plainly lets it decrypt every secret sharing the key. Add a condition on kms:EncryptionContext:SecretARN and the decrypt permission is confined to the one secret it's meant to rotate: least privilege in action, since the encryption context is bound to the secret's ARN and KMS only allows the decrypt when that context matches.

Trap Granting the rotation function plain kms:Decrypt on the shared key: it can then decrypt every secret encrypted with that key, so the SecretARN encryption-context condition is what confines it.
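
The scoped permission could be expressed as an identity-policy statement like the sketch below. Both ARNs are placeholders and the helper name is hypothetical; the condition key is the one named above.

```python
import json


def rotation_decrypt_statement(key_arn: str, secret_arn: str) -> dict:
    """Build a statement that lets a rotation function decrypt exactly one secret."""
    return {
        "Effect": "Allow",
        "Action": "kms:Decrypt",
        "Resource": key_arn,
        # KMS allows the decrypt only when the encryption context names this secret.
        "Condition": {"StringEquals": {"kms:EncryptionContext:SecretARN": secret_arn}},
    }


# Placeholder ARNs, for illustration only.
rotation_stmt = rotation_decrypt_statement(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
    "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-secret",
)
print(json.dumps(rotation_stmt, indent=2))
```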

Restrict what flows through a VPC endpoint with an endpoint policy

An Interface or Gateway endpoint can carry its own endpoint policy^[23] that limits which API actions and resources may pass through it, for example an S3 Gateway endpoint scoped to permit s3:GetObject on your buckets only and deny everything else routed through it. It's a filter on the path, not a grant of access, so it's an extra constraint evaluated alongside the principal's IAM and the bucket policy rather than replacing them.

Trap Assuming an endpoint policy grants access on its own: it only filters what the endpoint permits, so the principal still needs IAM and bucket permissions, and the two are evaluated together.
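
An S3 Gateway endpoint policy scoped this way might be sketched as follows. The bucket name is a placeholder and the helper name is hypothetical; anything the policy does not allow is simply not permitted through the endpoint.

```python
import json


def read_only_endpoint_policy(bucket_name: str) -> dict:
    """Build an endpoint policy that lets only s3:GetObject on one bucket pass through."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }],
    }


endpoint_policy = read_only_endpoint_policy("example-bucket")  # placeholder bucket name
print(json.dumps(endpoint_policy, indent=2))
```

The wildcard Principal here does not hand out access: the caller still needs its own IAM and bucket permissions, and the endpoint policy only narrows what can travel over this path.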

Design Resilient Architectures

Scalable and Loosely Coupled Architectures

Pick the integration primitive by how consumers consume
Every queue needs a DLQ and retry policy or it loses messages
Set SQS visibility timeout to at least 6× the Lambda timeout
FIFO caps at 300/3,000 TPS unless high-throughput mode is on
SNS doesn't retain messages, so front it with SQS for durability
Use EventBridge archive and replay to rebuild or test consumers from history
Step Functions: Standard for durable orchestration, Express for high throughput
Use EventBridge Pipes to wire a source to a target without Lambda glue
Turn on SQS long polling to cut empty receives and cost
API Gateway throttles at the account, stage, and method levels
Scale an ASG by traffic with ALBRequestCountPerTarget target tracking
Use Step Functions Distributed Map for million-scale S3 parallel processing
Pause a workflow for an external callback with .waitForTaskToken
Combine Retry exponential backoff with Catch for fallback handling
Catch States.ALL on a Parallel state to intercept any branch failure
Use ReportBatchItemFailures so only failed SQS messages reprocess
DLQ retention must exceed the source queue's, or messages expire on arrival
Use EventBridge global endpoints for automatic multi-region failover
Attach a DLQ per EventBridge rule target to capture undeliverable events

Highly Available and Fault-Tolerant Architectures

Cheat sheet

Sharp facts the exam loves — scan these before test day.

Multi-AZ keeps you up inside a region; Multi-Region survives losing the whole region

Multi-AZ delivers in-region high availability: you spread stateless tiers across at least 2 AZs behind a load balancer, and RDS Multi-AZ adds a synchronous standby in another AZ that takes over in 60-120 s. Together that survives any single AZ going down without losing data. Multi-Region is a far bigger commitment, bringing real latency, cost, and replication complexity, so reach for it only when losing an entire region is actually in your threat model. Most availability requirements are satisfied by Multi-AZ; Multi-Region is for regional disaster recovery, not everyday uptime.

Trap Treating Multi-AZ as disaster recovery. It covers you when one AZ goes down, not when the whole region does.

3 questions test this

Pick the cheapest DR strategy that still meets the RTO/RPO you were given

The four canonical DR strategies trade cost against recovery speed in lockstep: backup-and-restore recovers in hours and costs the least, pilot-light in minutes to hours, warm-standby in minutes, and multi-site active-active in near-zero time at the highest cost. Because faster recovery always costs more, pin down the RTO (how fast you must be back) and RPO (how much data you can lose) the requirement actually demands, then take the least expensive option that clears both. Overshooting the requirement just burns money on recovery speed nobody asked for.

Trap Reaching for active-active just because its recovery numbers are best: it's also the priciest, so it's the wrong call whenever a looser RTO/RPO would do.
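
The "cheapest option that clears both targets" rule can be sketched as a simple lookup. The RTO/RPO minutes in the table are illustrative placeholders chosen for this sketch, not AWS-published figures, and the function name is hypothetical.

```python
# Strategies ordered cheapest first. The RTO/RPO minutes are illustrative
# placeholders chosen for this sketch, not AWS-published figures.
STRATEGIES = [
    ("backup-and-restore", 24 * 60, 24 * 60),
    ("pilot-light", 60, 15),
    ("warm-standby", 10, 5),
    ("multi-site active-active", 1, 1),
]

def cheapest_strategy(required_rto_min, required_rpo_min):
    """Return the first (cheapest) strategy whose RTO and RPO both fit."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto_min and rpo <= required_rpo_min:
            return name
    return None  # nothing in the table meets the requirement
```

Because the list is ordered by cost, the first match is also the least expensive one that satisfies the requirement.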

1 question tests this

The default stateless tier is an ASG across at least 2 AZs behind an ELB with health checks

The default stateless tier is an Auto Scaling Group spread over at least 2 AZs behind an ALB or NLB, with health checks on the target group, so an unhealthy instance is drained and automatically replaced and an AZ failure just shifts load to the survivors. The ASG's min/max/desired values set the capacity bounds, and target-tracking or step scaling policies move desired capacity in response to demand. This combination (redundancy across AZs plus self-healing plus elasticity) is the baseline almost every stateless web tier should start from.

Trap Placing the Auto Scaling Group in a single AZ: spreading instances within one zone gives no protection when that AZ fails; the baseline spans at least two.

18 questions test this

Match the Route 53 routing policy to what you're actually trying to do

Route 53 offers eight routing policies (IP-based routing was added in 2022), each encoding a different intent, so match the policy to what you're trying to do. Simple returns one answer; Weighted splits traffic by proportion for A/B or gradual rollout; Latency sends users to the region with the lowest response time; Failover does active/passive via a health check; Geolocation routes by the user's country (compliance or localized content); Geoproximity routes by geographic distance with a bias knob to expand or shrink a region's pull; IP-based routes by the client's source CIDR block; and Multi-value does DNS-level load balancing across healthy answers. Reading the requirement for its underlying intent tells you which to pick.

Trap Using Geolocation to send people to the lowest-latency region: Geolocation routes by where the user is, while Latency routing is the one that optimizes for response time.
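
One way to drill the intent-to-policy mapping is a plain lookup table. The intent strings below are hypothetical labels invented for this sketch; only the policy names come from the text above.

```python
# Intent -> Route 53 routing policy, following the descriptions above.
# The intent strings are hypothetical labels invented for this sketch.
POLICY_FOR_INTENT = {
    "single answer": "Simple",
    "split traffic by proportion": "Weighted",
    "lowest response time": "Latency",
    "active/passive": "Failover",
    "route by user country": "Geolocation",
    "route by distance with bias": "Geoproximity",
    "route by client cidr block": "IP-based",
    "dns-level load balancing": "Multivalue answer",
}

def pick_policy(intent):
    """Look up the routing policy for a (case-insensitive) intent label."""
    return POLICY_FOR_INTENT[intent.lower()]
```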

11 questions test this

RDS Multi-AZ failover flips DNS over to the standby in about 60-120 s

RDS Multi-AZ automatic failover^[1] repoints the database's DNS endpoint at the standby, so a client that cached the old DNS entry sees roughly 60-120 s of errors and must reconnect once its connection drops. Design clients to reconnect rather than assume the address is stable. Multi-AZ is purely an availability feature, not a scaling one: the standby is not readable while it stands by. Offloading read traffic, as opposed to surviving an AZ failure, is the job of a separate feature, Read Replicas^[13].

Trap Pointing read traffic at the Multi-AZ standby to take load off the primary: the standby serves no reads, so you need a Read Replica for that.
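
A minimal sketch of the "design clients to reconnect" advice, assuming a hypothetical zero-argument `connect` callable that opens a fresh connection (and so resolves the endpoint again) or raises ConnectionError. The retry counts and delays are illustrative, not recommended values.

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts.

    `connect` is a hypothetical zero-argument callable that opens a fresh
    connection or raises ConnectionError.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)  # exponential backoff
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.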

4 questions test this

For cross-region RPO under a second, use Aurora Global Database

When you need cross-region RPO under a second, reach for Aurora Global Database^[2]: it replicates from a primary region to secondaries over a dedicated, purpose-built network, which is why RPO is typically under a second and failover RTO is around a minute via managed promotion. Secondary regions are read-only, good for serving low-latency reads close to users, but any one can be promoted to a writer when you need to fail the whole workload over. That managed, sub-second cross-region replication is a large step up from hand-wiring cross-region read replicas yourself.

Trap Hand-wiring cross-region read replicas for sub-second cross-region RPO: they lag more and lack the one-click managed promotion Aurora Global Database is built for.

2 questions test this

Route 53 health checks default to a 30 s interval and 3 failures before unhealthy

Route 53 health checks default to a 30 s interval^[14] with both the healthy and unhealthy thresholds at 3, so by default an endpoint must fail three consecutive checks before Route 53 marks it unhealthy and stops returning it. To fail over faster you can drop to a 10 s interval, which costs extra, and 'calculated' health checks combine several endpoint checks with AND/OR logic for a more nuanced healthy condition. Tune the interval and thresholds to balance failover speed against false positives from a single transient blip.
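
The trade-off between interval and threshold comes down to simple arithmetic. The sketch below is an approximation that ignores DNS TTLs and the spread of checker timing; the function name is hypothetical.

```python
def seconds_to_unhealthy(interval_s=30, failure_threshold=3):
    """Rough time for the consecutive failed checks to accumulate."""
    return interval_s * failure_threshold

print(seconds_to_unhealthy())       # default interval and threshold
print(seconds_to_unhealthy(10, 3))  # fast interval, same threshold
```

With the defaults this works out to about 90 s before the endpoint is marked unhealthy, versus about 30 s at the 10 s interval.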

5 questions test this

Use the ELB health check type, not EC2, so you catch app-layer failures

ASG HealthCheckType=EC2^[15] only replaces instances the EC2 host itself flags as unhealthy. That catches hardware and hypervisor faults but not a wedged application, since a crashed app on a healthy host still passes the EC2 check. Switch to HealthCheckType=ELB and the ASG also replaces any instance that fails the load balancer's health check, catching application-layer failures like a hung process or a 500-returning endpoint. That's the setting you want in production, where 'the box is up but the app is dead' is the failure that actually hurts.

Trap Leaving the ASG on the default EC2 health check expecting it to recycle instances whose app has crashed. A hung process on healthy hardware still reports healthy and never gets replaced.
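
The difference between the two health check types can be modeled in a few lines. This is a simplified sketch of the behavior described above, not the Auto Scaling implementation; the function and argument names are hypothetical.

```python
def asg_replaces_instance(health_check_type, ec2_status_ok, elb_check_ok):
    """Model of when the ASG treats an instance as unhealthy."""
    if not ec2_status_ok:
        return True              # both types act on EC2 status failures
    if health_check_type == "ELB":
        return not elb_check_ok  # ELB type also honors the LB health check
    return False                 # EC2 type ignores app-layer failures
```

The case that matters is a healthy host with a dead app: only the ELB type returns True for it.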

10 questions test this

S3 CRR replicates new objects asynchronously, filtered by prefix or tag

S3 Cross-Region Replication copies new objects^[4] from the source bucket to a destination bucket in another region asynchronously, usually within seconds, and you can scope it to just the objects matching a prefix or tag instead of the whole bucket. Versioning must be enabled on both buckets because replication tracks object versions. The key limitation is that CRR is forward-looking only: it acts on objects written after you turn it on, so anything that already existed needs a one-time S3 Batch Replication job to backfill.

Trap Assuming objects that were already there before you enabled CRR get copied: replication only touches objects written afterward, so the older ones need S3 Batch Replication.
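
A prefix-scoped rule might look roughly like the dictionary below. The shape follows the replication configuration structure of the S3 PutBucketReplication API as I understand it, so check it against the current API reference before use; every name and ARN is a hypothetical placeholder, and nothing here calls AWS.

```python
# All names and ARNs below are hypothetical placeholders.
replication_configuration = {
    "Role": "arn:aws:iam::111122223333:role/example-replication-role",
    "Rules": [
        {
            "ID": "replicate-logs-prefix",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # only objects under logs/
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::example-dr-bucket"},
        }
    ],
}

rule = replication_configuration["Rules"][0]
print(rule["Filter"]["Prefix"], "->", rule["Destination"]["Bucket"])
```

Both buckets would still need versioning enabled, and objects written before the rule existed would not be covered by it.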

8 questions test this

Centralize backups across services and accounts with AWS Backup

AWS Backup centralizes backups across services and accounts: you define backup plans and resource selections^[8] once and apply them across many services (RDS, DynamoDB, EFS, EBS, FSx, Storage Gateway and more) including cross-region and cross-account copies for DR, rather than configuring backups service by service. That central plane makes consistent retention and scheduling manageable at scale. To prove compliance or guarantee immutability, Backup Audit Manager reports on policy adherence and Backup Vault Lock enforces write-once retention that even an admin can't shorten.

1 question tests this

Route 53 failover only serves the secondary if the PRIMARY has a health check

Route 53 active/passive failover routing^[16] watches a health check on the primary record: while the primary is healthy Route 53 returns it, and only when that check fails does it serve the secondary. The health check therefore has to live on the primary record, because that's the signal that tells Route 53 it's time to fail away. Leave the primary without a health check and Route 53 has nothing to react to, so it keeps returning the primary forever even when the primary is down.

Trap Attaching the health check to the secondary record and expecting failover: it has to be on the primary, or Route 53 never fails away from it.
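
The rule reduces to a small decision function. This is a simplified model of the behavior described above (it ignores the secondary's own health), with hypothetical names.

```python
def failover_answer(primary_has_health_check, primary_passing):
    """Which record a failover pair returns, per the rule described above."""
    if primary_has_health_check and not primary_passing:
        return "secondary"
    # With no health check on the primary there is no failure signal,
    # so the primary keeps being returned.
    return "primary"
```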

5 questions test this

Set the ASG health check grace period to at least your app's startup time

The ASG health check grace period tells Auto Scaling how long to wait after a new instance reaches InService before it begins evaluating that instance's health, which exists because an app needs time to boot before it can pass checks. Set it to at least how long your application takes to start: if the grace period is shorter, the ELB health checks fail while the app is still coming up, the ASG concludes the instance is bad and terminates it, and you fall into a loop of launching and killing instances that never get a chance to go healthy. Sizing the grace period to real startup time breaks that churn.
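
A back-of-the-envelope check for the launch-and-kill loop, as a sketch. The 25% safety margin is an assumption of this example, not an AWS recommendation, and the function names are hypothetical.

```python
import math

def launch_loop_risk(grace_period_s, app_startup_s):
    """True when health checks start counting before the app can pass them."""
    return grace_period_s < app_startup_s

def recommended_grace_period(app_startup_s, margin=0.25):
    """Startup time plus a safety margin (the margin is an assumption)."""
    return math.ceil(app_startup_s * (1 + margin))
```

For an app that takes 180 s to start, a 60 s grace period is flagged as risky, and the sketch suggests 225 s.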

14 questions test this

Set the ALB deregistration delay to at least your longest request so targets drain cleanly

The ALB deregistration (connection-draining) delay, default 300 s, configurable from 0 to 3600 s, is how long the ALB waits before it finishes removing a target during scale-in: it immediately stops sending new requests to the removed target but holds off completing removal until the delay elapses. That window lets in-flight requests complete gracefully. Set the delay to at least your longest expected request time so long-running responses finish instead of being severed mid-flight, which would otherwise hand clients an HTTP 5xx during an ordinary scale-in or deployment.

Trap Setting the deregistration delay to 0 (or under your longest request) just to scale in faster: long in-flight requests get cut off and clients see 5xx errors.

7 questions test this

Use ALB slow start to warm up targets you've just registered

ALB slow start gradually ramps a freshly registered target's share of requests up to its full proportion over a window you choose, anywhere from 30 to 900 s, instead of hitting it with full traffic the instant it's healthy. Reach for it when an instance needs to warm up before it can perform (just-in-time cache warming, loading a large dataset into memory, or JIT compilation kicking in). During the ramp the target takes a smaller slice while it gets up to speed, so users don't pay for its cold-start latency, and once warm it joins normal balancing.

Trap Expecting connection draining (the deregistration delay) to ease a new target in: draining handles de-registration; slow start is what ramps a freshly registered target up.
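
The ramp can be pictured as a fraction of the target's full share over time. The sketch below assumes a linear ramp, which is my reading of the ALB documentation, and enforces the 30-900 s range from the text; the function name is hypothetical.

```python
def slow_start_share(elapsed_s, duration_s):
    """Fraction of its full traffic share a new target receives (linear ramp)."""
    if not 30 <= duration_s <= 900:
        raise ValueError("slow start duration must be 30-900 seconds")
    return min(1.0, max(0.0, elapsed_s / duration_s))
```

A quarter of the way through a 120 s window the target takes a quarter of its eventual share, and after the window it takes its full share.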

7 questions test this

ALB cross-zone load balancing is always on at the LB; you can only turn it off per target group

ALB cross-zone load balancing is always enabled at the load balancer level with no switch to disable it there. The only place you can turn it off is per target group, which overrides the LB default for that group. While it's on, every LB node spreads traffic evenly across all registered targets in every enabled AZ, so an AZ with fewer targets still gets its fair share rather than overloading its handful of instances. This even spreading is why ALB defaults it on, whereas an NLB leaves it off by default and lets you toggle it at the LB.

Trap Trying to disable cross-zone balancing at the ALB level the way you can on an NLB: for an ALB it's fixed on at the LB, and the target group is the only place you can change it.

4 questions test this

An NLB gives you one static IP per AZ, and you can add Elastic IPs for fixed addresses

An NLB automatically provisions one static IP per enabled AZ, and for internet-facing NLBs you can assign your own Elastic IP per AZ, giving clients fixed addresses they can allowlist in downstream firewalls. It operates at Layer 4 (TCP/UDP), delivers ultra-low latency, and preserves the client's source IP by default, which matters when the backend needs to see who's actually connecting. Choose the NLB over an ALB whenever a stable IP or raw L4 performance is the requirement, since the ALB gives you neither.

Trap Expecting an ALB to give you a static IP for firewall allowlisting: ALB IPs are dynamic, and only the NLB offers per-AZ static / Elastic IPs.

6 questions test this

Latency routing plus Evaluate Target Health gives you active-active multi-region failover

Set Evaluate Target Health to Yes on latency-based records to get active-active multi-region failover: while everything's healthy every region serves its own lowest-latency traffic, and the moment a region's resources turn unhealthy Route 53 stops routing users there and they fall to the next-closest healthy region. ETH is what makes the health of the underlying resources actually drive DNS: without it, latency routing keeps sending users to the nearest region even after it's down. If you've layered latency over weighted records, turning ETH on at the top-level alias means Route 53 calls a region unhealthy only once all its underlying weighted records have failed.

Trap Leaving Evaluate Target Health off on latency records: Route 53 will keep sending users to the nearest region even after its endpoints are down.
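
A toy model of latency routing with and without Evaluate Target Health. The region names and latency figures are illustrative, and treating "everything unhealthy" as "consider every region" is my understanding of Route 53's fallback behavior, stated here as an assumption.

```python
def pick_region(latency_ms, healthy, evaluate_target_health=True):
    """Return the lowest-latency region, skipping unhealthy ones when ETH is on."""
    candidates = [
        region for region in latency_ms
        if not evaluate_target_health or healthy[region]
    ]
    if not candidates:
        # Assumption: with nothing healthy, every region is considered again.
        candidates = list(latency_ms)
    return min(candidates, key=latency_ms.get)
```

With ETH off, the nearest region wins even when it is down, which is exactly the trap above.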

11 questions test this

Roll up many endpoint checks with a Route 53 calculated health check

A Route 53 calculated health check doesn't probe an endpoint itself; it watches a set of child health checks and reports healthy as long as the number of healthy children meets a threshold you define. That lets you express a quorum-style condition (stay healthy as long as at least 2 of 6 servers are up) so DNS failover fires only when a meaningful chunk of capacity is gone rather than overreacting to any single endpoint blipping out. It's the tool for turning many individual checks into one aggregate health signal.
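
The quorum condition is a one-line count, sketched here with a hypothetical function name.

```python
def calculated_health(children_healthy, threshold):
    """Healthy while at least `threshold` child checks report healthy."""
    return sum(children_healthy) >= threshold
```

For the "2 of 6" example, `calculated_health([True, True, False, False, False, False], 2)` stays healthy, and it flips once only one child remains up.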

5 questions test this

Weighted records plus health checks already give you active-active failover

Weighted records with health checks already give you active-active failover: you don't need the dedicated Failover policy, since any routing policy other than Failover becomes active-active once you attach health checks to its records. With weighted records, Route 53 splits traffic by the configured weights while everything's healthy, then drops any record whose health check fails and redistributes its share among the remaining healthy records. A record set to weight zero acts as a pure standby: it receives no traffic until every nonzero-weight record has gone unhealthy, at which point Route 53 falls back to it.

Trap Assuming you need the Failover policy to get failover: any non-Failover policy with health checks attached already drops unhealthy records on its own.
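
The redistribution and the zero-weight standby can be modeled as follows. This is a simplified sketch with hypothetical names; sharing traffic equally among zero-weight records is an assumption of the model.

```python
def traffic_shares(records):
    """records: {name: (weight, healthy)} -> {name: share of traffic}."""
    live = {n: w for n, (w, ok) in records.items() if ok and w > 0}
    if not live:
        # Fall back to healthy zero-weight records, shared equally.
        standby = [n for n, (w, ok) in records.items() if ok and w == 0]
        return {n: 1 / len(standby) for n in standby}
    total = sum(live.values())
    return {n: w / total for n, w in live.items()}
```

Healthy records split traffic by weight, an unhealthy record's share moves to the survivors, and the weight-zero record only appears once every nonzero-weight record is out.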

5 questions test this

Multi-tier DNS: latency alias to pick the region, weighted records within it

Multi-tier DNS nests routing policies for multi-region: latency alias records at the top choose the best region for each user, and each points at weighted records inside that region to spread traffic across local resources. Turn on Evaluate Target Health at the latency alias and health cascades up the tree: Route 53 counts a region healthy only when at least one of its weighted children is healthy, so a region with all children down is removed from latency routing automatically. Layering the policies this way optimizes region choice and in-region distribution at once.

Trap Leaving Evaluate Target Health off the latency alias: without it health doesn't cascade, and Route 53 keeps routing users to a region whose endpoints are all down.

2 questions test this

Aurora failover runs through priority tiers from 0 first to 15 last

Aurora failover runs through promotion priority tiers: every Aurora Replica carries a tier from 0 to 15, and when the writer fails Aurora promotes whichever available replica holds the lowest tier number (tier 0 first, tier 15 last). Use that to control which replica becomes the new writer: give tier 0 to your preferred standby, for example one whose instance class matches the primary so capacity doesn't drop on promotion, and push replicas reserved for analytics or reporting into higher tiers so they're promoted only as a last resort. Ties at the same tier are broken by the largest instance size.

Trap Reading a higher tier number as higher priority: tier 0 is promoted first, so the biggest number is promoted last.
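
The promotion order can be expressed as a sort key: lowest tier first, then largest instance. The replica names and the numeric `size_rank` are hypothetical stand-ins for real instance classes.

```python
def next_writer(replicas):
    """replicas: list of (name, tier, size_rank); larger size_rank = bigger instance."""
    available = [r for r in replicas if 0 <= r[1] <= 15]
    # Lowest tier wins; within a tier the larger instance wins.
    return min(available, key=lambda r: (r[1], -r[2]))[0]
```

Given two tier-0 standbys of different sizes and a tier-15 analytics replica, the larger tier-0 standby is chosen.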

3 questions test this

Copy RDS snapshots cross-region or cross-account to seed your DR

RDS automatic snapshots are pinned to the region and account where they were taken and can't be moved directly, which makes them unsuitable on their own for cross-region or cross-account DR. Manual snapshots, or copies you make of an automatic one^[17], can be copied cross-region (re-encrypted with a KMS key in the destination region) or shared cross-account, and that copied snapshot is exactly what seeds a pilot-light or warm-standby environment in the recovery region. So the DR-ready artifact is always a manual or copied snapshot, never the raw automatic one.

Trap Relying on automatic snapshots for cross-region DR: they stay tied to the source region and account and can't be shared directly, so you have to take a manual snapshot or copy the automatic one first.

Design High-Performing Architectures

High-Performing and Scalable Storage

Pick storage by how it's accessed: object, file, or block
Match the EBS volume type to whatever the workload is bound on
EFS performance mode and throughput mode are tuned separately
gp3 is the default, and you can migrate gp2 in place to save ~20%
For >80,000 IOPS or 99.999% durability, reach for io2 Block Express
Instance store is free but vanishes on stop, terminate, or hibernate
FSx for Lustre gives you HPC scratch at TB/s-class throughput, linked to S3
EFS Bursting banks throughput credits while idle and spends them under load
EBS Multi-Attach works only on io1/io2, and only with a cluster-aware filesystem
For AD-integrated SMB shares, use FSx for Windows or NetApp ONTAP
When the access pattern is unknown or shifting, use S3 Intelligent-Tiering
For cold archive that still needs millisecond access, use Glacier Instant Retrieval
S3 Transfer Acceleration needs a dot-free, DNS-compliant bucket name

High-Performing and Elastic Compute

Match the instance family to the workload's bottleneck resource
On the compute spectrum, pick the most-managed option that still meets your control needs
Exceeding any Lambda ceiling (15 min / 10 GB RAM / 10 GB /tmp) → move to Fargate, Batch, or EC2
Latency-sensitive Lambda → provisioned concurrency to eliminate cold starts
Choose the placement group by goal: Cluster for latency, Spread for isolation, Partition for distributed stores
Tightly-coupled HPC / multi-node ML training → add an Elastic Fabric Adapter
Sustained-load workload → don't run it on a burstable T-family instance
Java / Python / .NET Lambda with slow cold starts → enable SnapStart (free for Java)
Drain Spot capacity on the rebalance recommendation, not the 2-minute notice
In an ECS capacity provider strategy, base guarantees a minimum and weight splits the remainder
Tune ECS target-tracking with a short scale-out and long scale-in cooldown
Make EKS Cluster Autoscaler prefer one node group → priority expander
Run setup/cleanup during scaling → ASG lifecycle hooks

High-Performing Databases

Cheat sheet

Sharp facts the exam loves — scan these before test day.

Pick the database engine by access pattern, not by data model

Choose the database engine by how the data will be queried, not by the data model it superficially resembles, because the access pattern determines performance at scale. Use relational (RDS, Aurora) for transactions, joins, and ad-hoc queries; key-value (DynamoDB) for predictable single-item reads and writes at any scale; document (DocumentDB) for JSON-shaped data; and search (OpenSearch) for full-text. For specialized shapes the purpose-built engines win outright: Timestream for time-series and Neptune for graph traversal, because they index and store the data the way those queries actually walk it.

Trap Forcing every workload onto a relational engine just because the data looks like rows: full-text search, graph traversal, and at-scale key-value lookups all underperform on RDS compared to the engine built for them.

1 question tests this

Scale reads with replicas; scale writes with sharding

Reads and writes scale by different mechanisms, so match the fix to the bottleneck. Read replicas multiply read throughput by serving queries off copies: RDS gives up to 15 async Read Replicas per source, and Aurora Replicas go to 15 too but with typically <100 ms lag. Yet none absorb a single write. Because the single writer still caps write throughput, you scale writes by spreading them out: design DynamoDB partition keys for high cardinality and lean on adaptive capacity, or shard a relational workload across multiple Aurora clusters by tenant or key.

Trap Adding read replicas to fix a write bottleneck: they only absorb read traffic, so the single writer still caps how fast you can write.
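
Sharding a relational workload by tenant can be sketched as a stable hash over the tenant ID. The endpoint names are hypothetical, and simple modulo hashing is just one option; it reshuffles tenants whenever the shard count changes, which a lookup table would avoid.

```python
import hashlib

# Hypothetical cluster endpoints, one Aurora cluster per shard.
SHARDS = [
    "shard-0.example.internal",
    "shard-1.example.internal",
    "shard-2.example.internal",
]

def shard_for(tenant_id):
    """Map a tenant to a shard with a stable hash (same tenant, same shard)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Each tenant's writes then land on a single cluster's writer, so total write throughput grows with the number of clusters.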

1 question tests this

Default to Aurora unless you need stock RDS Oracle/SQL Server

Aurora is the default relational choice: wire-compatible with MySQL and PostgreSQL, it replaces RDS's single-volume storage with a distributed layer that keeps 6 copies across 3 AZs, typically fails over in under a minute, and offers Aurora Serverless v2 for on-demand scaling. Its storage grows automatically^[11] up to 256 TiB on current engine versions (128 TiB on older ones) with no manual resize, no downtime, and no provisioning, and because you pay only for what you use, dropping a table actually shrinks billed storage. Stay on stock RDS only when you specifically need the Oracle or SQL Server engine, which Aurora doesn't run.

Trap Assuming you must pre-provision Aurora storage and resize it later like classic RDS: Aurora scales storage on its own, so that capacity planning is wasted effort.

Need microsecond DynamoDB reads → put DAX in front

DAX^[15] is a managed in-memory cache purpose-built for DynamoDB, the answer when single-digit-millisecond reads aren't fast enough: it returns cached reads in microseconds while writes pass straight through DAX to the table. It's read-through and write-through/write-around, eventually consistent by default, runs inside your VPC, and speaks the DynamoDB API itself, so your application points at DAX and the code barely changes. That API compatibility sets it apart from a generic cache, where you'd write and maintain the caching logic yourself.

Trap Reaching for ElastiCache to cache DynamoDB: DAX is the DynamoDB-native cache that speaks the same API, whereas ElastiCache leaves you to manage the cache-aside pattern yourself.

7 questions test this

Aurora replicas lag <100 ms (often <10 ms) via shared storage

Aurora replicas stay far fresher than RDS async replicas because of how they replicate: they read directly from the shared distributed storage layer^[11] the writer already persisted to, rather than replaying a shipped binary log. Removing the log-shipping pipeline entirely puts replica lag typically under 100 ms and often under 10 ms. You can run up to 15 of them, the reader endpoint load-balances across the set, and on writer failure a replica is promoted in roughly a minute.

4 questions test this

DynamoDB throttling → fix the partition key cardinality first

DynamoDB throttling points first at partition key cardinality, because the table spreads its throughput across physical partitions by hashing that key. Adaptive capacity^[16] automatically shifts capacity toward busy partitions and absorbs mild skew, but it can't rescue a low-cardinality key that funnels traffic onto one hot partition. Fix it at the source by choosing high-cardinality keys (UUIDs, hashes, or composite keys like tenant#item) so reads and writes fan out evenly instead of concentrating.

Trap Adding a GSI to relieve a hot partition: a GSI lets you query other attributes but does nothing for hot-key writes on the base table.
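
A small simulation shows why cardinality matters. The hash below is only a stand-in, since DynamoDB's internal partition hash is not public, and the key names and the four-partition count are illustrative assumptions.

```python
from collections import Counter
import hashlib

def partition_of(key, partitions=4):
    # Stand-in for DynamoDB's internal hash, which is not public.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions

def spread(keys, partitions=4):
    """Count how many writes land on each simulated partition."""
    return Counter(partition_of(k, partitions) for k in keys)

# 1,000 writes to one hot key versus 1,000 distinct composite keys.
low = spread(["status#active"] * 1000)
high = spread([f"tenant{t}#item{i}" for t in range(10) for i in range(100)])
print(dict(low))
print(dict(high))
```

The single key sends every write to one simulated partition, while the composite keys fan out across several.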

ElastiCache: default to Redis unless you only need simple multi-threaded caching

Between the two ElastiCache engines, Redis^[17] is the default because it's far more than a cache: rich data structures (lists, sets, sorted sets, streams, geo, hyperloglog), plus pub/sub, persistence, replication, cluster-mode sharding, and transactions. Memcached is deliberately minimal (plain multi-threaded key-value with auto-discovery, no persistence, and no replication), which gives it an edge only on simple, horizontally scaled caching of opaque values. So pick Memcached when multi-threaded simple caching is the whole requirement, and Redis for anything needing those richer capabilities.

Trap Picking Memcached when the requirement calls for persistence, replication, or sorted-set/pub-sub features. It has none of those, so it simply can't satisfy them.

17 questions test this

Need DynamoDB change-data-capture → enable DynamoDB Streams

DynamoDB Streams^[18] is DynamoDB's built-in change-data-capture feed, the way to react to every insert, update, and delete: it captures each as an ordered, per-item sequence of stream records, with 4 view types (KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES) so you choose how much of the before/after image you receive. The usual consumer is Lambda, which reacts to each change and fans it out downstream, and the same stream mechanism powers Global Tables' cross-region replication under the hood. Records are retained for 24 hours, which shapes how the consumer must be designed.

Trap Counting on Streams as durable event storage: records expire after 24h, so a consumer that's down longer permanently loses those changes.
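
A minimal sketch of a Lambda consumer for a stream batch. It relies on the record fields `eventName` and `dynamodb` (with `Keys`, `NewImage`, `OldImage`) as they appear in DynamoDB Streams events; the handler name and the returned summary shape are choices made for this example.

```python
def handler(event, context):
    """Summarize each change in a DynamoDB Streams batch delivered to Lambda."""
    changes = []
    for record in event["Records"]:
        action = record["eventName"]  # INSERT, MODIFY, or REMOVE
        data = record["dynamodb"]
        changes.append({
            "action": action,
            "keys": data["Keys"],
            # Present only if the stream view type includes these images.
            "new": data.get("NewImage"),
            "old": data.get("OldImage"),
        })
    return changes
```

A real consumer would forward each change downstream and would need to keep up within the 24-hour retention window.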

Redshift joins: KEY co-locates large↔large; ALL replicates small dimensions

Redshift's distribution style decides where rows physically live across slices, and the goal is to avoid shuffling data over the network at join time. For a join between two large tables, set DISTSTYLE KEY on the shared join column in both so matching rows land on the same slice and Redshift skips redistribution. For a small, slowly-changing dimension (typically under a few million rows), use DISTSTYLE ALL to keep a full copy on every node, making joins on any column local without moving data. AUTO is the default and lets Redshift pick a style for you; EVEN spreads rows round-robin with no regard to join keys, so it's rarely best once you know your join patterns.

Trap Applying DISTSTYLE ALL to a large table: replicating a big table onto every node wastes storage and slows writes, so ALL is only for small dimension tables.
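
The rule of thumb can be written as a tiny chooser that emits the distribution clause for a CREATE TABLE statement. The row threshold is an assumption standing in for "a few million rows", and the function and column names are hypothetical.

```python
SMALL_DIMENSION_ROWS = 3_000_000  # assumption standing in for "a few million rows"

def choose_diststyle(row_count, join_column=None):
    """Pick a distribution clause from table size and join pattern."""
    if row_count <= SMALL_DIMENSION_ROWS:
        return "DISTSTYLE ALL"                           # small dimension: copy everywhere
    if join_column:
        return f"DISTSTYLE KEY DISTKEY ({join_column})"  # co-locate on the join column
    return "DISTSTYLE EVEN"                              # large table, no dominant join
```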

3 questions test this

To benefit from Aurora Auto Scaling, connect to the reader endpoint

Aurora Auto Scaling adds and removes read replicas with demand, but those replicas only receive traffic when clients connect through the reader endpoint rather than individual instance endpoints. The reader endpoint is a managed DNS name that round-robins connections across all available readers and automatically begins including a new replica once it passes health checks, so the capacity Auto Scaling provisions actually gets used. Point read traffic there and scaling is transparent; point it at fixed instance endpoints and the new replicas sit idle.

Trap Hard-coding instance-specific endpoints: the new replicas Auto Scaling spins up then receive no traffic, which defeats the whole point of scaling.
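
In application code this usually comes down to routing read-only work to the reader endpoint. The hostnames below are hypothetical placeholders, and the helper is a sketch rather than a driver feature.

```python
# Hypothetical endpoint names; real ones come from the cluster's configuration.
WRITER_ENDPOINT = "example.cluster-abc123.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "example.cluster-ro-abc123.us-east-1.rds.amazonaws.com"

def endpoint_for(read_only):
    """Send read-only work to the reader endpoint, everything else to the writer."""
    return READER_ENDPOINT if read_only else WRITER_ENDPOINT
```

Because the application never names an individual instance, replicas added by Auto Scaling can start taking connections without a config change.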

10 questions test this

"Oops, dropped a table" on Aurora MySQL → Backtrack rewinds in seconds

Aurora MySQL Backtrack^[12] rewinds the existing cluster to a point up to 72 hours in the past, typically within seconds to minutes and with only a brief interruption, by moving the cluster back through its change records rather than restoring a backup. That makes it the fast recovery path for a logical mistake like a bad delete or accidental table drop, where the alternative (restoring from a snapshot or point-in-time) means provisioning a whole new cluster and waiting on it. Backtrack is Aurora MySQL only.

Trap Treating Backtrack like point-in-time restore: PITR provisions a separate new cluster from backups, whereas Backtrack reverts the existing cluster in place (and only on Aurora MySQL).

Put RDS Proxy in front of RDS/Aurora for serverless connection storms

When a Lambda fleet's per-invocation connections threaten to exhaust an RDS or Aurora instance's limited connection slots, put RDS Proxy in front: a fully managed connection pool between clients and the database. It reuses a small set of warm connections, queues or sheds excess requests instead of letting them overwhelm the DB, and shortens failover by keeping client connections open while it reconnects to a healthy instance. It reads the database credentials from AWS Secrets Manager and lets clients authenticate to the proxy with IAM, so no DB password sits in the function. Reach for a bigger instance or read replicas instead only when the bottleneck is real query load, not connection churn.

Trap Assuming RDS Proxy always multiplexes: session state such as a statement larger than 16 KB or a temporary table 'pins' a connection to one client, which drops back to one-connection-per-client and erases the pooling benefit.

High-Performing and Scalable Networks

Front HTTP/HTTPS with CloudFront, but reach for Global Accelerator on non-HTTP traffic or fixed IPs
Use Transit Gateway once you have many VPCs, peering for a few, and Direct Connect Gateway when on-prem is in the mix
Pick the load balancer by protocol, IP-pinning, and routing intelligence
Any HTTPS endpoint can be a CloudFront origin, not just S3 or EC2
Transit Gateway is billed per attachment-hour plus per-GB processed
ALB target groups can hold instances, IPs, or Lambda functions
Route 53 latency routing uses a pre-measured edge-to-region table, not live pings
Use CloudFront Functions for lightweight edge logic and Lambda@Edge for the heavy stuff
Global Accelerator: traffic dials shift between regions, weights shift within one
API Gateway: HTTP API by default, REST for advanced features, WebSocket for bidirectional
Add Origin Shield to dedupe origin fetches across regional edge caches
Use Route 53 multivalue answer for DNS-layer load balancing across a small fleet
Turn on Evaluate Target Health to propagate ELB backend health into DNS failover
S3 Transfer Acceleration optimizes upload speed, never cost

High-Performing Data Ingestion and Transformation

Choose streaming vs batch by how fresh the consumer needs the data
Match the transformation tool to the data's shape and size
Firehose delivers as soon as either the time or the size buffer is hit
Size Kinesis Data Streams shards at 1 MB/s in and 2 MB/s out each
The Glue Data Catalog is the lake's single shared metastore
Need row-, column-, or tag-level lake permissions? Reach for Lake Formation
Size Glue jobs by worker type, where a DPU is 4 vCPU plus 16 GB
DMS replicates source to target, with optional ongoing CDC
Offline bulk transfer means Snowball Edge, since Snowmobile is retired
An IoT Core rule's error action fires when its primary action fails
IoT Core Basic Ingest skips the broker, so there's no per-message charge
A Firehose Lambda transform must return recordId, status, and base64 data

Design Cost-Optimized Architectures

Cost-Optimized Compute

Cheat sheet

Sharp facts the exam loves — scan these before test day.

Commit to reserved capacity when usage stays steady most of the year

When EC2, Fargate, or Lambda runs steadily for more than ~70% of a 1- or 3-year window, commit to capacity instead of paying on-demand: the reservation discount runs the whole term and reaches up to ~72% off on-demand at a 3-year all-upfront commitment, while on-demand carries no commitment but the highest per-hour rate. Use on-demand only for short-lived or unpredictable work where you'd rather pay the premium than over-commit. Reserved Instances tie the discount to a specific instance configuration, whereas Savings Plans commit you to a dollars-per-hour spend that applies across families and flexes as your fleet changes.

Trap Treating RIs and Savings Plans as interchangeable: RIs lock to a family and region, SPs trade some discount for cross-family flexibility, and an SP that overlaps your RIs just sits idle behind them.

14 questions test this
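
The "~70% of the window" rule of thumb can be checked with simple arithmetic. The sketch below is only an illustration: the hourly prices are hypothetical, and it assumes a commitment is billed for every hour of the term whether or not the workload runs. Under that assumption the break-even utilization is just the committed rate divided by the on-demand rate, so a 30% discount breaks even at roughly 70% utilization and deeper discounts break even sooner.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(on_demand_rate: float, committed_rate: float) -> float:
    """Fraction of the term the workload must run before committing wins.

    Committed capacity is paid for every hour; on-demand is paid only for
    the hours actually used, so the two costs meet at this ratio.
    """
    return committed_rate / on_demand_rate


def cheaper_option(on_demand_rate: float, committed_rate: float, utilization: float) -> str:
    """Compare one year of cost at a given utilization (0.0 to 1.0)."""
    on_demand_cost = on_demand_rate * HOURS_PER_YEAR * utilization
    committed_cost = committed_rate * HOURS_PER_YEAR  # billed even when idle
    return "commit" if committed_cost < on_demand_cost else "on-demand"


# Hypothetical prices: $0.10/h on-demand, $0.07/h effective committed rate.
print(round(break_even_utilization(0.10, 0.07), 2))
print(cheaper_option(0.10, 0.07, utilization=0.80))
print(cheaper_option(0.10, 0.07, utilization=0.50))
```

With these made-up prices, a workload running 80% of the year is cheaper committed, while one running 50% of the year is cheaper on-demand.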

Run interruption-tolerant work on EC2 Spot

EC2 Spot runs on spare capacity at up to 90% off On-Demand but can be reclaimed on a 2-minute notice, so it fits stateless, batch, big-data, and CI-fleet jobs where an interruption is survivable. Anything that can't tolerate being reclaimed mid-task belongs on on-demand or reserved capacity. To minimize interruptions, set the price-capacity-optimized allocation strategy (AWS's recommended choice; you must set it explicitly because the CLI/API default is still lowest-price), which draws from pools that are both cheap and deep.

Trap Running stateful workloads on Spot without checkpointing: a reclaim mid-task loses in-flight state, so persist or checkpoint first.

2 questions test this

Reach for Compute Optimizer when you need right-sizing recommendations

Compute Optimizer answers "is this resource the right size?" rather than "should I commit?": it analyzes actual utilization to recommend a smaller instance or a different family for EC2, EBS, Lambda, and ECS-on-Fargate. It needs enough data to judge (a Lambda function must see at least 50 invocations in 14 days to qualify) and deliberately stays out of the purchasing decision, leaving RI and Savings Plans recommendations to Trusted Advisor and Cost Explorer.

Trap Expecting Compute Optimizer to recommend RIs or Savings Plans: those purchase recommendations come from Trusted Advisor or Cost Explorer instead.

3 questions test this

RIs apply before Savings Plans in the billing engine

The billing engine applies discounts in a fixed order each hour: RIs first against matching family/region/AZ/OS usage, then Savings Plans against the remaining eligible usage (EC2 Instance SPs apply before the broader Compute SPs, and within that, highest-discount-percentage first), then on-demand for whatever is left. That order is why a Savings Plan overlapping your RIs earns nothing: the RIs have already consumed those hours. Buy SPs to cover usage your RIs don't reach, and add RIs only for instance types whose utilization is reliably steady.

Trap Buying a fresh Compute SP that overlaps usage your RIs already cover: the RIs consume those hours first and the new SP sits idle.

1 question tests this
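
The ordering above can be sketched as a small waterfall. This is a deliberately simplified model, not the real billing engine: it counts everything in instance-hours (actual Savings Plans are denominated in dollars per hour), ignores attribute matching, and the function name `apply_hour` is invented for illustration.

```python
def apply_hour(usage_hours, ri_hours, ec2_sp_hours, compute_sp_hours):
    """Split one hour's usage across discount layers in billing order.

    Layers are consumed top to bottom: RIs, then EC2 Instance Savings
    Plans, then Compute Savings Plans; whatever remains is on-demand.
    """
    remaining = usage_hours
    breakdown = {}
    layers = (
        ("reserved_instances", ri_hours),
        ("ec2_instance_sp", ec2_sp_hours),
        ("compute_sp", compute_sp_hours),
    )
    for layer, capacity in layers:
        covered = min(remaining, capacity)
        breakdown[layer] = covered
        breakdown[layer + "_unused"] = capacity - covered  # paid for, not applied
        remaining -= covered
    breakdown["on_demand"] = remaining
    return breakdown


# RIs already cover all 10 instance-hours, so the new Compute SP goes unused.
overlap = apply_hour(usage_hours=10, ri_hours=10, ec2_sp_hours=0, compute_sp_hours=4)
print(overlap["compute_sp"], overlap["compute_sp_unused"])

# With 16 instance-hours of usage the same SP is fully applied.
growth = apply_hour(usage_hours=16, ri_hours=10, ec2_sp_hours=0, compute_sp_hours=4)
print(growth["compute_sp"], growth["on_demand"])
```

The first call shows the trap in numbers: the plan is paid for but covers nothing because the RIs were applied first.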

Climb the flexibility ladder: Standard RI to Convertible RI to Compute SP

Flexibility forms a ladder (Standard RI to Convertible RI to Compute SP) that trades a little discount for room to change. A Standard RI is locked to its family and can only be sold on the RI Marketplace, not swapped; a Convertible RI can be exchanged for a different family, OS, or tenancy without selling; and a Compute SP is the most flexible, covering EC2, Fargate, and Lambda across any family and region, at a slightly lower maximum discount than a deep RI commitment. Pick the lowest rung that still tolerates how much your workload will shift.

Trap Assuming a Standard RI can be exchanged like a Convertible: it can only be sold on the Marketplace, not swapped to another family.

13 questions test this

Use price-capacity-optimized for long-running Spot fleets, not lowest-price

On a long-running Spot fleet the allocation strategy decides how often you get interrupted, so it matters more than shaving the last cent off the hourly rate. lowest-price^[10] minimizes cost but draws from the shallowest pools, which are reclaimed first; capacity-optimized launches from the deepest, lowest-interruption pools. price-capacity-optimized (the strategy AWS recommends, and the console default for new fleets/ASGs, though the CLI/API default is still lowest-price unless you override it) balances both, and capacity-optimized-prioritized honors your priority list when HPC or ML jobs need a specific instance order.

Trap Choosing lowest-price for a long-running Spot fleet: the cheapest pools are reclaimed first, so you trade a small saving for far more interruptions.

4 questions test this
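
Because the API default is not the recommended strategy, the override has to appear in the request itself. The sketch below builds the parameter dictionary for an Auto Scaling group with a mixed instances policy; it only constructs the request and makes no AWS call. The group name, launch template name, instance types, subnet IDs, and size limits are all hypothetical placeholders.

```python
def spot_asg_params(group_name, launch_template_name, instance_types, subnet_ids):
    """Build CreateAutoScalingGroup parameters for an all-Spot group.

    Several instance types are listed so the strategy has multiple
    capacity pools to choose from.
    """
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": 0,
        "MaxSize": 10,
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                # 0% on-demand above base means every instance is Spot.
                "OnDemandPercentageAboveBaseCapacity": 0,
                # Set explicitly instead of relying on the lowest-price default.
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }


params = spot_asg_params(
    group_name="batch-workers",            # hypothetical
    launch_template_name="batch-worker",   # hypothetical
    instance_types=["m5.large", "m5a.large", "m6i.large"],
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # hypothetical
)
print(params["MixedInstancesPolicy"]["InstancesDistribution"]["SpotAllocationStrategy"])
```

In a real deployment a dictionary like this would typically be passed to boto3's Auto Scaling client as `create_auto_scaling_group(**params)`; that call is left out here so the example runs without credentials.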

A low-traffic Lambda gets no Compute Optimizer recommendation

A quiet Lambda below 50 invocations in 14 days^[9] gets no Compute Optimizer recommendation at all, because the service needs enough recent activity to size a function and there simply isn't enough signal. When the exam says 'Compute Optimizer cannot generate a recommendation' for a quiet function, this lookback threshold is the root cause, not anything you've misconfigured.

Reach for Graviton when 'reduce cost' meets an unrestricted architecture

AWS Graviton (ARM64) instances^[11] are AWS-designed processors delivering up to 40% better price-performance than comparable x86, so when a question says 'reduce cost' and doesn't pin the architecture, Graviton is the move. The only catch is that the workload must run on ARM, which most managed services already handle: Graviton is supported across RDS, Aurora, ElastiCache, Lambda, and Fargate, so an unrestricted, managed workload has nothing holding it on x86.

Trap Reaching for Spot or Reserved Instances when the question only says 'reduce cost' on an unrestricted workload: those need interruption tolerance or a commitment, while Graviton just lowers the rate.

1 question tests this

Run fault-tolerant containers on Fargate Spot

Fargate Spot^[12] runs tasks on spare capacity at a deep discount but can reclaim them on the same 2-minute notice as EC2 Spot, so it fits fault-tolerant containerized work like CI builds, batch jobs, and dev/test. For anything that must stay up, keep it on regular FARGATE; the standard pattern is a mixed capacity provider holding a FARGATE baseline for steady load and bursting the interruption-tolerant overflow onto FARGATE_SPOT.

Trap Putting always-on production tasks entirely on Fargate Spot: keep the steady baseline on FARGATE and use FARGATE_SPOT only for interruption-tolerant burst.

3 questions test this
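
A mixed capacity provider strategy is easier to reason about with numbers. The sketch below assumes the usual `base` and `weight` semantics: the base count is placed first, then the remainder is split in proportion to the weights. ECS's exact rounding behaviour is not modeled, the helper `split_tasks` is hypothetical, and weights are assumed to be positive.

```python
def split_tasks(desired, strategy):
    """Estimate how many tasks land on each capacity provider.

    strategy is a list of dicts with 'capacityProvider', 'weight', and an
    optional 'base', mirroring the shape of an ECS capacity provider strategy.
    """
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired

    # Base tasks are satisfied first.
    for item in strategy:
        base = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += base
        remaining -= base

    # The rest is split by weight; rounding leftovers go to the first provider.
    total_weight = sum(item["weight"] for item in strategy)
    assigned = 0
    for item in strategy:
        share = remaining * item["weight"] // total_weight
        counts[item["capacityProvider"]] += share
        assigned += share
    counts[strategy[0]["capacityProvider"]] += remaining - assigned
    return counts


strategy = [
    {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 3},
]
print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

Here two tasks are pinned to FARGATE as the always-on baseline, and the remaining eight split 1:3, so most of the burst runs on the discounted provider.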

Get RI / SP / right-sizing tips for free from Trusted Advisor

Trusted Advisor^[13] surfaces cost-optimization advice (an 'underutilized EC2 instances' check for right-sizing, plus 'RI optimization' and 'Savings Plans recommendations' once it has ~30 days of usage to learn from), and a core subset of those checks is free. Only that subset comes with Basic support; the full cost-optimization check set unlocks with a Business or Enterprise plan.

Trap Assuming every Trusted Advisor check is free: only a core subset is, and the full cost-optimization check set requires a Business or Enterprise support plan.

Replace at-risk Spot before the 2-minute notice with Capacity Rebalancing

Capacity Rebalancing lets an Auto Scaling group act on EC2's rebalance recommendation signal (which arrives ahead of the hard 2-minute interruption notice), launching a replacement proactively while the at-risk instance is still healthy. Waiting on the 2-minute notice alone leaves no headroom because capacity may already be gone. Pair rebalancing with lifecycle hooks so in-flight requests finish draining before the old instance terminates.

Trap Relying only on the 2-minute notice for graceful drain: by then capacity may already be gone, and the earlier rebalance recommendation is what gives you headroom.

3 questions test this
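
The difference between the two signals can be sketched as a decision function. The example assumes the instance has already fetched the two instance-metadata documents (rebalance recommendation and Spot instance action) and passes them in as JSON strings, or `None` when absent. The document shapes used here, a `noticeTime` field and an `action`/`time` pair, are written from memory of the EC2 documentation and should be treated as assumptions; the function and return labels are invented for illustration.

```python
import json
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a timestamp such as 2024-01-01T00:02:00Z into an aware datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def decide(rebalance_doc, interruption_doc, now):
    """Return (action, seconds_left) for the current signals.

    An interruption notice means the clock is already running, so drain
    immediately. A rebalance recommendation alone means there is still
    time to bring up a replacement first.
    """
    if interruption_doc:
        notice = json.loads(interruption_doc)
        seconds_left = (parse_utc(notice["time"]) - now).total_seconds()
        return ("drain_now", seconds_left)
    if rebalance_doc:
        return ("launch_replacement", None)
    return ("ok", None)


now = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
early = '{"noticeTime": "2024-01-01T00:00:00Z"}'
late = '{"action": "terminate", "time": "2024-01-01T00:02:00Z"}'

print(decide(early, None, now))
print(decide(early, late, now))
```

With Capacity Rebalancing enabled, the Auto Scaling group handles the replacement launch itself; a check like this is only relevant for application-level draining.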

Use Enhanced Infrastructure Metrics for cyclical monthly or quarterly workloads

Enhanced Infrastructure Metrics is the paid Compute Optimizer add-on that extends the lookback to up to 93 days, capturing a full cyclical period and sizing for the peak, which makes it the right choice whenever billing or processing follows a monthly or quarterly rhythm. The default 14-day lookback of CloudWatch data can otherwise sample only a quiet trough of a workload that spikes monthly or quarterly and then recommend an instance that's too small.

Trap Trusting a default-lookback recommendation for a cyclical workload: 14 days can sample only a trough and recommend a too-small instance.

2 questions test this
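
A toy series makes the lookback problem concrete. The data below is synthetic: 93 days of daily peak CPU that sits at 20% except for one spike to 90% every 30 days. The helper `observed_peak` is hypothetical and simply takes the maximum over the window; Compute Optimizer's actual analysis is more involved.

```python
def observed_peak(daily_cpu, lookback_days):
    """Highest daily CPU value visible inside the lookback window."""
    window = daily_cpu[-lookback_days:]
    return max(window)


# Synthetic workload: a monthly spike on day 15 of each 30-day cycle.
daily_cpu = [90 if day % 30 == 15 else 20 for day in range(93)]

print(observed_peak(daily_cpu, 14))  # 20: the window falls between spikes
print(observed_peak(daily_cpu, 93))  # 90: the longer window catches the spike
```

Sizing from the 14-day view would target a 20% peak and undersize the instance for the monthly job, which is exactly the failure the longer lookback avoids.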

Configure org-wide Compute Optimizer settings from the management account

Recommendation preferences set in the management account (approved instance families, required CPU headroom, lookback window) propagate to every member account in an AWS Organization, so you tune the policy once instead of per account. Note the coverage edges: Compute Optimizer does not rightsize Spot Instances at all, but it does cover RDS for MySQL and PostgreSQL (with Performance Insights enabled) alongside EC2, Lambda, EBS, and ECS.

Trap Expecting Spot Instance rightsizing from Compute Optimizer: it produces no recommendations for Spot, so optimize those via allocation strategy and instance flexibility instead.

3 questions test this

Use a Zonal RI, not Regional, when you need guaranteed capacity in a specific AZ

A Zonal Reserved Instance, scoped to one Availability Zone, gives both the billing discount and a capacity reservation matching the instance attributes, so instances still launch even during a peak-demand crunch in that AZ: choose it whenever guaranteed capacity in a known AZ is the requirement. A Regional Reserved Instance, by contrast, applies its billing discount flexibly across all AZs in the region but reserves no capacity, so it saves money without guaranteeing a launch slot, making it the better default when guaranteed capacity isn't required.

Trap Assuming a Regional RI guarantees a launch slot: it only discounts billing, so you need a Zonal RI (or On-Demand Capacity Reservation) to reserve actual capacity.

4 questions test this

Cost-Optimized Storage

Pick a storage class by access pattern, not data volume
Known schedule → Lifecycle; unknown access pattern → Intelligent-Tiering
Before tiering down, check minimum object size and minimum storage duration
IA bills a 128 KB minimum per object: tiny files cost MORE in IA
Cold tiers charge a minimum storage duration even if you delete early
Choose the Glacier retrieval tier by RTO, not just price
Intelligent-Tiering trades a small monitoring fee for zero retrieval fees
EFS lifecycle mirrors S3 tiering with EFS-specific tier names
Archive snapshots older than 90 days you rarely restore; keep recent ones standard
Use S3 Storage Lens to find storage waste across accounts
EFS throughput: Bursting shrinks as data tiers down; Archive needs Elastic
Pick a Storage Gateway mode by where primary data must live
Athena bills per TB scanned: partition and use columnar formats
Glacier transitions can fire at 0 days, IA only after 30, and minimum-storage charges still apply

Cost-Optimized Databases

Commit capacity for predictable load, pay per use for spiky load
Put a cache in front of a saturating read-heavy DB instead of upsizing it
DynamoDB: provisioned for steady high use, on-demand for spiky
Aurora Serverless v2 only wins below ~50% utilization
ElastiCache Reserved Nodes use the same RI payment options as EC2
Stop an idle RDS instance to drop compute cost (max 7 days)
Lazy-load by default, use write-through when a miss is catastrophic
Switch Aurora to I/O-Optimized once I/O exceeds ~25% of cost
Use ElastiCache Serverless for unpredictable cache traffic (Valkey 33% cheaper)
Use ElastiCache R6gd data tiering when ≤20% of the dataset is hot (60%+ savings)
Customer-managed KMS keys cost ~$1/key/month, so consolidate where you can
Aurora Serverless v2 scales 0–256 ACUs in 0.5-ACU steps

Cost-Optimized Network

Attack data egress first when optimizing network cost
Front cacheable content with CloudFront and co-locate chatty tiers
Use a Gateway Endpoint (free) for S3/DynamoDB, an Interface Endpoint (paid) for everything else
Route AWS-bound traffic through endpoints and keep NAT for genuine internet
Cross-AZ traffic is billed at both ends, so design AZ-affinity for chatty tiers
Serve high-volume cacheable egress via CloudFront, not straight from EC2
Pick a cheaper CloudFront price class when your audience is confined to a few regions
For high-volume on-prem transfer, Direct Connect beats internet egress per GB
