Design Resilient Architectures — SAA-C03 Study Guide

Resilience is two moves: decouple and replicate, sized to RTO and RPO

A resilient architecture comes down to two moves: decouple components so one failure or traffic spike can't cascade, and replicate state across failure domains so losing an Availability Zone (AZ) or a Region doesn't take the system down with it. How aggressively you replicate is set by two budgets: RTO, how long recovery may take, and RPO, how much data you can afford to lose. The exam-correct answer is almost always the cheapest pattern that still meets the stated RTO and RPO, so the classic trap is an option that buys far more durability (and cost) than the requirement actually asks for.
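The "cheapest pattern that still meets the budgets" rule can be sketched as a small selection function. This is a minimal illustration, not an official sizing method: the `Pattern` type, the `cheapest_pattern` helper, and every RTO, RPO, and cost figure below are assumptions chosen only to show the ordering of the common DR strategies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pattern:
    name: str
    rto_minutes: float   # assumed worst-case time to recover
    rpo_minutes: float   # assumed worst-case window of lost data
    relative_cost: int   # assumed cost rank, 1 = cheapest


# Illustrative figures only; real values depend on the workload.
PATTERNS = [
    Pattern("backup and restore", 24 * 60, 24 * 60, 1),
    Pattern("pilot light", 60, 15, 2),
    Pattern("warm standby", 10, 1, 3),
    Pattern("multi-site active/active", 1, 0, 4),
]


def cheapest_pattern(
    rto_minutes: float,
    rpo_minutes: float,
    patterns: Sequence[Pattern] = PATTERNS,
) -> Optional[Pattern]:
    """Return the lowest-cost pattern that meets both budgets, or None."""
    candidates = [
        p for p in patterns
        if p.rto_minutes <= rto_minutes and p.rpo_minutes <= rpo_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.relative_cost)


if __name__ == "__main__":
    choice = cheapest_pattern(rto_minutes=240, rpo_minutes=60)
    print(choice.name if choice else "no pattern meets the budgets")
```

With a 4-hour RTO and a 1-hour RPO, the sketch rejects backup and restore as too slow and skips warm standby and active/active as more than the requirement asks for, which is the same reasoning the exam rewards.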

The domain unfolds in two steps: decouple the architecture, then make it fault-tolerant

Read this page as a map, then follow the two subtopics in order. Scalable and Loosely Coupled Architectures is the decouple move: it puts a queue, topic, event bus, stream, or orchestrated workflow between tiers (SQS, SNS, EventBridge, Kinesis, Step Functions) so each side can fail, scale, and deploy on its own, and the buffer soaks up spikes the downstream couldn't handle in real time. Highly Available and Fault-Tolerant Architectures is the replicate move: Auto Scaling across AZs behind a health-checked load balancer, Multi-AZ managed services, Route 53 routing, and the four DR strategies for surviving the loss of a whole Region (backup and restore, pilot light, warm standby, and multi-site active/active). Each subtopic carries the mechanisms, worked examples, and traps; this overview just shows how they fit together.
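How a buffer between tiers absorbs a spike can be sketched with a toy simulation. It uses an in-memory `deque` as a stand-in for a managed queue such as SQS; the `simulate` function, the tick model, and the traffic numbers are all hypothetical and exist only to make the behavior visible.

```python
from collections import deque
from typing import List, Sequence, Tuple


def simulate(arrivals: Sequence[int], consumer_rate: int) -> Tuple[List[int], int]:
    """Feed bursts of messages through a buffer drained at a fixed rate.

    arrivals: number of messages the producer sends in each tick.
    consumer_rate: maximum messages the consumer can process per tick.
    Returns the queue depth after each tick and the total processed.
    """
    buffer = deque()
    depths = []
    processed = 0
    next_id = 0
    for count in arrivals:
        # The producer enqueues and moves on; it never waits for the consumer.
        for _ in range(count):
            buffer.append(next_id)
            next_id += 1
        # The consumer drains at its own pace, up to its per-tick capacity.
        for _ in range(min(consumer_rate, len(buffer))):
            buffer.popleft()
            processed += 1
        depths.append(len(buffer))
    return depths, processed


if __name__ == "__main__":
    depths, processed = simulate([2, 10, 2, 0, 0, 0], consumer_rate=3)
    print(depths, processed)
```

In this run the burst of 10 messages exceeds the consumer's rate of 3 per tick, so the queue depth rises and then drains over the following ticks, and every message is eventually processed. Without the buffer, the same burst would have to be rejected or would overload the downstream tier.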

AZ failure is the baseline; Region failure is a deliberate, more expensive choice

When in doubt, treat a single AZ failing as your default blast radius and design for it everywhere: spread instances across AZs with an Auto Scaling group behind a load balancer, turn on the Multi-AZ option for managed services (for example, an RDS standby in a second AZ), and lean on services that are Regional by design, such as S3 and DynamoDB, which by default replicate data across multiple AZs for you. Surviving a whole-Region failure is a separate, costlier decision driven by an explicit disaster-recovery strategy; most workloads only need Multi-AZ, and only some justify multi-Region. So unless a question explicitly asks to survive a regional outage or sets a strict cross-Region RTO/RPO, the Multi-AZ answer is usually the right one.
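Designing for the loss of one AZ has a simple capacity consequence, sketched below as arithmetic. The `instances_per_az` helper is hypothetical and assumes instances are spread evenly and that exactly one AZ fails at a time; it is not an AWS API.

```python
import math


def instances_per_az(required: int, az_count: int) -> int:
    """Instances to run in each AZ so that `required` survive one AZ failing.

    Assumes an even spread across AZs and a single-AZ failure.
    """
    if az_count < 2:
        raise ValueError("need at least two AZs to tolerate an AZ failure")
    # After one AZ is lost, the remaining (az_count - 1) AZs carry the load.
    return math.ceil(required / (az_count - 1))


if __name__ == "__main__":
    for azs in (2, 3):
        per_az = instances_per_az(required=6, az_count=azs)
        print(f"{azs} AZs: {per_az} per AZ, {per_az * azs} total")
```

For a workload that needs 6 instances, two AZs require 12 in total while three AZs require 9, which is one reason spreading across three AZs is often the cheaper way to absorb a single-AZ failure.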

The two moves of resilience, and where each is covered

Move	Answers the question	Key services	Drill into
Decouple	How do we stop one failure (or spike) from cascading?	SQS, SNS, EventBridge, Kinesis, Step Functions	Scalable and Loosely Coupled Architectures
Replicate	How do we survive losing an AZ or a Region, within budget?	Multi-AZ (ASG + ELB, RDS Multi-AZ), Route 53, cross-Region DR (Aurora Global Database, DynamoDB global tables)	Highly Available and Fault-Tolerant Architectures

Subtopics

Scalable and Loosely Coupled Architectures
Highly Available and Fault-Tolerant Architectures
