Domain 2 of 4

Design Resilient Architectures

Domain · 26% of the SAA-C03 exam

Decouple, replicate, and match RTO/RPO

Resilience comes down to two moves: decouple components so one failure can't cascade, and replicate state across failure domains so losing an AZ or a Region doesn't take the whole system down with it. Two budgets decide how aggressively you replicate — RTO, how long recovery is allowed to take, and RPO, how much data loss you can tolerate. Tighter budgets cost more, so the exam answer is the cheapest pattern that still meets the stated RTO and RPO.
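To make the two budgets concrete, here is a minimal sketch that measures one incident against them. The function name `meets_objectives`, the timestamps, and the one-hour targets are all invented for illustration; the only idea taken from the text is that RPO bounds data loss and RTO bounds recovery time.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, failure_time, service_restored, rto, rpo):
    """Compare one incident against the RTO and RPO budgets.

    data_loss: everything written after the last recovery point is gone.
    downtime:  how long the service was unavailable after the failure.
    """
    data_loss = failure_time - last_recovery_point
    downtime = service_restored - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: last backup at 02:00, failure at 05:30, back up at 06:15.
result = meets_objectives(
    last_recovery_point=datetime(2024, 1, 1, 2, 0),
    failure_time=datetime(2024, 1, 1, 5, 30),
    service_restored=datetime(2024, 1, 1, 6, 15),
    rto=timedelta(hours=1),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])
```

In this made-up incident the 45 minutes of downtime fits a one-hour RTO, but the 3.5 hours since the last backup blows a one-hour RPO, so the fix is more frequent recovery points rather than faster failover.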

AZ failure is the baseline; Region failure is a deliberate DR choice

Treat the failure of a single Availability Zone as your default blast radius. Spread instances across AZs with an Auto Scaling group behind a load balancer, turn on the Multi-AZ option of managed services such as RDS (a synchronous standby in another AZ), ElastiCache (replicas with automatic failover), and EFS (Regional file systems rather than One Zone), and lean on the services that are Regional by nature, such as S3 and DynamoDB. Surviving the loss of a whole Region is a separate and more expensive decision, driven by the RTO and RPO the business can justify paying for: most workloads need only Multi-AZ, and only some need multi-Region.
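Treating one AZ as the blast radius has a sizing consequence that is easy to work out. The sketch below is illustrative arithmetic only: the helper names and the example AZ names are assumptions, and it assumes an even spread of identical instances, which is roughly what an Auto Scaling group aims for.

```python
import math


def instances_per_az(required_capacity: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing any one AZ
    still leaves at least `required_capacity` instances."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    if required_capacity < 1:
        raise ValueError("required_capacity must be positive")
    # The surviving (az_count - 1) zones must carry the full load.
    return math.ceil(required_capacity / (az_count - 1))


def fleet_plan(required_capacity: int, az_names: list[str]) -> dict[str, int]:
    """Even spread of instances across the given AZs."""
    per_az = instances_per_az(required_capacity, len(az_names))
    return {az: per_az for az in az_names}


# Hypothetical workload that needs 6 instances to serve peak traffic.
plan = fleet_plan(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
surviving = sum(plan.values()) - max(plan.values())
print(plan, "capacity after losing one AZ:", surviving)
```

With three AZs the plan runs 9 instances to guarantee 6; with two AZs it would need 12. More AZs means less idle headroom for the same guarantee.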

Loose coupling absorbs both failure and load

Put a queue or event bus between producers and consumers and each side can fail, scale, and deploy on its own — and the buffer soaks up traffic spikes the downstream couldn't handle in real time. Reach for SQS for work queues (with a dead-letter queue to catch poison messages), SNS for fan-out, EventBridge for event routing, and Step Functions for stateful orchestration. The anti-pattern the exam contrasts this against is a chain of synchronous request/response calls.
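The queue-plus-dead-letter-queue behaviour can be sketched in memory without any AWS calls. This is a simulation, not the SQS API: `drain`, `handler`, and the retry threshold of 3 are assumptions made for the example, with the threshold standing in for the `maxReceiveCount` setting of an SQS redrive policy.

```python
from collections import deque

# Assumed threshold, playing the role of maxReceiveCount in an SQS redrive policy.
MAX_RECEIVE_COUNT = 3


def drain(queue, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Consume every message; return (processed, dead_letter).

    A message whose handler keeps failing is retried until it has been
    received `max_receive_count` times, then moved to the dead-letter list
    so it cannot block the rest of the queue.
    """
    processed, dead_letter = [], []
    receive_counts = {}  # message body doubles as its ID in this sketch
    while queue:
        message = queue.popleft()
        receive_counts[message] = receive_counts.get(message, 0) + 1
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if receive_counts[message] >= max_receive_count:
                dead_letter.append(message)
            else:
                queue.append(message)  # becomes visible again for a retry
    return processed, dead_letter


def handler(message):
    """Hypothetical consumer that cannot process one malformed message."""
    if message == "poison":
        raise ValueError("cannot parse message")


work_queue = deque(["order-1", "poison", "order-2"])
processed, dead_letter = drain(work_queue, handler)
print("processed:", processed)
print("dead-letter queue:", dead_letter)
```

The producer only ever appends to the queue, so it never sees the consumer's failures, and the poison message ends up parked for inspection instead of being retried forever.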

Disaster-recovery strategies by RTO / RPO / cost

| Strategy | RTO | RPO | Relative cost | What runs in the recovery Region |
| --- | --- | --- | --- | --- |
| Backup & Restore | Hours | Hours | $ | Nothing; restore from S3 backups, AMIs, and snapshots after the event |
| Pilot Light | Tens of minutes | Minutes | $$ | Core data replicated live (e.g. the database); servers built but switched off until failover |
| Warm Standby | Minutes | Seconds to minutes | $$$ | A scaled-down but always-running copy of the full stack; scale up on failover |
| Multi-Site Active/Active | Near zero | Near zero | $$$$ | Full stack live in both Regions serving traffic (e.g. Route 53 + Aurora Global Database or DynamoDB global tables) |
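The "cheapest pattern that still meets the stated RTO and RPO" rule can be written as a small lookup. The minute values below are illustrative stand-ins for the qualitative entries in the table ("hours", "tens of minutes", and so on), not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed typical recovery time
    rpo_minutes: float  # assumed typical data-loss window
    cost_rank: int      # number of $ signs in the table


# Illustrative numbers only; real figures depend on the workload.
STRATEGIES = [
    Strategy("Backup & Restore", 240, 240, 1),
    Strategy("Pilot Light", 30, 5, 2),
    Strategy("Warm Standby", 5, 1, 3),
    Strategy("Multi-Site Active/Active", 0.1, 0.1, 4),
]


def cheapest_strategy(required_rto_minutes, required_rpo_minutes):
    """Return the lowest-cost strategy that meets both budgets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= required_rto_minutes
        and s.rpo_minutes <= required_rpo_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.cost_rank)


# Hypothetical requirement: recover within 1 hour, lose at most 15 minutes of data.
choice = cheapest_strategy(required_rto_minutes=60, required_rpo_minutes=15)
print(choice.name)
```

Under these assumed numbers a one-hour RTO with a 15-minute RPO lands on Pilot Light: Backup & Restore is cheaper but too slow, and Warm Standby meets the budgets at a higher cost than necessary.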

Subtopics in this domain