Design Resilient Architectures
Decouple, replicate, and match RTO/RPO
Resilience comes down to two moves: decouple components so one failure can't cascade, and replicate state across failure domains so losing an AZ or a Region doesn't take the whole system down with it. Two budgets decide how aggressively you replicate: RTO (Recovery Time Objective), how long recovery is allowed to take, and RPO (Recovery Point Objective), how much data loss you can tolerate, measured as a window of time. Tighter budgets cost more, so the exam answer is the cheapest pattern that still meets the stated RTO and RPO.
AZ failure is the baseline; Region failure is a deliberate DR choice
Treat the failure of a single Availability Zone as your default blast radius. Spread instances across AZs with an Auto Scaling group behind a load balancer, enable Multi-AZ on managed services such as RDS (a standby instance in a second AZ) and ElastiCache (replicas with automatic failover), and lean on the services that are Regional by nature, such as S3, DynamoDB, and EFS with its Regional storage classes. Surviving a whole-Region failure is a separate, more expensive decision that depends on which DR strategy you can justify: most workloads need only Multi-AZ, and only some need multi-Region.
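To make the single-AZ blast radius concrete, here is a minimal capacity sketch. It assumes instances are spread evenly across AZs and that the surviving AZs must carry the full load on their own; the function name `instances_per_az` and its parameters are hypothetical, not part of any AWS API.

```python
import math


def instances_per_az(required_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the survivors still meet demand.

    Assumes an even spread across AZs (an illustrative simplification).
    """
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    # The surviving AZs alone must provide `required_instances`.
    return math.ceil(required_instances / surviving)


# Needing 6 instances at peak across 3 AZs means 3 per AZ (9 in total),
# so losing any one AZ still leaves 6 running.
per_az = instances_per_az(required_instances=6, az_count=3)
```

Under these assumptions, more AZs reduce the overprovisioning needed: two AZs require double the peak capacity, while three require only one and a half times.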
Loose coupling absorbs both failure and load
Put a queue or event bus between producers and consumers and each side can fail, scale, and deploy on its own — and the buffer soaks up traffic spikes the downstream couldn't handle in real time. Reach for SQS for work queues (with a dead-letter queue to catch poison messages), SNS for fan-out, EventBridge for event routing, and Step Functions for stateful orchestration. The anti-pattern the exam contrasts this against is a chain of synchronous request/response calls.
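The dead-letter behaviour can be sketched with an in-memory stand-in. This is not the SQS API: `drain`, `handler`, and `MAX_RECEIVE_COUNT` are hypothetical names, and the threshold of 3 is an assumed value chosen only to illustrate how a redrive limit isolates a poison message while the rest of the work continues.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # assumed redrive threshold, for illustration only


def drain(messages, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Process every message; return (processed, dead_letter) lists."""
    processed, dead_letter = [], []
    # Track how many times each message has been received.
    pending = deque((msg, 0) for msg in messages)
    while pending:
        msg, receives = pending.popleft()
        receives += 1
        try:
            handler(msg)
            processed.append(msg)
        except Exception:
            if receives >= max_receive_count:
                # Stop retrying: park the message for later inspection.
                dead_letter.append(msg)
            else:
                # Return the message to the queue for another attempt.
                pending.append((msg, receives))
    return processed, dead_letter


def handler(msg):
    """Hypothetical consumer that cannot process one specific message."""
    if msg == "poison":
        raise ValueError("cannot parse message")


processed, dead = drain(["a", "poison", "b"], handler)
```

In this sketch the healthy messages are processed despite the failing one, which ends up in the dead-letter list after three attempts instead of blocking the consumer indefinitely.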
Disaster-recovery strategies by RTO / RPO / cost
| Strategy | RTO | RPO | Relative cost | What runs in the recovery Region |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Nothing — restore from S3 backups, AMIs, and snapshots after the event |
| Pilot Light | Tens of minutes | Minutes | $$ | Core data replicated live (e.g. the database); servers built but switched off until failover |
| Warm Standby | Minutes | Seconds to minutes | $$$ | A scaled-down but always-running copy of the full stack; scale up on failover |
| Multi-Site Active/Active | Near zero | Near zero | $$$$ | Full stack live in both Regions serving traffic (e.g. Route 53 + Aurora Global Database or DynamoDB global tables) |
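
The "cheapest pattern that still meets the stated RTO and RPO" rule can be sketched as a simple lookup. The minute values below are assumptions picked to stand in for the table's ranges ("hours", "tens of minutes", and so on), not published figures, and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window
    cost_rank: int      # 1 = "$" ... 4 = "$$$$"


# Illustrative numbers only: the table gives ranges, not exact values.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 10, 2),
    Strategy("Warm Standby", 10, 5, 3),
    Strategy("Multi-Site Active/Active", 1, 1, 4),
]


def cheapest_strategy(rto_minutes: float,
                      rpo_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both budgets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_minutes and s.rpo_minutes <= rpo_minutes
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


# A requirement of "recover within 2 hours, lose at most 15 minutes of data"
# rules out Backup & Restore and selects Pilot Light under these assumptions.
choice = cheapest_strategy(rto_minutes=120, rpo_minutes=15)
```

The design mirrors how the exam questions tend to be framed: filter out every strategy that misses either budget, then take the cheapest of what remains.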