Highly Available and Fault-Tolerant Architectures

Auto Scaling: target tracking vs step vs scheduled, when each applies

Every high-availability design on this page is the same three-move trick: keep a redundant copy, watch a health signal, shift work to the copy when the signal fails. Two numbers grade how well you played it: RTO (recovery time objective, how quickly you must be back after a failure) and RPO (recovery point objective, how much data, measured in time, you may lose). The copy's nature sets both: synchronous copies give near-zero RPO, asynchronous copies lag, and the speed of the shift sets your RTO. That one trick scales from a self-healing stateless tier through Route 53 traffic steering and RDS, Aurora, and S3 replication up to the four disaster-recovery (DR) strategies the final section assembles.

Auto Scaling Groups (ASGs) are that model at the compute layer: they replace failed instances and scale capacity. AWS groups scaling policies into three dynamic types (target tracking, step, simple) plus scheduled and predictive, five in all, each for a different signal pattern. The decision guide below maps each signal pattern to its policy.

Target tracking scaling^[12] (the default for new ASGs):

Pick a metric + target value: e.g. 'average CPU = 50%'.
ASG auto-creates the two CloudWatch alarms (scale-out + scale-in) and holds the target.
Built-in metrics: ASGAverageCPUUtilization, ASGAverageNetworkIn/Out, ALBRequestCountPerTarget.
Custom metrics: any CloudWatch metric (e.g. queue depth).
Use for: most scaling needs. Simplest; auto-tunes.
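As a rough illustration of the 'metric + target value' idea above, the sketch below builds the request parameters for a target tracking policy that holds average CPU at 50%. The keys follow the PutScalingPolicy parameter names as boto3 exposes them; the group name web-asg and the policy name are hypothetical, and nothing is sent to AWS.

```python
def target_tracking_policy(asg_name: str, target_cpu: float) -> dict:
    """Build PutScalingPolicy parameters for a CPU target tracking policy."""
    return {
        "AutoScalingGroupName": asg_name,      # hypothetical group name
        "PolicyName": "cpu-target-tracking",   # hypothetical policy name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }


params = target_tracking_policy("web-asg", 50.0)
# With credentials configured, this dict could be passed along as
# boto3.client("autoscaling").put_scaling_policy(**params)
print(params["TargetTrackingConfiguration"]["TargetValue"])
```

Only the metric and the target appear in the request; the scale-out and scale-in alarms are created by the service, which is why this policy type needs so little tuning.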

Step scaling:

Define explicit steps: 'if CPU 70-80, add 1 instance; if CPU 80-90, add 2; if CPU > 90, add 4'.
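An instance warmup (e.g. 60 seconds) keeps a newly launched instance out of the aggregated metrics until it is ready, which prevents over-reacting between actions; step scaling uses warmup rather than the cooldown that simple scaling relies on.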
Use for: precise control of scaling magnitude relative to load severity.
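The example steps above can be modelled as a small lookup. This is a sketch of the idea only, not the service's implementation: the boundary convention (lower bound inclusive, upper bound exclusive) is an assumption, so a reading of exactly 90 falls into the top step here.

```python
# (lower bound inclusive, upper bound exclusive, instances to add)
STEPS = [
    (70.0, 80.0, 1),
    (80.0, 90.0, 2),
    (90.0, float("inf"), 4),
]


def instances_to_add(cpu_percent: float) -> int:
    """Return how many instances the example step policy would add."""
    for lower, upper, add in STEPS:
        if lower <= cpu_percent < upper:
            return add
    return 0  # below the first step: no scale-out


for cpu in (60, 75, 85, 95):
    print(cpu, instances_to_add(cpu))
```

The point of the table is that the response grows with the severity of the breach, which a single-target policy cannot express.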

Simple scaling (legacy): one alarm → one scaling action; superseded by step scaling, avoid in new designs.

Scheduled scaling:

Set capacity changes by clock (UTC): 'at 09:00 weekdays scale to 20 instances; at 18:00 scale to 5'.
Use for: predictable load (business hours, scheduled batch).
Stack with target tracking: scheduled sets baseline; target tracking handles deviations.
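A minimal sketch of the business-hours example, assuming both actions recur on weekdays only. The Recurrence strings use the cron format that scheduled actions accept, the action names are hypothetical, and baseline_capacity is an illustrative helper rather than anything the service provides.

```python
from datetime import datetime, timezone

# Cron fields: minute hour day-of-month month day-of-week (1-5 = Mon-Fri)
SCHEDULED_ACTIONS = [
    {"ScheduledActionName": "business-hours-up", "Recurrence": "0 9 * * 1-5", "DesiredCapacity": 20},
    {"ScheduledActionName": "evening-down", "Recurrence": "0 18 * * 1-5", "DesiredCapacity": 5},
]


def baseline_capacity(now_utc: datetime) -> int:
    """Capacity the two scheduled actions leave in place at a given UTC time."""
    is_weekday = now_utc.weekday() < 5           # Monday=0 ... Friday=4
    in_business_hours = 9 <= now_utc.hour < 18   # 09:00 up, 18:00 down
    return 20 if (is_weekday and in_business_hours) else 5


print(baseline_capacity(datetime(2024, 1, 8, 10, 0, tzinfo=timezone.utc)))
```

Target tracking then adjusts around whatever baseline the schedule has set.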

Predictive scaling: ML forecast from up to 14 days of metric history; scales BEFORE load hits; best for cyclical loads with 24h+ patterns.

Instance refresh: rolling replacement of the ASG's instances (deploy a new AMI without downtime), with a configurable minimum healthy percentage, instance warmup, and skip matching, which leaves instances already on the new template in place rather than replacing them.

ELB vs EC2 health check type:

HealthCheckType=EC2^[15]: only EC2 instance-level health (hardware fail).
HealthCheckType=ELB: includes load balancer health checks (app-level failures).
ALWAYS use ELB in production: catches app crashes that EC2 health misses.

Termination policy (which instance goes when scaling in): default balances across AZs, then OldestLaunchTemplate, then ClosestToNextInstanceHour; customizable. OldestInstance is common for canary-style rolling deploys.

Lifecycle hooks^[18]:

Pause instance launch (autoscaling:EC2_INSTANCE_LAUNCHING) while custom setup runs, or pause instance terminate (autoscaling:EC2_INSTANCE_TERMINATING) to drain connections / snapshot logs before termination.
Wait defaults to one hour (heartbeat timeout); global maximum 48 hours or 100× the heartbeat timeout, whichever is smaller.
Hook completes with CompleteLifecycleAction API.
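The wait ceiling stated above is plain arithmetic, sketched here with a hypothetical helper name:

```python
MAX_WAIT_SECONDS = 48 * 60 * 60        # 48-hour global ceiling
DEFAULT_HEARTBEAT_SECONDS = 60 * 60    # one-hour default heartbeat timeout


def max_hook_wait(heartbeat_timeout: int = DEFAULT_HEARTBEAT_SECONDS) -> int:
    """Longest an instance can stay in the wait state, in seconds."""
    return min(MAX_WAIT_SECONDS, 100 * heartbeat_timeout)


print(max_hook_wait())      # default heartbeat: the 48-hour ceiling applies
print(max_hook_wait(300))   # 5-minute heartbeat: the 100x rule applies
```

With the default one-hour heartbeat the 48-hour cap is the binding limit; with short heartbeats the 100× rule is.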

Warm pools: pre-initialized instances kept stopped (faster launch than from scratch); for apps with long warmup (large container pulls, JIT warmup, cache priming).

Choosing among the five scaling policy types by signal pattern; target tracking is the default and covers most needs.

Route 53 routing policies catalog: 8 policies, 8 use cases

A single exam item hands you one DNS requirement — send EU users to the Frankfurt endpoint, ramp 10% of traffic onto a new version, or fail over to a standby region when the primary dies — and asks which routing policy delivers it; picking right is a lookup from intent to policy. Route 53 supports eight routing policies^[9]; the exam tests pattern recognition. Match the scenario to the policy. Health checks are the page's health signal at DNS: a record whose check fails stops being returned.

1. Simple routing:

One record name → one or more values; with multiple values Route 53 returns all in random order and the client picks.
Use for: static / single-target records (cname.example.com → fixed IP).

2. Weighted routing:

Multiple records, each with a weight (0-255); traffic distributes proportionally.
Use for: blue/green or canary deployment (90% to v1, 10% to v2; ramp up).
Weight=0 disables a record without deleting it.
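Proportional distribution means each record's share is its weight divided by the sum of all weights. A small sketch, with hypothetical record names and without modelling the case where every weight is 0:

```python
def traffic_shares(weights: dict) -> dict:
    """Map each record name to its expected fraction of traffic."""
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are 0; this sketch does not model that case")
    return {name: weight / total for name, weight in weights.items()}


shares = traffic_shares({"v1": 90, "v2": 10, "v3-disabled": 0})
print(shares)
```

Ramping a canary is then a matter of editing two integers, for example 90/10 to 50/50 to 0/100.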

3. Failover routing:

One primary + one secondary record.
Route 53 health check on the primary; if it fails, the secondary serves (DNS failover^[16]).
Use for: active/passive DR (primary us-east-1, secondary us-west-2).
Critical: the health check goes on the PRIMARY record and is required. Its failure is the only signal Route 53 reacts to; without it, Route 53 always serves the primary, even when it's down.
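The trap in the last bullet can be stated as a tiny decision function. This is a toy model of the behaviour described above, not Route 53 code; None stands for 'no health check attached to the primary'.

```python
from typing import Optional


def failover_answer(primary_check_healthy: Optional[bool]) -> str:
    """Which record the toy resolver returns for a failover pair."""
    if primary_check_healthy is None:
        return "primary"  # no signal to react to, so the primary is always served
    return "primary" if primary_check_healthy else "secondary"


print(failover_answer(True))
print(failover_answer(False))
print(failover_answer(None))
```

The third call is the misconfiguration: the primary may be down, but with no check attached the answer never changes.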

4. Latency-based routing:

One record per region, pointing at that region's endpoint.
Route 53 picks the region with lowest measured latency from the user's edge location.
Latency table is pre-measured (not real-time per-request).
Use for: global apps minimising latency.

5. Geolocation routing:

Route by user's location; records specify continent / country / US state.
Use for: compliance ('EU users must hit EU region'), content localisation, geo-blocking.
Always include a Default record for locations you didn't enumerate.

6. Geoproximity routing:

Routes by geographic distance, with a bias knob that expands or shrinks the region a resource draws traffic from (geoproximity routing^[19]).
Works on standard records; the bias-visualisation maps require Route 53 Traffic Flow.
Use for: shifting traffic during region failover, gradual migration between regions.

7. IP-based routing (IP-based routing^[20]):

Routes on the client's source IP using a CIDR-block-to-endpoint mapping table you upload (user-IP-to-endpoint mappings).
Use for: fine-tuning routing by known client networks / ISPs (e.g. steer specific IP ranges to specific endpoints for performance or network-cost optimisation).

8. Multi-value answer routing:

Returns up to 8 healthy records.
Client-side load balancing (the client picks one).
Each record can have a health check; only healthy records returned.
Use for: poor-man's load balancing where you don't want an actual ELB.

Combining policies: Route 53 Traffic Flow nests policies (e.g. geolocation → latency within each geo) in a visual, version-controlled editor.

Health checks:

Default check: every 30 seconds; consider healthy after 3 successes / unhealthy after 3 failures^[14].
Fast check: every 10 seconds (paid).
Types: endpoint (HTTP/HTTPS/TCP), CloudWatch alarm, calculated (AND/OR of other checks, e.g. 'healthy if at least 2 of 3 endpoints are healthy', for multi-endpoint apps).
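Two quick sketches of the numbers above, under simplifying assumptions: detection time is approximated as interval × failure threshold (ignoring DNS TTL and how individual checkers are scheduled), and a calculated check is modelled as 'at least N children healthy'. Both function names are hypothetical.

```python
def detection_seconds(interval: int = 30, failure_threshold: int = 3) -> int:
    """Rough time for consecutive failed checks to mark an endpoint unhealthy."""
    return interval * failure_threshold


def calculated_healthy(child_statuses: list, min_healthy: int) -> bool:
    """Calculated check: healthy when enough child checks are healthy."""
    return sum(1 for status in child_statuses if status) >= min_healthy


print(detection_seconds())                            # default check
print(detection_seconds(interval=10))                 # fast check
print(calculated_healthy([True, True, False], 2))     # 2 of 3 healthy
```

Cached DNS answers add their TTL on top of the detection time, so the interval is only part of the failover budget.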

Common SAA scenarios:

'DR failover Region' → Failover routing + health checks.
'Blue/green deployment' → Weighted routing, ramp from 0 → 100%.
'Global app, lowest latency' → Latency-based routing.
'EU compliance' → Geolocation routing.
'No ELB but load balance' → Multi-value answer routing.

Multi-AZ vs Read Replicas vs Aurora Replicas: the three RDS scaling primitives

Your RDS instance is straining — reads are piling up on the primary, and a single-AZ outage would take the whole data tier down with it — and you have to pick what to add: a standby, a read replica, or a move to Aurora. RDS and Aurora give the data tier three replication primitives; one model sorts them: a synchronous standby is for availability (it takes over on failure, serving nothing meanwhile); an asynchronous read replica is for read scale (it serves reads but lags); Aurora collapses the trade-off (its replicas share one storage volume, serving reads AND standing ready as failover targets). The exam tests which to apply. The figure below contrasts the three primitives side by side.

Multi-AZ deployment^[1] (HA), the classic Multi-AZ DB instance flavor:

Synchronous standby in a different AZ (same region).
Standby is NOT readable. It exists only for failover. (The cluster flavor below changes exactly this.)
Failover triggers: planned (instance upgrade, OS patching) or unplanned (hardware fail, AZ outage).
Failover time: typically 60-120 seconds. DNS endpoint updates to point at the standby.
Clients with cached DNS see ~60-120 s of errors; app should reconnect on connection failure.
Cost: ~2× single-AZ (you pay for the standby).
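'App should reconnect on connection failure' usually means a retry loop with backoff around whatever opens the connection. The sketch below is generic: connect is any caller-supplied function that opens a fresh connection (so the endpoint is resolved again each time) and raises ConnectionError on failure; the default attempt count and delays are assumptions, not tuned values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, backing off between failures."""
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff, capped so waits do not grow unbounded
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

The sleep parameter exists so the delay can be swapped out in tests; in an application the default time.sleep is used.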

Use for: any production workload with an availability SLA, the default 'is this production?' answer.

Multi-AZ DB cluster (newer, RDS MySQL/PostgreSQL), same idea, two deltas:

One writer + two standbys across 3 AZs, and those standbys ARE readable.
Faster failover (~35 seconds typical).
Better than the instance flavor for reads + faster failover.

Read Replicas^[13] (read scaling):

Asynchronous replication from primary to N replicas (up to 15 per source for most engines).
Replicas serve read-only traffic; offload from primary.
Same AZ, different AZ, or different REGION (cross-region replication).
Cross-region replicas can be promoted to standalone read-write (used in DR).
Replication lag varies: seconds to minutes under high write load.

Use for: read-heavy workloads, geographic read distribution, DR (cross-region replica = recovery target).

Cascading replicas: a read replica can have its OWN read replicas (RDS for MariaDB, MySQL, and certain PostgreSQL versions; not Oracle, SQL Server, or Db2), which gives multi-region read distribution without burdening the primary.

Aurora Replicas^[6] (Aurora-specific):

Up to 15 Aurora Replicas per cluster, all sharing the same storage layer.
Sub-100ms replica lag typical (often < 10 ms); shared storage means no log shipping.
Auto-promotion: if writer fails, Aurora promotes a replica → ~1 minute failover.
Reader endpoint load-balances across all Aurora Replicas.
Cross-region: a separate feature, Aurora Global Database, covered next.

Use for: any Aurora workload. Replicas serve reads AND act as warm failover targets. Beats RDS Multi-AZ + Read Replicas combined for most use cases.

Decision pattern:

'Production HA, MySQL / Oracle / SQL Server' → Multi-AZ deployment.
'Read-heavy workload, RDS' → Multi-AZ + Read Replicas (HA + read scale).
'Aurora workload, any use case' → Aurora Replicas (cover both HA and reads).
'Cross-region DR with < 1 s RPO' → Aurora Global Database (next section; NOT Read Replicas, they're async).
'Read scaling for distant users (latency)' → Cross-region Read Replicas or Aurora Global Database.

Common SAA traps:

'Multi-AZ for read scaling' → NO; the standby isn't readable (except Multi-AZ DB cluster mode, above).
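'Promote read replica during a regional failure' → YES for cross-region replicas; a same-region replica goes down with the region, so promoting it only helps when the failure is confined to the primary instance or its AZ.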
'Aurora Global Database is the same as Aurora Replicas' → NO; Global Database adds cross-region replication.

The three topologies: a synchronous standby for availability, asynchronous replicas for read scale, and Aurora's shared-storage cluster that does both.

Aurora Global Database: setup, failover, cross-region read scale

Aurora Global Database replicates an Aurora cluster across regions over dedicated replication infrastructure^[2]. This section owns the comparison the previous section deferred (versus cross-region read replicas), plus architecture, setup, and the two failover paths.

vs cross-region Read Replicas:

Aurora Global Database: < 1 second cross-region RPO typical; managed failover; secondary-region clusters are full Aurora clusters (auto-scale storage, replicas, etc.).
Read Replicas: log-shipping based; lag can be seconds-minutes; manual failover by promote.

Architecture:

1 primary region (1 writer + up to 15 readers; each secondary you attach reduces that allowance by one).
Up to 10 secondary regions; a secondary cluster has no writer, so it holds up to 16 readers.
Replication is one-way: primary → secondaries, which can't accept writes (read-only; see write forwarding below).

Use cases: cross-region DR (RPO < 1 s, RTO ~1 min via managed promotion); low-latency local reads for global users; cross-region read scaling (analytics on a secondary doesn't touch primary write performance).

Setup steps:

Create an Aurora cluster in the primary region (supported Aurora MySQL / Aurora PostgreSQL versions).
'Add AWS Region'^[21] from the cluster's Actions menu; Aurora provisions the secondary on the same engine version.
Encrypted primary → encrypted secondary; you specify the primary region as the encryption source, and each region uses its own KMS key (no multi-Region key required).
Replication begins automatically.

The two failover paths are the part the exam leans on, so read the diagram alongside this split. Switchover (previously called 'managed planned failover'), zero data loss:

Use for: failover testing, primary-region maintenance, regional rotation.
Aurora waits until the chosen secondary has replicated all changes, briefly pauses writes, promotes it, and repoints replication. RPO is 0.
~ 1 minute total downtime.

Unplanned failover (cross-region), two methods^[22]:

Managed failover (recommended): fail over to a chosen secondary, accepting data loss; when the old primary region recovers, Aurora re-attaches it as a secondary automatically.
Manual detach-and-promote (when managed failover isn't possible, e.g. incompatible engine versions): remove the secondary from the global database (detaching stops replication and promotes it to a standalone read-write cluster), then rebuild the topology.
Either way: pair with Route 53 health-check failover routing; data loss possible (whatever hadn't replicated when the primary failed).

Write forwarding (newer feature):

Applications in the secondary region can issue writes; Aurora forwards them transparently to the primary.
For read-mostly apps with occasional writes; adds cross-region latency on writes, so not for high-write workloads.

Cost:

Each secondary region is a full Aurora cluster (compute + storage + I/O); replication bandwidth ≈ standard cross-region data transfer rates.
Generally 2-3× a single-region cluster.

Limitations:

Managed switchover / failover requires matching major and minor engine versions across regions.
10 secondary regions max (16 reader instances each).
Parallel Query and Aurora Serverless v2 have specific compatibility; check current docs.

Common exam scenarios:

'Cross-region DR with < 1 min failover' → Aurora Global Database.
'Global read scaling, no manual promote needed' → Aurora Global Database (secondaries serve reads).
'Need writes in multiple regions simultaneously' → NOT Aurora Global Database; use DynamoDB Global Tables (multi-active writes).

Aurora Global Database failover paths, modeled on the AWS switchover-and-failover guide: planned switchover is zero-loss; unplanned splits managed vs manual.

S3 Cross-Region Replication: filters, versioning, RTC, batch backfill

S3 Cross-Region Replication (CRR)^[4] asynchronously copies new objects from a source bucket to a destination bucket in a different region; Same-Region Replication (SRR) does the same within a region. This section covers prerequisites, what replicates, the SLA-backed variant, and backfill.

Prerequisites:

Versioning must be enabled on BOTH source and destination buckets.
IAM role with permissions to read from source + write to destination + (if encrypted) use KMS keys in both regions.
Source and destination can be in same or different accounts.

Replication rules (define what to copy):

Filter by prefix (prefix: "customer-data/"), by tag (tag: { team=production }), or both combined (prefix AND tag).
Multiple rules per bucket; priority order resolves conflicts.
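A toy model of rule selection, to make the filter and priority bullets concrete. It assumes that when several rules match, the one with the higher priority number wins; the rule names are hypothetical, and real replication configurations carry more fields (destination, status, and so on) that are left out here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    name: str
    priority: int
    prefix: str = ""                          # empty prefix matches every key
    tags: dict = field(default_factory=dict)  # all listed tags must be present

    def matches(self, key: str, object_tags: dict) -> bool:
        has_prefix = key.startswith(self.prefix)
        has_tags = all(object_tags.get(k) == v for k, v in self.tags.items())
        return has_prefix and has_tags


def winning_rule(rules: list, key: str, object_tags: dict) -> Optional[Rule]:
    """Pick the matching rule with the highest priority number (assumed)."""
    candidates = [r for r in rules if r.matches(key, object_tags)]
    return max(candidates, key=lambda r: r.priority, default=None)


rules = [
    Rule("all-customer-data", priority=1, prefix="customer-data/"),
    Rule("prod-only", priority=2, prefix="customer-data/", tags={"team": "production"}),
]
print(winning_rule(rules, "customer-data/a.csv", {"team": "production"}))
```

An object under logs/ matches neither rule and is simply not replicated.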

The diagram below walks an object through that gauntlet. What replicates:

New objects after rule creation.
Object metadata, tags, ACLs.
Delete markers (configurable; default off).
Replica modification sync (replicas back to source) — opt-in feature for two-way scenarios.

What does NOT replicate:

Objects encrypted with SSE-C (customer-supplied keys).
Objects created before replication was enabled (backfill: S3 Batch Replication, below).
Permanent deletes (only delete markers can replicate, and only if configured).

Replication Time Control (RTC) — SLA-backed replication:

Without RTC: best-effort; typically seconds-minutes, no guarantee.
With RTC: AWS guarantees 99.99% of objects replicate within 15 minutes^[23].
RTC adds CloudWatch metrics (replication latency, missed-SLA count).
Cost: $0.015 per GB replicated (in addition to standard CRR data transfer + per-request charges).
Use for: compliance scenarios requiring guaranteed replication time, business-critical pipelines.

S3 Batch Replication (backfill existing objects):

Backfills objects that existed BEFORE rule creation, and retries objects that previously failed replication.
Uses S3 Batch Operations under the hood; one-time per backfill request, pay per object processed.

Cross-account CRR:

Destination bucket policy must allow the source account's replication role.
Object ownership: the source-account role writes the replica, so by default the source account owns it; use bucket-owner-full-control or AccessControlTranslation so the destination account owns the replicas.

KMS-encrypted CRR:

Source-side KMS key: replication role needs kms:Decrypt.
Destination-side KMS key: replication role needs kms:GenerateDataKey / kms:Encrypt.
Multi-region KMS keys (since 2021) give the same key ID in both regions, simplifying key policies — but S3 replication still decrypts and re-encrypts data keys under the destination region's key^[24], even for related multi-Region keys, so the role needs the permissions above on both sides.

Common patterns:

Data must exist in 2 regions (compliance): CRR with RTC.
DR for static content (web assets, downloads): CRR to alternate region; failover via Route 53.
Data must NOT leave a region: no CRR — use SRR to a separate AZ-resilient bucket if needed.
Analytics on a copy: CRR to a dedicated bucket near the analytics tooling.

Cost: cross-region data transfer (~$0.02/GB depending on region pair); PUT requests and storage on the destination at standard pricing; optional RTC ($0.015/GB).
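A back-of-the-envelope calculator using the per-GB figures quoted above. The transfer rate is the illustrative ~$0.02/GB from this section, not a price quote, and request and destination storage charges are left out.

```python
TRANSFER_PER_GB = 0.02   # illustrative cross-region rate; varies by region pair
RTC_PER_GB = 0.015       # Replication Time Control surcharge


def replication_cost(gb_replicated: float, rtc: bool = False) -> float:
    """Estimated per-GB replication charges in dollars, rounded to cents."""
    rate = TRANSFER_PER_GB + (RTC_PER_GB if rtc else 0.0)
    return round(gb_replicated * rate, 2)


print(replication_cost(1000))             # transfer only
print(replication_cost(1000, rtc=True))   # transfer plus RTC
```

At these rates RTC adds roughly three quarters of the transfer cost again, which is why it is usually reserved for data with a stated replication deadline.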

Walking one object through S3 CRR eligibility: versioning gate, pre-existing needs batch backfill, SSE-C is skipped, a matching new object replicates.

DR strategy worked examples: cost + RTO + RPO trade-off across all four patterns

The four canonical DR strategies^[7] assemble everything above into one cost / recovery-time spectrum. Each maps to an RTO + RPO band (the objectives defined at the top of the page); the exam answer follows from the stated constraints. In increasing cost and decreasing RTO/RPO: backup-and-restore, pilot light, warm standby, multi-site active/active (also called multi-region active-active). Dollar figures below are illustrative orders of magnitude, not price quotes.

Worked Example 1 (Backup-and-restore, RTO: hours-days, RPO: hours):

Scenario: small business, daily backups, can tolerate ~24h of data loss + 4-8h to recover.

Architecture:

AWS Backup^[8] runs daily snapshots of RDS, EBS, DynamoDB, EFS.
Snapshots copied to alternate region nightly.
Infrastructure as code (CloudFormation / CDK) ready to deploy.
On disaster: deploy infrastructure in DR region, restore latest snapshots, repoint DNS.

Monthly cost: ~$50/month (snapshot storage + cross-region copy). RTO: 4-8 hours (deploy infra + restore data). RPO: 24 hours (last backup).

Worked Example 2 (Pilot light, RTO: 10s of minutes, RPO: minutes):

Scenario: e-commerce platform, ~30 min RTO budget, ~1 min RPO budget (last few orders may be lost).

Architecture:

Primary region: full stack (ALB + EC2 ASG + RDS Multi-AZ + ElastiCache + S3).
DR region: RDS Read Replica (continuously replicated), S3 CRR enabled, EC2 launch templates ready but no instances running, ALB pre-provisioned with empty target group.
On disaster: promote RDS replica to standalone, launch EC2 ASG (size from launch template), register with ALB, update Route 53 failover record.

Monthly cost: ~$500/month (read replica + cross-region S3 + idle infrastructure). RTO: 15-30 minutes (mostly EC2 launch + warmup). RPO: 1 minute (read replica lag).

Worked Example 3 (Warm standby, RTO: minutes, RPO: seconds):

Scenario: SaaS application, < 5 min RTO budget, < 1 min RPO budget.

Architecture:

Primary region: full stack at full capacity.
DR region: full stack at REDUCED capacity, all components running (minimum ASG = 2 instances vs 20 in primary).
RDS / Aurora Multi-AZ in DR region with cross-region replica.
DNS uses Route 53 weighted routing: 100% primary, 0% DR (or active health checks).
On disaster: scale DR ASG up (Auto Scaling triggers), shift traffic via Route 53.

Monthly cost: ~$2,000/month (~50% of primary, scaled down). RTO: 2-5 minutes (Auto Scaling scale-up time). RPO: < 30 seconds.

Worked Example 4 (Multi-site active/active, RTO: seconds, RPO: near-zero):

Scenario: global FinTech app, must handle region failure with no perceived downtime.

Architecture:

Primary AND secondary regions both serve traffic at full capacity (e.g. 50% to each via Route 53 latency-based routing).
Data layer: Aurora Global Database (< 1s RPO, managed failover; previous section) OR DynamoDB Global Tables (multi-active writes with last-writer-wins).
Caches: ElastiCache per region (locally scoped; sync via DynamoDB Streams or Aurora replication).
Stateless tiers replicated in both regions.
On disaster: Route 53 health checks shift traffic to the healthy region automatically.

Monthly cost: ~$5,000+/month (~2× single-region; full stack everywhere). RTO: seconds (Route 53 health check + DNS TTL). RPO: near-zero (continuous replication).

Picking the right strategy:

Scenario	Strategy
Cost matters more than RTO; some downtime acceptable	Backup-and-restore
Moderate cost; 15-30 min RTO acceptable	Pilot light
Low RTO required; SLA matters	Warm standby
Mission-critical; near-zero RTO	Multi-site active/active

Common exam phrasing → answer:

'lowest cost DR' → Backup-and-restore.
'pre-provisioned DB, on-demand compute' → Pilot light.
'scaled-down but running stack' → Warm standby.
'no perceived downtime, full stack everywhere' → Multi-site active/active.
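The selection rule (take the cheapest strategy that still meets both objectives) can be sketched with the worked-example figures. The numbers are the illustrative upper bounds from the examples above; the 60-second RTO and 1-second RPO for active/active are assumptions standing in for 'seconds' and 'near-zero'.

```python
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rto_seconds: int
    rpo_seconds: int
    monthly_cost: int  # illustrative order of magnitude, in dollars


STRATEGIES = [
    Strategy("Backup-and-restore", 8 * 3600, 24 * 3600, 50),
    Strategy("Pilot light", 30 * 60, 60, 500),
    Strategy("Warm standby", 5 * 60, 30, 2000),
    Strategy("Multi-site active/active", 60, 1, 5000),  # assumed stand-in values
]


def cheapest_strategy(rto_budget: int, rpo_budget: int) -> Optional[Strategy]:
    """Cheapest strategy whose RTO and RPO both fit the budgets (seconds)."""
    fits = [
        s for s in STRATEGIES
        if s.rto_seconds <= rto_budget and s.rpo_seconds <= rpo_budget
    ]
    return min(fits, key=lambda s: s.monthly_cost, default=None)


print(cheapest_strategy(rto_budget=30 * 60, rpo_budget=60))
```

A None result means the stated objectives are tighter than anything in the table, which in exam terms signals that the requirement itself needs a second look.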

The four DR strategies as a spectrum, modeled on the AWS Disaster Recovery whitepaper cited in this section: cost rises while RTO and RPO tighten.

DR strategies compared

Strategy	RTO	RPO	Cost	Pattern
Backup-and-restore	Hours to days	Hours (last backup)	$	Backups in S3 / AWS Backup. Restore = build infra + restore data on failover.
Pilot light	10s of minutes	Minutes (DB replicated continuously)	$$	DB replicated to DR region; compute templates ready but stopped; start on failover.
Warm standby	Minutes	Seconds to minutes	$$$	Downsized but running stack in DR region; scale up on failover.
Multi-site active/active	Seconds	Near-zero	$$$$	Full stack running in 2+ regions; Route 53 latency / multi-value routing; data via Aurora Global / DynamoDB Global Tables.

Decision tree

Cheat sheet

Sharp facts the exam loves — give these one last read before exam day.

Multi-AZ keeps you up inside a region; Multi-Region survives losing the whole region

Multi-AZ delivers in-region high availability: you spread stateless tiers across at least 2 AZs behind a load balancer, and RDS Multi-AZ adds a synchronous standby in another AZ that takes over in 60-120 s. Together that survives any single AZ going down without losing data. Multi-Region is a far bigger commitment, bringing real latency, cost, and replication complexity, so reach for it only when losing an entire region is actually in your threat model. Most availability requirements are satisfied by Multi-AZ; Multi-Region is for regional disaster recovery, not everyday uptime.

Trap: Treating Multi-AZ as disaster recovery. It covers you when one AZ goes down, not when the whole region does.

Pick the cheapest DR strategy that still meets the RTO/RPO you were given

The four canonical DR strategies trade cost against recovery speed in lockstep: backup-and-restore recovers in hours and costs the least, pilot light in tens of minutes, warm standby in minutes, and multi-site active/active in near-zero time at the highest cost. Because faster recovery always costs more, pin down the RTO (how fast you must be back) and RPO (how much data you can lose) the requirement actually demands, then take the least expensive option that clears both. Overshooting the requirement just burns money on recovery speed nobody asked for.

Trap: Reaching for active-active just because its recovery numbers are best: it's also the priciest, so it's the wrong call whenever a looser RTO/RPO would do.

The default stateless tier is an ASG across at least 2 AZs behind an ELB with health checks

The default stateless tier is an Auto Scaling Group spread over at least 2 AZs behind an ALB or NLB, with health checks on the target group, so an unhealthy instance is drained and automatically replaced and an AZ failure just shifts load to the survivors. The ASG's min/max/desired values set the capacity bounds, and target-tracking or step scaling policies move desired capacity in response to demand. This combination (redundancy across AZs plus self-healing plus elasticity) is the baseline almost every stateless web tier should start from.

Trap: Placing the Auto Scaling Group in a single AZ: spreading instances within one zone gives no protection when that AZ fails; the baseline spans at least two.

Match the Route 53 routing policy to what you're actually trying to do

Route 53 offers eight routing policies (IP-based routing was added in 2022), each encoding a different intent, so match the policy to what you're trying to do. Simple answers from a single record with no routing logic; Weighted splits traffic by proportion for A/B or gradual rollout; Latency sends users to the region with the lowest response time; Failover does active/passive via a health check; Geolocation routes by the user's location, whether continent, country, or US state (compliance or localized content); Geoproximity routes by geographic distance with a bias knob to expand or shrink a region's pull; IP-based routes by the client's source CIDR block; and Multi-value does DNS-level load balancing across healthy answers. Reading the requirement for its underlying intent tells you which to pick.

Trap: Using Geolocation to send people to the lowest-latency region: Geolocation routes by where the user is, while Latency routing is the one that optimizes for response time.

RDS Multi-AZ failover flips DNS over to the standby in about 60-120 s

RDS Multi-AZ automatic failover^[1] repoints the database's DNS endpoint at the standby, so a client that cached the old DNS entry sees roughly 60-120 s of errors and must reconnect once its connection drops. Design clients to reconnect rather than assume the address is stable. Multi-AZ is purely an availability feature, not a scaling one: the standby is not readable while it stands by. Offloading read traffic, as opposed to surviving an AZ failure, is the job of a separate feature, Read Replicas^[13].

Trap: Pointing read traffic at the Multi-AZ standby to take load off the primary: the standby serves no reads, so you need a Read Replica for that.

For cross-region RPO under a second, use Aurora Global Database

When you need cross-region RPO under a second, reach for Aurora Global Database^[2]: it replicates from a primary region to secondaries over a dedicated, purpose-built network, which is why RPO is typically under a second and failover RTO is around a minute via managed promotion. Secondary regions are read-only, good for serving low-latency reads close to users, but any one can be promoted to a writer when you need to fail the whole workload over. That managed, sub-second cross-region replication is a large step up from hand-wiring cross-region read replicas yourself.

Trap: Hand-wiring cross-region read replicas for sub-second cross-region RPO: they lag more and lack the one-click managed promotion Aurora Global Database is built for.

Route 53 health checks default to a 30 s interval and 3 failures before unhealthy

Route 53 health checks default to a 30 s interval^[14] with both the healthy and unhealthy thresholds at 3, so by default an endpoint must fail three consecutive checks before Route 53 marks it unhealthy and stops returning it. To fail over faster you can drop to a 10 s interval, which costs extra, and 'calculated' health checks combine several endpoint checks with AND/OR logic for a more nuanced healthy condition. Tune the interval and thresholds to balance failover speed against false positives from a single transient blip.

Use the ELB health check type, not EC2, so you catch app-layer failures

ASG HealthCheckType=EC2^[15] only replaces instances the EC2 host itself flags as unhealthy. That catches hardware and hypervisor faults but not a wedged application, since a crashed app on a healthy host still passes the EC2 check. Switch to HealthCheckType=ELB and the ASG also replaces any instance that fails the load balancer's health check, catching application-layer failures like a hung process or a 500-returning endpoint. That's the setting you want in production, where 'the box is up but the app is dead' is the failure that actually hurts.

Trap: Leaving the ASG on the default EC2 health check expecting it to recycle instances whose app has crashed. A hung process on healthy hardware still reports healthy and never gets replaced.

S3 CRR replicates new objects asynchronously, filtered by prefix or tag

S3 Cross-Region Replication copies new objects^[4] from the source bucket to a destination bucket in another region asynchronously, usually within seconds, and you can scope it to just the objects matching a prefix or tag instead of the whole bucket. Versioning must be enabled on both buckets because replication tracks object versions. The key limitation is that CRR is forward-looking only: it acts on objects written after you turn it on, so anything that already existed needs a one-time S3 Batch Replication job to backfill.

Trap: Assuming objects that were already there before you enabled CRR get copied: replication only touches objects written afterward, so the older ones need S3 Batch Replication.

Centralize backups across services and accounts with AWS Backup

AWS Backup centralizes backups across services and accounts: you define backup plans and resource selections^[8] once and apply them across many services (RDS, DynamoDB, EFS, EBS, FSx, Storage Gateway and more) including cross-region and cross-account copies for DR, rather than configuring backups service by service. That central plane makes consistent retention and scheduling manageable at scale. To prove compliance or guarantee immutability, Backup Audit Manager reports on policy adherence and Backup Vault Lock enforces write-once retention that even an admin can't shorten.

Route 53 failover only serves the secondary if the PRIMARY has a health check

Route 53 active/passive failover routing^[16] watches a health check on the primary record: while the primary is healthy Route 53 returns it, and only when that check fails does it serve the secondary. The health check therefore has to live on the primary record, because that's the signal that tells Route 53 it's time to fail away. Leave the primary without a health check and Route 53 has nothing to react to, so it keeps returning the primary forever even when the primary is down.

Trap Attaching the health check to the secondary record and expecting failover: it has to be on the primary, or Route 53 never fails away from it.
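
The failover decision can be modeled in a few lines. This is a simplified sketch, not Route 53's actual implementation: the function name and arguments are hypothetical, and it models only the point made above, that a primary without a health check always looks healthy.

```python
def failover_answer(primary_has_check: bool, primary_up: bool) -> str:
    """Return which record a failover pair would serve in this simplified model."""
    # Without a health check there is no signal, so the primary always looks healthy.
    observed_healthy = primary_up if primary_has_check else True
    return "primary" if observed_healthy else "secondary"
```

With the check attached, failover_answer(True, False) yields the secondary; without it, failover_answer(False, False) keeps returning the primary even though it is down.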

Set the ASG health check grace period to at least your app's startup time

The ASG health check grace period tells Auto Scaling how long to wait after a new instance reaches InService before it begins evaluating that instance's health, which exists because an app needs time to boot before it can pass checks. Set it to at least how long your application takes to start: if the grace period is shorter, the ELB health checks fail while the app is still coming up, the ASG concludes the instance is bad and terminates it, and you fall into a loop of launching and killing instances that never get a chance to go healthy. Sizing the grace period to real startup time breaks that churn.
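
A toy timeline makes the churn concrete. The sketch below assumes health checks fire at a fixed interval and that the first check counted after the grace period decides the instance's fate; real ELB checks also apply healthy and unhealthy thresholds, which this model ignores. All names and numbers are illustrative.

```python
def launch_outcome(startup_s: int, grace_s: int,
                   check_interval_s: int = 30, horizon_s: int = 600) -> str:
    """Classify a new instance as 'healthy' or 'terminated' in this toy model."""
    for t in range(check_interval_s, horizon_s + 1, check_interval_s):
        if t <= grace_s:
            continue  # still inside the grace period: the result is ignored
        # First check that counts: it passes only if the app has finished booting.
        return "healthy" if t >= startup_s else "terminated"
    return "healthy"
```

With a 120-second startup, a 60-second grace period gets the instance terminated at the 90-second check, while a 120-second grace period lets it pass its first counted check at 150 seconds.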

Set the ALB deregistration delay to at least your longest request so targets drain cleanly

The ALB deregistration (connection-draining) delay, default 300 s, configurable from 0 to 3600 s, is how long the ALB waits before it finishes removing a target during scale-in: it immediately stops sending new requests to the removed target but holds off completing removal until the delay elapses. That window lets in-flight requests complete gracefully. Set the delay to at least your longest expected request time so long-running responses finish instead of being severed mid-flight, which would otherwise hand clients an HTTP 5xx during an ordinary scale-in or deployment.

Trap Setting the deregistration delay to 0 (or under your longest request) just to scale in faster: long in-flight requests get cut off and clients see 5xx errors.
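
To see which requests a given delay would cut off, the following sketch compares each in-flight request's remaining time against the delay. The function name and sample durations are hypothetical; it assumes the target itself stays up for the full delay.

```python
from typing import List


def severed_requests(remaining_s: List[float], delay_s: float) -> List[float]:
    """Return the in-flight requests that outlast the deregistration delay."""
    if not 0 <= delay_s <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    # Anything still running when the delay expires is cut off mid-flight.
    return [r for r in remaining_s if r > delay_s]
```

For requests with 5, 45, and 400 seconds left, the default 300-second delay severs only the 400-second one, while a delay of 0 severs all three.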

Use ALB slow start to warm up targets you've just registered

ALB slow start gradually ramps a freshly registered target's share of requests up to its full proportion over a window you choose, anywhere from 30 to 900 s, instead of hitting it with full traffic the instant it's healthy. Reach for it when an instance needs to warm up before it can perform (warming a cache, loading a large dataset into memory, or waiting for JIT compilation to kick in). During the ramp the target takes a smaller slice while it gets up to speed, so users don't pay for its cold-start latency, and once warm it joins normal balancing.

Trap Expecting connection draining (the deregistration delay) to ease a new target in: draining handles de-registration; slow start is what ramps a freshly registered target up.
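
The ramp can be sketched as a simple function of elapsed time. This assumes a linear ramp from zero to the target's full share; the function name is hypothetical and the exact request counts an ALB sends are not modeled.

```python
def slow_start_share(elapsed_s: float, duration_s: float) -> float:
    """Fraction of its full traffic share a target receives during slow start."""
    if not 30 <= duration_s <= 900:
        raise ValueError("slow start duration must be between 30 and 900 seconds")
    # Linear ramp, clamped so the target never exceeds its full share.
    return min(1.0, max(0.0, elapsed_s / duration_s))
```

Halfway through a 60-second window the target carries half its eventual share; after the window it carries all of it.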

ALB cross-zone load balancing is always on at the LB; you can only turn it off per target group

ALB cross-zone load balancing is always enabled at the load balancer level with no switch to disable it there. The only place you can turn it off is per target group, which overrides the LB default for that group. While it's on, every LB node spreads traffic evenly across all registered targets in every enabled AZ, so an AZ with fewer targets still gets its fair share rather than overloading its handful of instances. This even spreading is why ALB defaults it on, whereas an NLB leaves it off by default and lets you toggle it at the LB.

Trap Trying to disable cross-zone balancing at the ALB level the way you can on an NLB: for an ALB it's fixed on at the LB, and the target group is the only place you can change it.
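
The arithmetic behind "fair share" is easy to check. The sketch below assumes clients are split evenly across one load balancer node per AZ, which is a simplification of DNS behavior; the AZ names and target counts are made up.

```python
from typing import Dict


def per_target_share(targets_per_az: Dict[str, int], cross_zone: bool) -> Dict[str, float]:
    """Share of total traffic that each single target in an AZ receives."""
    active = {az: n for az, n in targets_per_az.items() if n > 0}
    total_targets = sum(active.values())
    shares = {}
    for az, n in active.items():
        if cross_zone:
            # Every node spreads across all targets, so each target gets an equal slice.
            shares[az] = 1 / total_targets
        else:
            # Each AZ's node receives an equal slice and splits it among local targets only.
            shares[az] = (1 / len(active)) / n
    return shares
```

With 2 targets in one AZ and 8 in another, cross-zone on gives every target 10% of traffic; with it off, each of the 2 targets takes 25% while each of the 8 takes 6.25%.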

An NLB gives you one static IP per AZ, and you can add Elastic IPs for fixed addresses

An NLB automatically provisions one static IP per enabled AZ, and for internet-facing NLBs you can assign your own Elastic IP per AZ, giving clients fixed addresses they can allowlist in downstream firewalls. It operates at Layer 4 (TCP/UDP), delivers ultra-low latency, and preserves the client's source IP (the default for instance-type targets), which matters when the backend needs to see who's actually connecting. Choose the NLB over an ALB whenever a stable IP or raw L4 performance is the requirement, since the ALB gives you neither.

Trap Expecting an ALB to give you a static IP for firewall allowlisting: ALB IPs are dynamic, and only the NLB offers per-AZ static / Elastic IPs.

Latency routing plus Evaluate Target Health gives you active-active multi-region failover

Set Evaluate Target Health to Yes on latency-based records to get active-active multi-region failover: while everything's healthy every region serves its own lowest-latency traffic, and the moment a region's resources turn unhealthy Route 53 stops routing users there and they fall to the next-closest healthy region. ETH is what makes the health of the underlying resources actually drive DNS: without it, latency routing keeps sending users to the nearest region even after it's down. If you've layered latency over weighted records, turning ETH on at the top-level alias means Route 53 calls a region unhealthy only once all its underlying weighted records have failed.

Trap Leaving Evaluate Target Health off on latency records: Route 53 will keep sending users to the nearest region even after its endpoints are down.

Roll up many endpoint checks with a Route 53 calculated health check

A Route 53 calculated health check doesn't probe an endpoint itself; it watches a set of child health checks and reports healthy as long as the number of healthy children meets a threshold you define. That lets you express a quorum-style condition (stay healthy as long as at least 2 of 6 servers are up) so DNS failover fires only when a meaningful chunk of capacity is gone rather than overreacting to any single endpoint blipping out. It's the tool for turning many individual checks into one aggregate health signal.
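
The quorum rule reduces to a count-and-compare. A minimal sketch, with a hypothetical function name and child results passed in as booleans:

```python
from typing import List


def calculated_health(children_healthy: List[bool], threshold: int) -> bool:
    """Healthy when at least `threshold` child health checks are healthy."""
    return sum(children_healthy) >= threshold
```

For the 2-of-6 example, the calculated check stays healthy with two servers up and flips to unhealthy when only one remains.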

Weighted records plus health checks already give you active-active failover

Weighted records with health checks already give you active-active failover: you don't need the dedicated Failover policy, since any routing policy other than Failover becomes active-active once you attach health checks to its records. With weighted records, Route 53 splits traffic by the configured weights while everything's healthy, then drops any record whose health check fails and redistributes its share among the remaining healthy records. A record set to weight zero acts as a pure standby: it receives no traffic until every nonzero-weight record has gone unhealthy, at which point Route 53 falls back to it.

Trap Assuming you need the Failover policy to get failover: any non-Failover policy with health checks attached already drops unhealthy records on its own.
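
The redistribution described above can be sketched as follows. Record names and weights are hypothetical, and the case where nothing at all is healthy is deliberately left unmodeled (the function returns an empty mapping).

```python
from typing import Dict, Tuple


def effective_split(records: Dict[str, Tuple[int, bool]]) -> Dict[str, float]:
    """Map record name -> traffic fraction, given (weight, healthy) per record."""
    live = {name: w for name, (w, healthy) in records.items() if healthy and w > 0}
    if live:
        total = sum(live.values())
        return {name: w / total for name, w in live.items()}
    # Every nonzero-weight record is unhealthy: fall back to healthy zero-weight standbys.
    standby = [name for name, (w, healthy) in records.items() if healthy and w == 0]
    return {name: 1 / len(standby) for name in standby}
```

With weights 70, 30, and 0, the standby gets nothing while either weighted record is healthy; lose the 70 and the 30 takes everything; lose both and the zero-weight record takes over.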

Multi-tier DNS: latency alias to pick the region, weighted records within it

Multi-tier DNS nests routing policies for multi-region: latency alias records at the top choose the best region for each user, and each points at weighted records inside that region to spread traffic across local resources. Turn on Evaluate Target Health at the latency alias and health cascades up the tree: Route 53 counts a region healthy only when at least one of its weighted children is healthy, so a region with all children down is removed from latency routing automatically. Layering the policies this way optimizes region choice and in-region distribution at once.

Trap Leaving Evaluate Target Health off the latency alias: without it health doesn't cascade, and Route 53 keeps routing users to a region whose endpoints are all down.
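
The cascade can be modeled as a two-level lookup. This is an illustrative sketch only: region names, latencies, and the function itself are assumptions, and a region counts as healthy when at least one weighted child is.

```python
from typing import Dict, List, Optional


def pick_region(latency_ms: Dict[str, float],
                children_healthy: Dict[str, List[bool]],
                evaluate_target_health: bool = True) -> Optional[str]:
    """Choose the lowest-latency region, skipping dead regions only when ETH is on."""
    candidates = [
        region for region in latency_ms
        if not evaluate_target_health or any(children_healthy[region])
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])
```

If the nearest region has every child down, the ETH-enabled call returns the next-closest healthy region, while the ETH-disabled call still returns the dead one.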

Aurora failover runs through priority tiers from 0 first to 15 last

Aurora failover runs through promotion priority tiers: every Aurora Replica carries a tier from 0 to 15, and when the writer fails Aurora promotes whichever available replica holds the lowest tier number (tier 0 first, tier 15 last). Use that to control which replica becomes the new writer: give tier 0 to your preferred standby, for example one whose instance class matches the primary so capacity doesn't drop on promotion, and push replicas reserved for analytics or reporting into higher tiers so they're promoted only as a last resort. Ties at the same tier are broken by the largest instance size.

Trap Reading a higher tier number as higher priority: tier 0 is promoted first, so the biggest number is promoted last.
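
The promotion order amounts to a sort on (tier, size). In this sketch the replica names are hypothetical and instance size is reduced to a single number such as vCPU count, which is an assumption made for illustration.

```python
from typing import List, NamedTuple


class Replica(NamedTuple):
    name: str
    tier: int   # promotion priority, 0 (first) to 15 (last)
    size: int   # stand-in for instance size, e.g. vCPU count


def promotion_candidate(replicas: List[Replica]) -> str:
    """Lowest tier wins; within a tier, the largest instance wins."""
    best = min(replicas, key=lambda r: (r.tier, -r.size))
    return best.name
```

A tier-0 replica is chosen over a larger tier-15 reporting replica, and between two tier-0 replicas the bigger one is promoted.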

Copy RDS snapshots cross-region or cross-account to seed your DR

RDS automated snapshots are tied to the region and account where they were taken: they can't be shared with another account directly, and they expire with the backup retention period, which makes them unsuitable on their own for cross-region or cross-account DR. Manual snapshots, or copies you make of an automated one^[17] (a copy is itself a manual snapshot), can be copied cross-region (an encrypted snapshot is re-encrypted with a KMS key in the destination region) or shared cross-account, and that copied snapshot is exactly what seeds a pilot-light or warm-standby environment in the recovery region. So the DR-ready artifact is always a manual or copied snapshot, never the raw automated one.

Trap Relying on automated snapshots for cross-region or cross-account DR: they stay in the source region and can't be shared directly, so you have to take a manual snapshot or make a copy first, then copy or share that.
