Domain 2 of 4 · Chapter 2 of 2

Highly Available and Fault-Tolerant Architectures

Auto Scaling: target tracking vs step vs scheduled — when each applies

Auto Scaling Groups (ASGs) replace failed instances and scale capacity. There are five scaling approaches, each suited to a different signal pattern.

Target tracking scaling (the recommended starting point for new ASGs):

  • Pick a metric + target value: e.g. 'average CPU = 50%'.
  • ASG auto-creates two CloudWatch alarms (scale-out + scale-in) and adjusts capacity to maintain the target.
  • Built-in metrics: ASGAverageCPUUtilization, ASGAverageNetworkIn/Out, ALBRequestCountPerTarget.
  • Custom metrics: any CloudWatch metric (e.g. queue depth, custom application metric).
  • Use for: most scaling needs. Simplest to configure, and Auto Scaling works out the size of each adjustment for you.
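
As a rough sketch of the idea behind target tracking, capacity moves roughly in proportion to how far the metric sits from the target. The helper name `estimate_desired_capacity` and its proportional formula are illustrative assumptions, not the exact algorithm Auto Scaling runs; the dictionary mirrors the shape of the `TargetTrackingConfiguration` passed to the `PutScalingPolicy` API.

```python
import math

# Shape of a target tracking configuration as passed to PutScalingPolicy
# (values are the 'average CPU = 50%' example from the text).
TARGET_TRACKING_CONFIG = {
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization",
    },
    "TargetValue": 50.0,
}


def estimate_desired_capacity(current_capacity: int, metric_value: float,
                              target_value: float) -> int:
    """Illustrative approximation: scale capacity by metric / target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return max(1, math.ceil(current_capacity * metric_value / target_value))


target = TARGET_TRACKING_CONFIG["TargetValue"]
# 10 instances running at 75% average CPU against a 50% target.
print(estimate_desired_capacity(10, 75.0, target))
```

The same arithmetic explains scale-in: at 25% CPU the estimate drops to half the current capacity.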

Step scaling:

  • Define explicit steps: 'if CPU 70-80, add 1 instance; if CPU 80-90, add 2; if CPU > 90, add 4'.
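  • Instance warmup prevents over-reacting: a newly launched instance is not counted toward the group's metrics until its warmup (e.g. 60 seconds) has elapsed. Cooldown periods apply to simple scaling, not step scaling.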
  • Use for: workloads that need precise control over scaling magnitude relative to load severity.
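
The step example above can be sketched as a lookup table. This is a minimal illustration, assuming each step's lower bound is inclusive and its upper bound exclusive; the names `STEPS` and `step_adjustment` are hypothetical.

```python
# (lower_bound, upper_bound, instances_to_add); upper_bound None = no upper limit.
STEPS = [
    (70.0, 80.0, 1),
    (80.0, 90.0, 2),
    (90.0, None, 4),
]


def step_adjustment(cpu_percent: float) -> int:
    """Return how many instances to add for the observed CPU percentage."""
    for lower, upper, instances_to_add in STEPS:
        if cpu_percent >= lower and (upper is None or cpu_percent < upper):
            return instances_to_add
    return 0  # below the first step: no scale-out


for cpu in (50, 75, 85, 95):
    print(cpu, step_adjustment(cpu))
```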

Simple scaling (legacy):

  • One alarm → one scaling action (e.g. add 1 instance).
  • Don't use for new designs — superseded by step scaling.

Scheduled scaling:

  • Set capacity changes by clock (UTC by default; a time zone can be specified): 'at 09:00 weekdays scale to 20 instances; at 18:00 scale to 5'.
  • Use for: predictable load patterns (business-hours workloads, scheduled batch processing).
  • Stack with target tracking: scheduled sets baseline; target tracking handles deviations.
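
The schedule in the example can be modelled as a function of the clock. This is a sketch only: the function name is hypothetical, and the cron strings in the comments assume the standard five-field cron format used for recurring scheduled actions.

```python
from datetime import datetime, timezone

BUSINESS_HOURS_CAPACITY = 20  # recurrence "0 9 * * 1-5"  (09:00 UTC, Mon-Fri)
OFF_HOURS_CAPACITY = 5        # recurrence "0 18 * * 1-5" (18:00 UTC, Mon-Fri)


def scheduled_capacity(now_utc: datetime) -> int:
    """Baseline capacity that the two scheduled actions would leave in place."""
    is_weekday = now_utc.weekday() < 5          # Monday=0 ... Friday=4
    in_business_hours = 9 <= now_utc.hour < 18
    if is_weekday and in_business_hours:
        return BUSINESS_HOURS_CAPACITY
    return OFF_HOURS_CAPACITY


monday_morning = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
print(scheduled_capacity(monday_morning))
```

A target tracking policy stacked on top would then adjust around this baseline when load deviates from the schedule.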

Predictive scaling:

  • ML-based forecast of future load using up to 14 days of metric history.
  • Pre-emptively scales BEFORE load hits.
  • Best for cyclical loads with 24h+ patterns.

Instance refresh:

  • Replaces instances in the ASG (rolling) — useful for deploying a new AMI without downtime.
  • Configurable healthy percentage, instance warmup, and skip-matching to avoid replacing instances already on the new template.

ELB vs EC2 health check type:

  • HealthCheckType=EC2: only EC2 instance-level health (hardware fail).
  • HealthCheckType=ELB: includes load balancer health checks (app-level failures).
  • ALWAYS use ELB in production — catches app crashes that EC2 health misses.

Termination policy (which instance to terminate when scaling in):

  • Default: balance across AZs, then OldestLaunchTemplate, then ClosestToNextInstanceHour.
  • Customizable. The OldestInstance variant is common for canary-style rolling deploys.

Lifecycle hooks:

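  • Pause instance launch (autoscaling:EC2_INSTANCE_LAUNCHING) while custom setup runs. The default heartbeat timeout is 1 hour; heartbeats can extend the wait to at most 48 hours (or 100 times the heartbeat timeout, whichever is smaller).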
  • Pause instance terminate (autoscaling:EC2_INSTANCE_TERMINATING) to drain connections / snapshot logs before termination.
  • Hook completes with CompleteLifecycleAction API.
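
A minimal sketch of finishing a hook: the function below builds the parameters for a CompleteLifecycleAction call from the detail section of a lifecycle event notification. The function name and the sample values are hypothetical; the resulting dictionary would be passed to an SDK client rather than used on its own.

```python
def build_complete_lifecycle_action(detail: dict, setup_succeeded: bool) -> dict:
    """Map a lifecycle event's detail fields to CompleteLifecycleAction parameters."""
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        # CONTINUE lets the launch or termination proceed; ABANDON aborts it.
        "LifecycleActionResult": "CONTINUE" if setup_succeeded else "ABANDON",
    }


# Hypothetical event detail, for illustration only.
sample_detail = {
    "LifecycleHookName": "example-launch-hook",
    "AutoScalingGroupName": "example-asg",
    "LifecycleActionToken": "example-token",
    "EC2InstanceId": "i-0123456789abcdef0",
}
print(build_complete_lifecycle_action(sample_detail, setup_succeeded=True))
```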

Warm pools (cost optimization):

  • Pre-initialize instances in a 'warm pool' (stopped state) — launch is faster than from scratch.
  • Used for apps with long warmup (large container pulls, JIT warmup, cache priming).

Route 53 routing policies catalog: 7 policies, 7 use cases

Route 53 supports seven routing policies. The exam tests pattern recognition — match the scenario to the policy.

1. Simple routing:

  • One record name → one or more values.
  • If multiple values, Route 53 returns all in random order; client picks one.
  • Use for: static / single-target records (cname.example.com → fixed IP).

2. Weighted routing:

  • Multiple records, each with a weight (0-255).
  • Route 53 distributes traffic proportionally to weights.
  • Use for: blue/green or canary deployment (90% to v1, 10% to v2; ramp up).
  • Setting weight=0 effectively disables a record without deleting it.
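
The proportional split works out as weight divided by the sum of weights. A small sketch (the function name is illustrative) that also covers the edge case where every weight is 0, in which Route 53 routes to all records with equal probability:

```python
def traffic_shares(weights: dict) -> dict:
    """Return the expected fraction of DNS responses per record."""
    for name, weight in weights.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {name} must be between 0 and 255")
    total = sum(weights.values())
    if total == 0:
        # All weights zero: every record is returned with equal probability.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


print(traffic_shares({"v1": 90, "v2": 10}))   # canary at 10%
print(traffic_shares({"v1": 100, "v2": 0}))   # v2 kept but disabled
```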

3. Failover routing:

  • One primary + one secondary record.
  • Route 53 health check on primary; if it fails, secondary serves.
  • Use for: active/passive DR. Primary in us-east-1, secondary in us-west-2.
  • Critical: the health check on the primary is required — without it, Route 53 always serves the primary.

4. Latency-based routing:

  • One record per region, each pointing to that region's endpoint.
  • Route 53 picks the region with lowest measured latency from the user's edge location.
  • Latency table is pre-measured (not real-time per-request).
  • Use for: global apps minimising user-facing latency.

5. Geolocation routing:

  • Route by user's country / state.
  • Records specify continent / country / US state.
  • Use for: compliance ('EU users must hit EU region'), content localisation, geo-blocking.
  • Always include a Default record for users from regions you didn't enumerate.

6. Geoproximity routing (Traffic Flow only):

  • Similar to geolocation but supports BIAS — push more traffic to a specific region/resource.
  • Use for: shifting traffic during region failover, gradual migration between regions.

7. Multi-value answer routing:

  • Returns up to 8 healthy records.
  • Client-side load balancing (the client picks one).
  • Each record can have a health check; only healthy records returned.
  • Use for: poor-man's load balancing for non-HTTP traffic where you don't want an actual ELB.

Combining policies (Traffic Flow):

  • Route 53 Traffic Flow lets you nest policies (e.g. geolocation routing → latency routing within each geo).
  • Visual editor; version-controlled traffic policies.

Health checks:

  • Default check: every 30 seconds; consider healthy after 3 successes / unhealthy after 3 failures.
  • Fast check: every 10 seconds (paid).
  • Types: endpoint (HTTP/HTTPS/TCP), CloudWatch alarm (alarm state), calculated (AND/OR of other checks).
  • Calculated health checks: 'healthy if at least 2 of 3 endpoints are healthy' — useful for multi-endpoint apps.
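
To make the thresholds concrete, here is a simplified model of how consecutive results flip an endpoint's state, plus the 'at least N healthy children' rule of a calculated check. It is a sketch of the counting logic only; the function names are hypothetical and real health checkers run from multiple locations.

```python
def track_health(results, threshold=3, healthy=True):
    """Return the endpoint state after each check (True = healthy).

    The state flips only after `threshold` consecutive results that
    disagree with the current state.
    """
    streak = 0
    history = []
    for passed in results:
        if passed == healthy:
            streak = 0            # result agrees with the current state
        else:
            streak += 1
            if streak >= threshold:
                healthy = passed  # enough consecutive results: flip state
                streak = 0
        history.append(healthy)
    return history


def calculated_healthy(child_states, min_healthy):
    """Calculated check: healthy if at least `min_healthy` children are healthy."""
    return sum(child_states) >= min_healthy


print(track_health([False, False, False, True, True, True]))
print(calculated_healthy([True, True, False], min_healthy=2))
```

With a 30-second interval and a threshold of 3, this model takes roughly 90 seconds of failures before the endpoint is marked unhealthy, which is why the 10-second fast check matters for tight RTOs.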

Common SAA scenarios:

  • 'DR failover Region' → Failover routing + health checks.
  • 'Blue/green deployment' → Weighted routing, ramp from 0 → 100%.
  • 'Global app, lowest latency' → Latency-based routing.
  • 'EU compliance' → Geolocation routing.
  • 'No ELB but load balance' → Multi-value answer routing.

Multi-AZ vs Read Replicas vs Aurora Replicas — the three RDS scaling primitives

RDS / Aurora offer three distinct mechanisms for HA and read scaling. They solve different problems; the exam tests which to apply.

Multi-AZ deployment (HA):

  • Synchronous standby in a different AZ (same region).
  • Standby is NOT readable — it exists only for failover.
  • Failover triggers: planned (instance upgrade, OS patching) or unplanned (hardware fail, AZ outage).
  • Failover time: typically 60-120 seconds. DNS endpoint updates to point at the standby.
  • Clients with cached DNS see ~60-120 s of errors; app should reconnect on connection failure.
  • Cost: ~2× single-AZ (you pay for the standby too).
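
Because the endpoint is unreachable for part of the failover window, the application side usually needs a reconnect loop. A minimal sketch with exponential backoff follows; the function name and the `connect` callable are assumptions, and a real application would wrap its own database driver's connect call.

```python
import time


def connect_with_retry(connect, max_attempts=8, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: the outage outlasted the retry budget
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Demo with a fake connection that fails twice, standing in for a failover window.
attempts = {"count": 0}


def fake_connect():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ConnectionError("endpoint not reachable yet")
    return "connected"


waits = []
result = connect_with_retry(fake_connect, sleep=waits.append)
print(result, waits)
```

With these defaults the waits add up to about 90 seconds, which is in the same range as the failover time quoted above.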

Use for: any production workload with availability SLA. The default 'is this production?' answer.

Multi-AZ DB cluster (newer, RDS MySQL/PostgreSQL):

  • One writer + two readable standbys across 3 AZs.
  • Standbys ARE readable (unlike traditional Multi-AZ).
  • Faster failover (~35 seconds typical).
  • Better than traditional Multi-AZ for reads + faster failover.

Read Replicas (read scaling):

  • Asynchronous replication from primary to N replicas (up to 15 per source for most engines).
  • Replicas serve read-only traffic — offload from primary.
  • Can be in SAME AZ, DIFFERENT AZ, or DIFFERENT REGION (cross-region replication).
  • Cross-region replicas can be promoted to standalone read-write (used in DR).
  • Replication lag varies; can be seconds to minutes under high write load.

Use for: read-heavy workloads, geographic read distribution (cross-region replica = lower latency for distant users), DR (cross-region replica = recovery target).

Cascading replicas (RDS for MySQL, MariaDB, and PostgreSQL 14.1+): a read replica can have its own read replicas. Useful for multi-region read distribution without burdening the primary.

Aurora Replicas (Aurora-specific):

  • Up to 15 Aurora Replicas per cluster, all sharing the same storage layer.
  • Sub-100ms replica lag typical (often < 10 ms) — shared storage means no log shipping.
  • Auto-promotion: if writer fails, Aurora promotes a replica → ~1 minute failover.
  • Reader endpoint load-balances across all Aurora Replicas.
  • Cross-region: use Aurora Global Database (separate feature) for < 1 s cross-region replication.

Use for: any Aurora workload — Replicas serve reads AND act as warm failover targets. Beats RDS Multi-AZ + Read Replicas combined for most use cases.

Decision pattern:

  • 'Production HA, MySQL / Oracle / SQL Server' → Multi-AZ deployment.
  • 'Read-heavy workload, RDS' → Multi-AZ + Read Replicas (HA + read scale).
  • 'Aurora workload, any use case' → Aurora Replicas (cover both HA and reads).
  • 'Cross-region DR with < 1 s RPO' → Aurora Global Database (not Read Replicas, whose log-based replication lag can reach seconds to minutes).
  • 'Read scaling for distant users (latency)' → Cross-region Read Replicas or Aurora Global Database.

Common SAA traps:

  • 'Multi-AZ for read scaling' → NO; standby isn't readable (unless Multi-AZ DB cluster mode).
  • 'Promote read replica during a regional failure' → for cross-region replicas, YES; for same-region, depends on use case.
  • 'Aurora Global Database is the same as Aurora Replicas' → NO; Global Database adds cross-region replication.

Aurora Global Database: setup, failover, cross-region read scale

Aurora Global Database (AGDB) replicates an Aurora cluster across regions via a dedicated network channel. Different from cross-region read replicas in three important ways.

vs cross-region Read Replicas:

  • AGDB: < 1 second cross-region RPO typical; managed failover; secondary region clusters are full Aurora clusters (auto-scale storage, replicas, etc.).
  • Read Replicas: log-shipping based; lag can be seconds-minutes; manual failover by promote.

Architecture:

  • 1 primary region (1 writer + up to 15 readers).
  • Up to 5 secondary regions, each with up to 15 readers (up to 75 readers across the secondary regions, in addition to the primary region's readers).
  • Replication is one-way: primary → secondaries. Secondaries can't accept writes (read-only).

Use cases:

  1. Cross-region DR (RPO < 1 s, RTO ~1 min via managed promote).
  2. Low-latency reads for global users — local secondary region serves reads to nearby users.
  3. Read scaling across regions — analytics or reporting workloads on a secondary region don't impact primary write performance.

Setup steps:

  1. Create an Aurora cluster in the primary region (MySQL 5.6+ / 5.7+ / 8.0 or PostgreSQL 10+).
  2. From the cluster's Actions menu, choose 'Add AWS Region' → pick the secondary region.
  3. Aurora provisions the secondary cluster with the same engine version. If the primary is encrypted, the secondary must be encrypted too, using a KMS key that exists in the secondary region (a multi-region KMS key is one option, not a requirement).
  4. Replication begins automatically.

Managed planned failover (now called switchover; zero data loss):

  • Use for: failover testing, primary-region maintenance, controlled migration.
  • Aurora pauses writes briefly, ensures secondary has replicated all changes, promotes secondary to primary, repoints replication.
  • ~ 1 minute total downtime.

Unplanned failover (cross-region):

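  • Manually detach the secondary with RemoveFromGlobalCluster, which promotes it to a standalone read-write cluster, then repoint the application. Recent engine versions also support a managed cross-region failover through FailoverGlobalCluster.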
  • Or use a managed playbook with Route 53 health-check failover routing.
  • Data loss possible (whatever wasn't replicated when primary failed).

Write forwarding (newer feature):

  • Applications in the secondary region can issue writes — Aurora forwards them transparently to the primary region.
  • Useful for read-mostly apps where occasional writes don't justify a full primary topology in the secondary region.
  • Adds cross-region latency on writes — not for high-write workloads.

Cost:

  • Per-secondary-region pricing: each secondary region is a full Aurora cluster (compute + storage + I/O charges).
  • Replication bandwidth between regions ≈ standard cross-region data transfer rates.
  • Generally 2-3× a single-region cluster.

Limitations:

  • Same engine version + family across regions (no major-version mixing).
  • 5 secondary regions max (15 reader instances each).
  • Parallel Query, Aurora Serverless v2 features have specific compatibility — check current docs.

Common exam scenarios:

  • 'Cross-region DR with < 1 min failover' → Aurora Global Database.
  • 'Global read scaling, no manual promote needed' → Aurora Global Database (reads served by secondary regions).
  • 'Need writes in multiple regions simultaneously' → NOT AGDB; consider DynamoDB Global Tables (multi-active writes) or accept eventual consistency.

S3 Cross-Region Replication: filters, versioning, RTC, batch backfill

S3 Cross-Region Replication (CRR) asynchronously copies new objects from a source bucket to a destination bucket in a different region. Same-Region Replication (SRR) does the same within a region.

Prerequisites:

  • Versioning must be enabled on BOTH source and destination buckets.
  • IAM role with permissions to read from source + write to destination + (if encrypted) use KMS keys in both regions.
  • Source and destination can be in same or different accounts.

Replication rules (define what to copy):

  • Filter by prefix: prefix: "customer-data/" → only objects with that prefix replicate.
  • Filter by tag: tag: { team=production } → only tagged objects replicate.
  • Filter combined (prefix AND tag).
  • Multiple rules per bucket; priority order resolves conflicts.
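
As an illustration, a rule combining both filters might look like the dictionary below, which follows the shape of one entry in the `Rules` list of an S3 replication configuration. The rule ID and bucket ARN are hypothetical, and `rule_matches` is a local helper written for this sketch to show the prefix AND tag semantics; it is not an S3 API.

```python
REPLICATION_RULE = {
    "ID": "replicate-production-customer-data",   # hypothetical rule name
    "Priority": 1,                                # resolves conflicts between rules
    "Status": "Enabled",
    "Filter": {
        "And": {
            "Prefix": "customer-data/",
            "Tags": [{"Key": "team", "Value": "production"}],
        }
    },
    "DeleteMarkerReplication": {"Status": "Disabled"},
    "Destination": {"Bucket": "arn:aws:s3:::example-destination-bucket"},
}


def rule_matches(rule: dict, key: str, tags: dict) -> bool:
    """Return True if an object with this key and tag set falls under the rule."""
    flt = rule.get("Filter", {})
    flt = flt.get("And", flt)             # unwrap the combined form if present
    prefix = flt.get("Prefix", "")
    required = list(flt.get("Tags", []))
    if "Tag" in flt:                      # single-tag form of the filter
        required.append(flt["Tag"])
    return key.startswith(prefix) and all(
        tags.get(tag["Key"]) == tag["Value"] for tag in required
    )


print(rule_matches(REPLICATION_RULE, "customer-data/2024/a.csv", {"team": "production"}))
print(rule_matches(REPLICATION_RULE, "logs/app.log", {"team": "production"}))
```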

What replicates:

  • New objects after rule creation. Existing objects do NOT replicate by default.
  • Object metadata, tags, ACLs.
  • Delete markers (configurable; default off).
  • Replica modification sync: an opt-in feature that copies metadata changes made on replicas (tags, ACLs, Object Lock settings) back to the source, for two-way scenarios.

What does NOT replicate:

  • Objects encrypted with SSE-C (customer-supplied keys).
  • Objects created before replication was enabled (use S3 Batch Replication for backfill).
  • Permanent deletes (only delete markers can replicate, and only if configured).

Replication Time Control (RTC) — SLA-backed replication:

  • Without RTC: replication is best-effort; typically seconds-minutes but no guarantee.
  • With RTC: AWS guarantees 99.99% of objects replicate within 15 minutes.
  • RTC adds CloudWatch metrics (replication latency, missed-SLA count).
  • Cost: $0.015 per GB replicated (in addition to standard CRR data transfer + per-request charges).
  • Use for: compliance scenarios requiring guaranteed replication time, business-critical pipelines.

S3 Batch Replication (backfill existing objects):

  • Replicates objects that existed BEFORE replication rule creation.
  • Replicates objects that previously failed replication.
  • Uses S3 Batch Operations under the hood.
  • One-time operation per backfill request; pay per object processed.

Cross-account CRR:

  • Destination bucket policy must allow the source account's replication role.
  • Object ownership: by default, the source-account replication role writes the replica → source account owns replica. Use bucket-owner-full-control or set AccessControlTranslation so destination account owns.

KMS-encrypted CRR:

  • Source-side KMS key: replication role needs kms:Decrypt.
  • Destination-side KMS key: replication role needs kms:GenerateDataKey / kms:Encrypt.
  • Multi-region KMS keys (since 2021) can be used on both sides, but S3 currently treats them as single-region keys, so the replica is still decrypted with the source key and re-encrypted with the destination key.

Common patterns:

  • Compliance — data must exist in 2 regions: CRR with RTC.
  • DR for static content (web assets, software downloads): CRR to alternate region; failover via Route 53.
  • Compliance — data must NOT leave a region: do NOT enable CRR; use Same-Region Replication (SRR) to a separate AZ-resilient bucket if needed.
  • Analytics on a copy: CRR to a dedicated analytics bucket in a region close to analytics tooling.

Cost:

  • Cross-region data transfer (~$0.02/GB depending on region pair).
  • PUT requests on destination (standard pricing).
  • Storage on destination (standard pricing for the chosen storage class).
  • Optional RTC ($0.015/GB).
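
A quick back-of-the-envelope calculation using the per-GB figures above. This sketch deliberately ignores PUT request and destination storage charges, and the rates are the approximate ones quoted in this section rather than current price-list values.

```python
TRANSFER_PER_GB = 0.02   # approximate cross-region transfer rate from the list above
RTC_PER_GB = 0.015       # Replication Time Control surcharge


def monthly_replication_cost(replicated_gb: float, use_rtc: bool = False) -> float:
    """Estimate transfer (+ optional RTC) cost in USD for one month of replication."""
    per_gb = TRANSFER_PER_GB + (RTC_PER_GB if use_rtc else 0.0)
    return round(replicated_gb * per_gb, 2)


print(monthly_replication_cost(1000))                 # 1 TB without RTC
print(monthly_replication_cost(1000, use_rtc=True))   # 1 TB with RTC
```

For 1,000 GB the estimate is $20 without RTC and $35 with it, so RTC nearly doubles the per-GB replication charge.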

DR strategy worked examples: cost + RTO + RPO trade-off across all four patterns

The four canonical DR strategies form a cost / recovery-time spectrum. Each maps to a specific RTO + RPO range. The exam tests pattern matching.

Worked Example 1 — Backup-and-restore (RTO: hours-days, RPO: hours):

Scenario: small business, daily backups, can tolerate ~24h of data loss + 4-8h to recover.

Architecture:

  • AWS Backup runs daily snapshots of RDS, EBS, DynamoDB, EFS.
  • Snapshots copied to alternate region nightly.
  • Application infrastructure as code (CloudFormation / CDK) ready to deploy.
  • On disaster: deploy infrastructure in DR region, restore latest snapshots, repoint DNS.

Monthly cost: ~$50/month (snapshot storage + cross-region copy). RTO: 4-8 hours (deploy infra + restore data). RPO: 24 hours (last backup).


Worked Example 2 — Pilot light (RTO: 10s of minutes, RPO: minutes):

Scenario: e-commerce platform, ~30 min RTO budget, ~1 min RPO budget (last few orders may be lost).

Architecture:

  • Primary region: full stack (ALB + EC2 ASG + RDS Multi-AZ + ElastiCache + S3).
  • DR region: RDS Read Replica (continuously replicated), S3 CRR enabled, EC2 launch templates ready but no instances running, ALB pre-provisioned with empty target group.
  • On disaster: promote RDS replica to standalone, launch EC2 ASG (size from launch template), register with ALB, update Route 53 failover record.

Monthly cost: ~$500/month (read replica + cross-region S3 + idle infrastructure). RTO: 15-30 minutes (mostly EC2 launch + warmup). RPO: 1 minute (read replica lag).


Worked Example 3 — Warm standby (RTO: minutes, RPO: seconds):

Scenario: SaaS application, < 5 min RTO budget, < 1 min RPO budget.

Architecture:

  • Primary region: full stack at full capacity.
  • DR region: full stack at REDUCED capacity (e.g. minimum ASG = 2 instances vs 20 in primary). All components running.
  • Database: a cross-region replica of the primary (RDS Read Replica or Aurora Global Database secondary), running Multi-AZ in the DR region.
  • DNS uses Route 53 weighted routing: 100% primary, 0% DR (or active health checks).
  • On disaster: scale DR ASG up (Auto Scaling triggers), shift traffic via Route 53.

Monthly cost: ~$2,000/month (50% of primary, scaled down). RTO: 2-5 minutes (Auto Scaling scale-up time). RPO: < 30 seconds.


Worked Example 4 — Multi-region active-active (RTO: seconds, RPO: near-zero):

Scenario: global FinTech app, must handle region failure with no perceived downtime.

Architecture:

  • Primary AND secondary regions both serve traffic at full capacity (e.g. 50% to each via Route 53 latency-based routing).
  • Data layer: Aurora Global Database (< 1s RPO, managed failover) OR DynamoDB Global Tables (multi-active writes with last-writer-wins).
  • Caches: ElastiCache in each region (regionally local; global sync via DynamoDB Streams or Aurora replication).
  • Stateless tiers replicated in both regions.
  • On disaster: Route 53 health checks detect failure → traffic shifts to healthy region automatically.

Monthly cost: ~$5,000+/month (2× single-region; full stack everywhere). RTO: seconds (Route 53 health check + DNS TTL). RPO: near-zero (continuous replication).


Picking the right strategy:

Scenario | Strategy
Cost > RTO; some downtime acceptable | Backup-and-restore
Moderate cost; 15-30 min RTO acceptable | Pilot light
Low RTO required; SLA matters | Warm standby
Mission-critical; near-zero RTO | Multi-region active-active

Common exam phrasing → answer:

  • 'lowest cost DR' → Backup-and-restore.
  • 'pre-provisioned DB, on-demand compute' → Pilot light.
  • 'scaled-down but running stack' → Warm standby.
  • 'no perceived downtime, full stack everywhere' → Multi-region active-active.
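
The exam logic of picking the cheapest strategy that still meets the stated RTO and RPO can be sketched as a lookup. The figures for the first three strategies come from the worked examples above; the 60-second RTO and 1-second RPO used for active-active are assumptions standing in for 'seconds' and 'near-zero'.

```python
# (name, worst-case RTO in seconds, worst-case RPO in seconds, monthly cost in USD),
# ordered cheapest first.
STRATEGIES = [
    ("Backup-and-restore", 8 * 3600, 24 * 3600, 50),
    ("Pilot light", 30 * 60, 60, 500),
    ("Warm standby", 5 * 60, 30, 2000),
    ("Multi-region active-active", 60, 1, 5000),   # assumed numeric stand-ins
]


def cheapest_strategy(required_rto_s: float, required_rpo_s: float):
    """Return the cheapest strategy whose RTO and RPO meet the requirement."""
    for name, rto, rpo, _cost in STRATEGIES:
        if rto <= required_rto_s and rpo <= required_rpo_s:
            return name
    return None  # nothing in the table is fast enough


print(cheapest_strategy(30 * 60, 60))   # 30 min RTO, 1 min RPO
print(cheapest_strategy(5 * 60, 60))    # 5 min RTO, 1 min RPO
```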

DR strategies compared

Strategy | RTO | RPO | Cost | Pattern
Backup-and-restore | Hours to days | Hours (last backup) | $ | Backups in S3 / AWS Backup. Restore = build infra + restore data on failover.
Pilot light | 10s of minutes | Minutes (DB replicated continuously) | $$ | DB replicated to DR region; compute templates ready but stopped; start on failover.
Warm standby | Minutes | Seconds to minutes | $$$ | Downsized but running stack in DR region; scale up on failover.
Multi-region active-active | Seconds | Near-zero | $$$$ | Full stack running in 2+ regions; Route 53 latency / multi-value routing; data via Aurora Global / DynamoDB Global Tables.

Decision tree

RTO requirement?

  • Hours–days → Backup-and-restore ($, cheapest): AWS Backup → S3.
  • 10s of min → Pilot light ($$): DB live + compute off.
  • Minutes → Warm standby ($$$): downsized + running.
  • Seconds → Multi-region active-active ($$$$): full stack everywhere.

Stateful tier replication?

  • Cross-region → Aurora Global Database (< 1 min RPO).
  • Same region → RDS / Aurora Multi-AZ (synchronous standby).
  • Multi-region active → DynamoDB Global Tables (last-writer-wins).

Always: 2+ AZs for stateless tier; ASG + ELB health checks for auto-recovery.

Cheat sheet

Sharp facts the exam loves — scan these before test day.

Multi-AZ for in-region HA; Multi-Region for DR

Spread stateless tiers across ≥2 AZs behind a load balancer. RDS Multi-AZ: synchronous standby, 60-120 s failover. Multi-Region only when single-region failure is in your threat model — adds latency, cost, replication complexity.

Pick DR strategy by RTO and RPO

Four canonical strategies in increasing cost / decreasing RTO/RPO: backup-and-restore (hours), pilot-light (minutes-hours), warm-standby (minutes), multi-site active-active (near-zero). The exam picks the cheapest one that meets the stated RTO/RPO.

ASG + ELB + health checks is the default stateless tier

Auto Scaling Group across ≥2 AZs + ALB/NLB with health checks at the target group. Unhealthy instances drain + replace automatically. ASG min/max/desired control capacity; target-tracking or step scaling policies handle demand changes.

Route 53 has 7 routing policies, each for a specific intent

Simple (one answer), Weighted (split traffic), Latency (route to the lowest-latency region), Failover (active/passive via health check), Geolocation (by user country), Geoproximity (by lat/lon + bias), Multi-value (DNS-level load balancing).

RDS Multi-AZ failover: 60-120 seconds typical

Automatic failover[1] updates the DNS endpoint to the standby; clients with cached DNS will see ~60-120 s of errors. App needs to reconnect on connection failure. Standby is NOT readable — for read scaling, use Read Replicas[13] (separate feature).

Aurora Global Database: typically sub-second cross-region RPO

Replicates Aurora across regions[2] via dedicated network. RPO typically <1 s; failover RTO ~1 min (managed promotion). Secondary regions support read-only and can be promoted to writer. Big advantage over manual cross-region replicas.

Route 53 health checks: 30 s default interval, 3 failures = unhealthy

Default check interval 30 s[14], healthy threshold 3, unhealthy threshold 3. Fast failover requires faster checks (10 s interval supported, paid). Use 'calculated' health checks to combine multiple endpoint checks for AND/OR logic.

ASG health check type: EC2 vs ELB

ASG HealthCheckType=EC2[15] only replaces instances that the EC2 instance itself reports as unhealthy (hardware fail). HealthCheckType=ELB replaces instances that the load balancer's health check fails — catches app-layer failures too. Use ELB for production.

S3 Cross-Region Replication: async, prefix/tag-filtered

CRR replicates new objects[4] from source to destination bucket asynchronously (typically seconds). Existing objects need a one-time batch operation. Can filter by prefix or tag. Versioning must be on for both buckets.

AWS Backup centralizes backups across services + accounts

Backup plans + selections[8] cover RDS, DynamoDB, EFS, EBS, FSx, Storage Gateway, etc. Cross-region, cross-account copy supported. AWS Backup Audit Manager + Backup Vault Lock for compliance scenarios.

Route 53 failover record requires a health check on PRIMARY

Active/passive failover routing[16] needs the primary record to have an associated health check. If the check fails, the secondary record is served. Without the health check, Route 53 always serves the primary.

ASG health check grace period protects initializing instances

The health check grace period tells Auto Scaling how long to wait before evaluating the health of a newly launched instance after it enters InService state. Set it to at least as long as the application startup time; otherwise ELB health check failures during initialization cause continuous termination and replacement loops.

ALB deregistration delay (connection draining) for long requests

When a target is removed from an ALB target group, the load balancer stops sending new requests but waits for the deregistration delay (default 300 s, range 0-3600 s) before completing deregistration. Set this value to at least the maximum expected request processing time to prevent HTTP 5xx errors during scale-in events.

ALB slow start mode ramps traffic to new targets

Slow start mode causes the ALB to linearly increase the share of requests sent to a newly registered target over a configurable duration of 30–900 seconds. Use it when instances need a warm-up period (e.g., JIT cache warming, dataset loading) before they can handle their full share of traffic.
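
The linear ramp can be pictured with a one-line formula. This sketch only models the fraction of a full share over time; how the load balancer applies that internally is not specified here, and the function name is hypothetical.

```python
def slow_start_share(seconds_since_registration: float,
                     duration_seconds: int) -> float:
    """Fraction of its full traffic share a new target receives during slow start."""
    if not 30 <= duration_seconds <= 900:
        raise ValueError("slow start duration must be between 30 and 900 seconds")
    fraction = seconds_since_registration / duration_seconds
    return min(1.0, max(0.0, fraction))


for elapsed in (0, 150, 300, 600):
    print(elapsed, slow_start_share(elapsed, duration_seconds=300))
```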

ALB cross-zone load balancing always on at LB level, configurable per target group

For Application Load Balancers, cross-zone load balancing is always enabled at the load balancer level and cannot be turned off. However, it can be explicitly disabled at the target group level, overriding the load balancer default. When enabled, each LB node distributes traffic evenly across all registered targets in all enabled Availability Zones.

NLB provides static IP per AZ; assign Elastic IPs for fixed addresses

Network Load Balancers automatically provide one static IP address per enabled Availability Zone. For internet-facing NLBs you can also assign your own Elastic IP per AZ, giving external clients fixed addresses to allowlist in firewalls. NLB operates at Layer 4, supports ultra-low latency, and preserves the client source IP address by default.

Route 53 latency routing + Evaluate Target Health = active-active multi-region failover

Latency-based routing records with Evaluate Target Health set to Yes implement active-active failover: all healthy regions serve traffic based on lowest latency, and Route 53 automatically stops routing to a region when its resources become unhealthy. For hierarchical configurations (latency over weighted), ETH on the top-level alias causes Route 53 to traverse the tree and consider the region unhealthy only when all underlying weighted records fail.
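
A simplified model of the selection: unhealthy regions are filtered out first, and the lowest-latency survivor wins. The latency numbers and function name are made up for illustration; Route 53 uses its own pre-measured latency data.

```python
def pick_region(latency_ms: dict, healthy: dict):
    """Return the lowest-latency region among those currently healthy."""
    candidates = {
        region: ms for region, ms in latency_ms.items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None  # no healthy region left to answer with
    return min(candidates, key=candidates.get)


latencies = {"us-east-1": 20, "eu-west-1": 95}   # hypothetical measurements
print(pick_region(latencies, {"us-east-1": True, "eu-west-1": True}))
print(pick_region(latencies, {"us-east-1": False, "eu-west-1": True}))
```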

Route 53 calculated health checks aggregate child health check results

A calculated health check monitors other health checks (child health checks) and reports healthy when the number of healthy children meets a configurable threshold. This lets you trigger DNS failover only when a minimum number of endpoints are down (e.g., healthy if at least 2 of 6 servers are up), rather than reacting to individual endpoint failures.

Route 53 weighted records + health checks implement active-active failover

Any routing policy other than Failover combined with health checks creates an active-active configuration. With weighted records, Route 53 distributes traffic according to weights while all records are healthy; when a record's health check fails, Route 53 excludes it from responses and redistributes remaining traffic to healthy records. A zero-weight record acts as a standby, receiving traffic only when all nonzero-weight records are unhealthy.
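
The behaviour described above can be modelled in a few lines. This is a sketch under two assumptions: healthy zero-weight standbys share traffic equally once they become active, and the case where every record is unhealthy is not modelled. The function name is hypothetical.

```python
def effective_shares(records: dict) -> dict:
    """records maps name -> (weight, healthy); returns the traffic fraction per record."""
    healthy = {name: weight for name, (weight, ok) in records.items() if ok}
    active = {name: weight for name, weight in healthy.items() if weight > 0}
    if not active:
        # All nonzero-weight records are unhealthy: zero-weight standbys take over.
        active = {name: 1 for name in healthy}
    total = sum(active.values())
    return {
        name: (active[name] / total if name in active else 0.0)
        for name in records
    }


records = {"a": (70, True), "b": (30, True), "standby": (0, True)}
print(effective_shares(records))
print(effective_shares({"a": (70, False), "b": (30, False), "standby": (0, True)}))
```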

Route 53 hierarchical routing: latency alias over per-region weighted records

A common multi-tier DNS pattern uses latency alias records at the top level (for region selection) pointing to weighted records within each region (for intra-region distribution). Enabling Evaluate Target Health on the latency alias causes Route 53 to consider a region healthy only if at least one of its weighted child records is healthy, enabling cascading health propagation.

Aurora replica failover priority tiers 0 (highest) to 15 (lowest)

Each Aurora Replica can be assigned a promotion priority tier from 0 (promoted first) to 15 (promoted last). When the primary instance fails, Aurora promotes the replica with the lowest tier number. Assign tier 0 to the preferred standby (e.g., same instance class as the primary) and higher tiers to replicas used for analytics or reporting.
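
A small sketch of the promotion order: lowest tier first and, as the Aurora documentation describes for ties, the larger instance wins within the same tier. The field names (`tier`, `size_rank`) and replica IDs are hypothetical.

```python
def promotion_candidate(replicas: list) -> str:
    """Pick the replica to promote: lowest tier, then largest instance size."""
    best = min(replicas, key=lambda r: (r["tier"], -r["size_rank"]))
    return best["id"]


replicas = [
    {"id": "reporting-replica", "tier": 10, "size_rank": 3},
    {"id": "standby-small", "tier": 0, "size_rank": 1},
    {"id": "standby-large", "tier": 0, "size_rank": 2},
]
print(promotion_candidate(replicas))
```

Here the two tier-0 replicas tie on priority, so the larger one is chosen and the reporting replica is promoted only if both standbys are gone.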

RDS Snapshot copies cross-region + cross-account for DR

RDS automatic snapshots are tied to the source region. Manual or copied snapshots[17] can move cross-region (encrypted with a regional KMS key) or cross-account (share with target account). Used in pilot light / warm standby DR strategies.

References

  1. RDS Multi-AZ deployments
  2. Aurora Global Database
  3. DynamoDB Global Tables
  4. S3 Cross-Region Replication
  5. Elastic Load Balancing overview
  6. Aurora replication
  7. Disaster recovery options in the cloud (AWS whitepaper)
  8. AWS Backup
  9. Route 53 routing policy
  10. Amazon EC2 Auto Scaling User Guide
  11. ALB target-group health checks
  12. ASG target-tracking scaling policies
  13. RDS Read Replicas
  14. Route 53 health check values
  15. ASG health checks for instances
  16. Route 53 DNS failover health checks
  17. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CopySnapshot.html