High-Performing Data Ingestion and Transformation

Included in this chapter:

  • Kinesis Data Streams: shard math, enhanced fan-out, retention tiers
  • Kinesis Firehose: buffer / transformation / format conversion patterns
  • Glue jobs: Spark vs Python, bookmarks, partition projection
  • Athena cost optimization: partitioning + columnar formats + workgroups

Ingestion + transformation services

| Service | Type | Latency | Best for |
|---|---|---|---|
| Kinesis Data Streams | Stream | Sub-second | Ordered events with replay (24h–365d); custom consumers |
| Kinesis Firehose | Stream → store | 60+ s buffer | Fire-and-forget delivery to S3 / Redshift / OpenSearch / Splunk |
| MSK (Managed Kafka) | Stream | Sub-second | Kafka-compatible workloads; existing Kafka tooling |
| AWS Glue | Serverless ETL (Spark + Python) | Minutes | Recurring ETL; schema discovery via crawlers; Glue Data Catalog |
| EMR | Hadoop / Spark / Hive / Presto / Flink cluster | Minutes-hours | Large-scale data processing; non-Glue frameworks |
| Athena | Serverless SQL over S3 | Seconds-minutes | Ad-hoc SQL; per-TB-scanned billing |
| Redshift | Columnar MPP DW | Seconds-minutes | Petabyte analytics warehouse; BI dashboards |
| Lake Formation | Data lake governance | n/a | Centralized fine-grained access on S3 + Glue Catalog |
| DataSync / DMS / Snowball | Transfer | Hours-days | On-prem or other cloud → S3 migration |
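Sizing a provisioned Kinesis Data Streams stream comes down to per-shard arithmetic: each shard accepts up to 1 MB/s (or 1,000 records/s) of writes and serves 2 MB/s of reads, shared across all standard consumers. The sketch below is a rough estimate only. The function name `required_shards`, its parameters, and the 1 MB = 1,024 KB convention are assumptions made for illustration, and it ignores hot partition keys and enhanced fan-out, which gives each registered consumer its own read throughput.

```python
import math

# Per-shard limits for a provisioned stream.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0  # shared by all standard (non-enhanced-fan-out) consumers


def required_shards(records_per_sec, avg_record_kb, standard_consumers=1):
    """Estimate the shard count as the tightest of three per-shard limits.

    Hypothetical helper: assumes evenly distributed partition keys and
    treats 1 MB as 1,024 KB.
    """
    ingest_mb = records_per_sec * avg_record_kb / 1024
    by_write_bytes = math.ceil(ingest_mb / WRITE_MB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    # Every standard consumer reads the full stream from the same 2 MB/s budget.
    by_read = math.ceil(ingest_mb * standard_consumers / READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_count, by_read)


print(required_shards(5000, 2, standard_consumers=1))  # write-bound case
print(required_shards(5000, 2, standard_consumers=3))  # read-bound case
```

With 5,000 records/s at 2 KB each, one standard consumer leaves the stream write-bound at 10 shards, while three standard consumers push the read side to 15 shards. That read-side pressure is the usual signal to consider enhanced fan-out instead of adding shards.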

Cheat sheet

  • Streaming vs batch — pick by latency requirement
  • Pick transformation tool by data shape + size
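  • Kinesis Data Firehose buffers records and flushes when either the buffer interval or the buffer size threshold is reached, whichever comes first
  • Kinesis Data Streams: 1 shard = 1 MB/s (or 1,000 records/s) in, 2 MB/s out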
  • Glue Data Catalog is THE metastore for the lake
  • Lake Formation = fine-grained access control over the lake
  • Glue worker types: G.1X / G.2X / G.4X / G.8X / G.025X
  • DMS: replicate from source to target with optional CDC
  • Snowball Edge for offline bulk transfer; Snowmobile retired
  • IoT Core error action fires when the primary rule action fails — preserves messages
  • IoT Core Basic Ingest bypasses the message broker, so no messaging charges apply (rule actions are still billed)
  • Firehose Lambda transform: return recordId + result status + base64 data
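
The Firehose transformation contract in the last item can be sketched as a Lambda handler. Firehose passes a batch under `event["records"]`, and the function must return one entry per incoming record carrying the same `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and base64-encoded `data`. This is a minimal sketch that assumes JSON object payloads; the `processed` field is a hypothetical enrichment added only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and echo its recordId back."""
    output = []
    for record in event["records"]:
        try:
            # Incoming data is base64-encoded; this sketch assumes JSON objects.
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical enrichment
            # Newline-delimit records so downstream readers can split them.
            encoded = base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8")
            output.append(
                {"recordId": record["recordId"], "result": "Ok", "data": encoded}
            )
        except (ValueError, TypeError):
            # Undecodable or non-object payloads are returned unchanged and
            # flagged so Firehose treats them as failed records.
            output.append(
                {
                    "recordId": record["recordId"],
                    "result": "ProcessingFailed",
                    "data": record["data"],
                }
            )
    return {"records": output}
```

Returning every `recordId` matters: Firehose treats a record missing from the response as a failed transformation. Use `Dropped` rather than omission when a record should be filtered out intentionally.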
