This document outlines a presentation about building flexible data lake architectures on AWS. It discusses challenges with traditional data warehousing and siloed data sources. A data lake architecture using AWS services like S3, Athena, and Glue is presented as a way to ingest and analyze all data in one centralized location. The presentation also covers real-time analytics platforms using Kinesis and batch vs streaming data processing. It concludes with a demo of visualizing data in S3 using Athena and Amazon QuickSight.
10. Evolution of Data Architectures
1985: Data Warehouse Appliances Benefits
• Consolidated multiple decision support
environments (i.e. databases) into a single
architecture
• Best performance available at time of conception,
hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Shared Storage Tier
(NAS Appliance)
Compute
Node
Compute
Node
Compute
Node
Compute
Node
• Proprietary software license paid per node per
year
• Gold-plated hardware available only from the
vendor with per node per year cost
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the
vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
15. Evolution of Data Architectures
Today: Clusterless Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without
having to provision a cluster or touch infrastructure
• Pay by the query
• Zero Administration
• Process data where it lives
Constraints
• Limited to SQL, Hive and Spark jobs today. More
frameworks to come!
SQL Interface in web
browser
Athena for SQL
S3 Data Lake
Glue for ETL
S3 Data Lake
Spark & Hive Interface in
web browser