2. Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
3. Serverless and Easy to Use
No infrastructure or administration
• Warm compute pools across multiple AZs
• Data in S3, for HA and high durability
Zero spin-up time
• Connect to a service endpoint
• Start querying
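As a rough sketch of "connect to a service endpoint, start querying", the Python below submits a query through the AWS SDK (boto3). The database name, bucket, and helper names are illustrative assumptions, not part of the original slides; only `start_query_execution` and its three parameters come from the boto3 Athena client.

```python
def build_query_request(sql, database, output_location):
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql, database, output_location):
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # third-party AWS SDK, imported lazily so the builder works without it

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_query_request(
    "SELECT 1", "default", "s3://example-bucket/athena-results/"
)
```

There is no cluster to create or start in this flow: the only setup is a client pointed at the service endpoint and an S3 location for results.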
4. Familiar OSS Technology
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
• EXTERNAL tables – no impact on underlying data
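To make the DDL points concrete, here is a minimal sketch that assembles a Hive-style `CREATE EXTERNAL TABLE` statement with complex types and partition columns. The helper, table name, columns, and S3 path are hypothetical; the statement only describes data already in S3 and does not modify it.

```python
def create_external_table_ddl(table, columns, partitions, location,
                              stored_as="PARQUET"):
    """Return a Hive-style CREATE EXTERNAL TABLE statement as a string."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partitions:
        parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS {stored_as}\nLOCATION '{location}'"
    return ddl


# Hypothetical table definition, for illustration only.
ddl = create_external_table_ddl(
    table="web_logs",
    columns=[
        ("request_path", "string"),
        ("tags", "array<string>"),                      # complex type: array
        ("client", "struct<ip:string,agent:string>"),   # complex type: struct
    ],
    partitions=[("year", "int"), ("month", "int")],
    location="s3://example-bucket/web-logs/",
)
print(ddl)
```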
5. ANSI SQL
• Complex joins, nested queries, window functions
• Complex data types (arrays, structs)
• Partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour, Customer Key, Date
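One common way to lay out partitions is the Hive-style `key=value` prefix convention in S3. The sketch below builds such a prefix from a timestamp and a customer key; the bucket, key names, and helper are assumptions chosen for illustration.

```python
from datetime import datetime


def partition_prefix(base, ts, customer_key):
    """Build a Hive-style partitioned S3 prefix from a timestamp and a custom key."""
    return (
        f"{base.rstrip('/')}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
        f"customer_key={customer_key}/"
    )


# Hypothetical bucket and customer key.
prefix = partition_prefix(
    "s3://example-bucket/logs/", datetime(2017, 3, 9, 14), "acme"
)
print(prefix)
# s3://example-bucket/logs/year=2017/month=03/day=09/hour=14/customer_key=acme/
```

A query that filters on `year`, `month`, or `customer_key` can then skip every prefix that does not match, which reduces the data scanned.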
6. Open Data Formats
• Text files, e.g. CSV, TSV, custom delimiter
• Apache Web Logs, CloudTrail logs
• JSON (simple, nested), Apache Avro
• Columnar formats, e.g. Apache Parquet & Apache ORC
• Logstash Grok for unstructured text files
• Compressed files (Snappy, Zlib, GZIP, and LZO)
• Encrypted data (SSE-S3, SSE-KMS, CSE-KMS)
• Use large (128MB – 1GB) compressed files
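As a small illustration of the compressed-file bullet, the sketch below gzips a block of repetitive CSV log lines with Python's standard library and compares sizes. The sample lines are invented; real ratios depend on the data.

```python
import gzip


def gzip_csv(lines):
    """Return (raw_bytes, gzipped_bytes) for a list of CSV lines."""
    raw = "\n".join(lines).encode("utf-8")
    return raw, gzip.compress(raw)


# Invented sample data: one log line repeated many times.
lines = ["2017-01-15,GET,/index.html,200"] * 1000
raw, packed = gzip_csv(lines)
print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

Because pricing is based on bytes scanned, a smaller file on S3 translates directly into a cheaper query.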
7. Pay Per Query – $5 Per TB of Data Scanned
• Ways to save costs: compress, convert to columnar format, use partitioning
• Free: DDL Queries, Failed Queries
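Dataset                               | Size on Amazon S3     | Query run time | Data scanned          | Cost
--------------------------------------|-----------------------|----------------|-----------------------|--------------
Logs stored as text files             | 1 TB                  | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper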
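The cost column follows directly from the per-TB price. A minimal sketch of the arithmetic, assuming 1 TB = 1024 GB (binary units) and ignoring any billing minimums or rounding rules:

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_GB = 1024 ** 3   # assumption: binary units
BYTES_PER_TB = 1024 ** 4


def query_cost_usd(bytes_scanned):
    """Cost of one query at $5 per TB of data scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15 * BYTES_PER_TB)      # text logs: 1.15 TB scanned
parquet_cost = query_cost_usd(2.69 * BYTES_PER_GB)   # Parquet: 2.69 GB scanned

print(f"text:    ${text_cost:.2f}")      # $5.75
print(f"parquet: ${parquet_cost:.3f}")   # $0.013
```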
8. Apache Parquet and Apache ORC
• Columnar formats
• Store data in columns, not rows
• Support for predicate pushdown
• Filter data where it lives
• Schema segregated into footer
• Integrated compression and indexes
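The idea behind columnar storage and predicate pushdown can be shown with a toy model. This is not Parquet's or ORC's actual file layout; it only mimics two properties: values are stored per column, and each row group keeps min/max statistics so a reader can skip groups that cannot match a filter. All names and data are invented.

```python
def to_row_groups(rows, group_size):
    """Split rows into groups, storing each group column by column with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, select, where_col, where_value):
    """Return selected values where where_col == where_value, plus groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        low, high = group["stats"][where_col]
        if not low <= where_value <= high:
            continue  # pushdown: statistics rule this group out, so skip it
        groups_read += 1
        for i, value in enumerate(group["columns"][where_col]):
            if value == where_value:
                matches.append(group["columns"][select][i])  # only touch needed columns
    return matches, groups_read


rows = [
    {"year": 2015, "bytes": 100}, {"year": 2015, "bytes": 200},
    {"year": 2016, "bytes": 300}, {"year": 2016, "bytes": 400},
    {"year": 2017, "bytes": 500}, {"year": 2017, "bytes": 600},
]
groups = to_row_groups(rows, group_size=2)
result, groups_read = scan(groups, select="bytes", where_col="year", where_value=2016)
print(result, groups_read)  # [300, 400] 1
```

Only one of the three row groups is read here, which is the effect "filter data where it lives" describes.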