This 1-day course provides hands-on skills in ingesting, analyzing, transforming, and visualizing data using AWS Athena, and in getting the best performance when using it at scale.
Audience:
This class is intended for data engineers, analysts and data scientists responsible for: analyzing and visualizing big data, implementing cloud-based big data solutions, deploying or migrating big data applications to the public cloud, implementing and maintaining large-scale data storage environments, and transforming/processing big data.
9. [1] Challenges
Organizations struggle to analyze their data without heavy up-front investment and long deployment times
● Significant amount of effort required to analyze data on S3
● Users often have access to only aggregated data sets
● Managing Hadoop or data warehouse requires expertise
10. [1] Introducing AWS Athena
Athena is an interactive query service that makes it easy to
analyze data directly from AWS S3 using Standard SQL
11. [1] AWS Athena Overview
Easy to use
1. Log in to the AWS console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying
12. [1] AWS Athena is Highly Available
High Availability Features
● You connect to a service endpoint or log into a console
● Athena uses warm compute pools across multiple availability zones
● Your data is in Amazon S3 which has 99.999999999% durability
13. [1] Querying Data Directly from Amazon S3
Direct access to your data without hassles
● No loading of data
● No ETL required
● No additional storage required
● Query data in its raw format
14. [1] Use ANSI SQL
Use skills you probably already have
● Start with writing Standard ANSI SQL syntax
● Support for complex joins, nested queries & window functions
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
○ e.g. date, time, custom keys
○ Or a composite key such as customer-year-month-day-hour
15. [1] AWS Athena Overview
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Features:
● Serverless with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format
○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query
Additional Information:
● Not a general purpose database
● Usually used by Data Analysts to run interactive queries over large datasets
● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)
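The pay-per-scan pricing above translates into a simple cost estimate. A minimal Python sketch (the function name and structure are ours; the $5/TB rate and 10 MB per-query minimum come from the slide):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0,
                      min_bytes: int = 10 * 1024 ** 2) -> float:
    """Estimate one query's cost: $5 per TB scanned, 10 MB minimum.

    DDL statements and failed queries are free, so they should not be
    passed through this function at all.
    """
    billable = max(bytes_scanned, min_bytes)  # enforce the 10 MB minimum
    return billable / 1024 ** 4 * price_per_tb

# Scanning 1.15 TB costs $5.75; a tiny scan is still billed for 10 MB.
print(round(athena_query_cost(int(1.15 * 1024 ** 4)), 2))  # 5.75
```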
16. [1] Underlying Technologies
Presto (originating from Facebook)
● Used for SQL queries
● In-memory distributed query engine, ANSI SQL compatible with extensions
Hive (originating from Hadoop project)
● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
20. [2] Interacting with AWS Athena
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share them across the organization for re-use
● Query History
JDBC Driver:
● Programmatic way to access AWS Athena
○ SQL Workbench, JetBrains DataGrip, sqlline
○ Your own app
AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards.
23. [3] Data and Compression Formats
The data formats presently supported are
● CSV
● TSV
● Parquet (Snappy is default compression)
● ORC (Zlib is default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom Delimiters
Compression Formats
● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
● LZO is not supported as of today
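Since GZIP is supported out of the box, a compressed CSV can be uploaded to S3 as-is and queried without any decompression step. A small sketch of producing such a file locally (the file name and rows are illustrative):

```python
import csv
import gzip
import os
import tempfile

# Write a gzip-compressed CSV; Athena reads such files transparently
# once they are uploaded under the table's S3 LOCATION prefix.
path = os.path.join(tempfile.gettempdir(), "trips.csv.gz")
rows = [["v1", "2009-04-12 13:00:00"],
        ["v2", "2009-04-12 13:05:00"]]

with gzip.open(path, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# Round-trip check: it is an ordinary gzip file, nothing Athena-specific.
with gzip.open(path, "rt", newline="") as f:
    rows_back = list(csv.reader(f))
print(rows_back == rows)  # True
```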
24. [3] CSV Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/'
27. [3] RegEx Serde (Apache Log Example)
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
Date DATE, Time STRING, Location STRING,
Bytes INT, RequestIP STRING, Method STRING,
Host STRING, Uri STRING, Status INT, Referrer STRING,
os STRING, Browser STRING, BrowserVersion STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^ ]+)s+([^
]+)s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
LOCATION 's3://athena-examples/cloudfront/plaintext/';
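The same pattern (with the Java-style double backslashes collapsed to single ones in a Python raw string) can be tried locally before creating the table. The log line below is fabricated for illustration, not taken from a real CloudFront log:

```python
import re

# The SerDe's "\\s" (Java string escaping) becomes "\s" in a raw string.
CLOUDFRONT_RE = re.compile(
    r"^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)

# A made-up, space-separated line in the CloudFront access-log layout.
line = ("2014-07-05 20:00:00 FRA2 4260 10.0.0.15 GET "
        "d111111abcdef8.cloudfront.net /images/cat.jpg 200 - "
        "Mozilla/5.0%20(Windows;%20U;%20en-US)%20Gecko/20100625%20Firefox/3.6.5")

m = CLOUDFRONT_RE.match(line)
# Groups 11-13 pull os, browser, and version out of the user agent.
print(m.group(1), m.group(9), m.group(11), m.group(12), m.group(13))
# 2014-07-05 200 Windows Firefox 3.6.5
```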
28. [3] Comparing Formats
PARQUET
● Column-major (columnar) format
● Schema segregated into the footer
● All data pushed to the leaf nodes
● Integrated compression and indexes
● Support for predicate pushdown
ORC
● Apache Top Level Project
● Schema segregation into footer
● Column major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown
30. [3] Converting to Parquet or ORC format
● You can use Hive CTAS to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS SELECT c1, c2, c3, ..., cN FROM noncolumnar_table
SORT BY key
● You can also use Spark to convert the files to Parquet or ORC
● 20 lines of PySpark code running on EMR [1]
○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
○ Approx. cost is $5
[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
31. [3] Pay By the Query ($5 per TB scanned)
● You pay for the amount of data scanned
● Ways to reduce cost:
○ Compress
○ Convert to columnar format
○ Use partitioning
● Free: DDL queries, failed queries
Dataset                | Size on S3 | Query Runtime | Data Scanned | Cost
Logs stored as CSV     | 1 TB       | 237 s         | 1.15 TB      | $5.75
Logs stored as Parquet | 130 GB     | 5.13 s        | 2.69 GB      | $0.013
Savings                | 87% less   | 34x faster    | 99% less     | 99.7% cheaper
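The cost column follows directly from the $5/TB rate. A quick check of the table's arithmetic (numbers taken from the rows above):

```python
PRICE_PER_TB = 5.0  # Athena's $5 per TB scanned

csv_scanned_tb = 1.15      # data scanned by the CSV query (TB)
parquet_scanned_gb = 2.69  # data scanned by the Parquet query (GB)

csv_cost = csv_scanned_tb * PRICE_PER_TB
parquet_cost = parquet_scanned_gb / 1024 * PRICE_PER_TB

# The Parquet query costs well under 0.3% of the CSV query.
print(round(csv_cost, 2), round(parquet_cost, 3))  # 5.75 0.013
```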
34. [4] Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost
Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves performance
● Reduces query cost
● You can partition your data by any key
Common Practice:
● Based on time, often leading with a multi-level partitioning scheme
○ YEAR -> MONTH -> DAY -> HOUR
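A YEAR -> MONTH -> DAY -> HOUR scheme translates into Hive-style key=value prefixes on S3. A small sketch (the year=/month=/day=/hour= naming is one common convention; the function name is ours):

```python
from datetime import datetime

def partition_prefix(ts: datetime) -> str:
    """S3 key prefix for a multi-level YEAR -> MONTH -> DAY -> HOUR partition."""
    return (f"year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/")

# Each object is written under the prefix of the hour it belongs to, so
# queries filtering on these partition columns scan only matching prefixes.
print(partition_prefix(datetime(2009, 4, 12, 13)))
# year=2009/month=04/day=12/hour=13/
```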
35. [4] Data already partitioned and stored on S3
$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
CREATE EXTERNAL TABLE impressions (
... ...)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
-- load partitions into Athena
MSCK REPAIR TABLE impressions;
-- run a sample query
SELECT dt, impressionid FROM impressions
WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
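The WHERE clause above compares dt as a string, which works because the zero-padded YYYY-MM-DD-HH-mm format sorts lexicographically in chronological order. A quick local check using the partition values listed above:

```python
partitions = [
    "2009-04-12-13-00", "2009-04-12-13-05", "2009-04-12-13-10",
    "2009-04-12-13-15", "2009-04-12-13-20",
    "2009-04-12-14-00", "2009-04-12-14-05",
]

# Same predicate as the sample query: dt >= 13:00 and dt < 14:00.
# Zero-padded strings make plain string comparison chronological.
selected = [p for p in partitions
            if "2009-04-12-13-00" <= p < "2009-04-12-14-00"]
print(selected)  # only the five 13:xx partitions are kept
```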
38. [5] Converting to Columnar Formats (batch data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, you can specify a script stored in Amazon S3,
which points to your input data and creates output data in the columnar format in an Amazon S3
location. In this example, the cluster auto-terminates.
39. [5] Converting to Columnar Formats (streaming data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Spark
● Run Spark Streaming Job reading the data from Kinesis Stream and writing Parquet files on S3
41. [6] Athena Security
Amazon offers three ways to control data access:
● AWS Identity and Access Management policies
● Access Control Lists
● Amazon S3 bucket policies
Users control who can access their data on S3. Security can be fine-tuned so that different
people see different sets of data, and access can be granted to other users' data.
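For example, an S3 bucket policy or IAM policy along these lines (the bucket name and prefix are hypothetical) would let a user query only objects under one prefix; Athena queries touching data outside it would fail with access denied:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-data-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-data-bucket/team-a/*"
    }
  ]
}
```

ListBucket applies to the bucket itself while GetObject applies to objects, hence the two separate statements.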
43. [7] Service Limits
You can request a limit increase by contacting AWS Support.
● Currently, you can submit one query at a time and have at most five concurrent
queries per account
● Query timeout: 30 minutes
● Number of databases: 100
● Tables: 100 per database
● Number of partitions: 20k per table
● You may encounter a limit for Amazon S3 buckets per account, which is 100.
44. [7] Known Limitations
The following are known limitations in Amazon Athena
● User-defined functions (UDF or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of
keywords not supported, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.
45. [7] Avoid Surprises
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index` string,
...
For the LOCATION clause, use a trailing slash:
USE
s3://path_to_bucket/
DO NOT USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
47. DoIT International confidential │ Do not distribute
Google BigQuery
• Serverless Analytical Columnar Database based on Google Dremel
• Data:
• Native Tables
• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
• Ingestion:
• File Imports
• Streaming API (up to 100K records/sec per table)
• Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
48. Summary
Feature                | AWS Athena                  | Google BigQuery
Data Formats           | *SV, JSON, PARQUET/z, ORC/z | External (*SV, JSON, AVRO) / Native
ANSI SQL Support       | Yes*                        | Yes*
DDL Support            | Only CREATE/ALTER/DROP      | CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology  | FB Presto                   | Google Dremel
Caching                | No                          | Yes
Cold Data Pricing      | S3 Lifecycle Policy         | 50% discount after 90 days of inactivity
User Defined Functions | No                          | Yes
Data Partitioning      | On any key                  | By DAY
Pricing                | $5/TB scanned plus S3 ops   | $5/TB scanned, less cached data
49. Test Drive Summary
Query Type          | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
[1] LOOKUP          | 48 MB (4.1s)         | 130 GB (2.0s)             | -51%
[2] LOOKUP & AGGR   | 331 MB (4.35s)       | 13.4 GB (2.7s)            | -48%
[3] GROUP/ORDER BY  | 5.74 GB (8.85s)      | 8.26 GB (5.4s)            | -27%
[4] TEXT FUNCTIONS  | 606 MB (11.3s)       | 13.6 GB (2.4s)            | -470%
[5] JSON FUNCTIONS  | 29 MB (17.8s)        | 63.9 GB (8.9s)            | -100%
[6] REGEX FUNCTIONS | (1.3s)               | 5.45 GB (1.9s)            | +31%
[7] FEDERATED DATA  | 133 GB (19.4s)       | 133 GB (36.4s)            | +47%
50. What does Athena do better than BigQuery?
Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use regex to define a schema (query files without needing to change the format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column
Issues:
• It’s not easy to convert data between formats
• Doesn’t support DML, i.e. no INSERT/UPDATE/DELETE
• No built-in ingestion
51. What does BigQuery do better than Athena?
• It has native table support giving it better performance and more features
• It’s easy to manipulate data, insert/update records and write query results back to a table
• Querying native tables is very fast
• Easy to convert non-columnar formats into a native table for columnar queries
• Supports UDFs, which Athena does not yet offer
• Supports nested tables (nested and repeated fields)