3. Find me on LinkedIn
AWS Certifications
Presented by Adam Book
4. What is Athena?
Big Data Offerings Announced at re:Invent 2016
AWS Glue • Amazon Athena • AWS Greengrass
5. What is Athena?
Athena allows you to run interactive SQL queries on S3 data
WITHOUT the need to CREATE, MANAGE, or SCALE
infrastructure
6. About Athena
Athena is currently only in 2 regions:
• US-EAST-1 (Northern Virginia)
• US-WEST-2 (Oregon)
Pricing:
$5 per TB scanned
No charge for failed queries
No charge for Data Definition Language Statements (DDL)
• CREATE / ALTER / DROP TABLE
• Managing partitions
7. What is Athena?
Athena allows you to run interactive SQL queries on S3 data
• S3 data is never modified (data is loaded read-only into memory)
• Cross-region buckets are supported
• You can access Athena via JDBC
WITHOUT the need to CREATE, MANAGE, or SCALE
infrastructure
• You only pay for the queries you run
• Queries execute in parallel, so results are FAST, even with large datasets and complex
queries
8. Athena Queries
Service based on Presto (which is also available on Amazon EMR)
ANSI SQL operators and functions
Table creation uses Apache Hive DDL
• CREATE EXTERNAL TABLE only
• CREATE TABLE AS SELECT is not supported
Unsupported operations:
• User-defined functions (UDFs or UDAFs)
• Stored procedures
• Transactions (Hive or Presto)
• LZO compression (use Snappy instead)
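As a sketch of what this Hive-style DDL looks like in practice (the table name, column names, and S3 path below are illustrative, not from the deck):

```sql
-- Hive-style DDL as accepted by Athena; all names and the bucket path are hypothetical
CREATE EXTERNAL TABLE access_logs (
  request_time string,
  client_ip    string,
  status       int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/logs/';
```

Note the EXTERNAL keyword: Athena only registers metadata over the S3 objects, it never takes ownership of the data.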
9. Running queries on Athena
Run queries straight from the AWS Console
• Use the wizard for schema definition
• Save queries
• Run multiple queries in parallel
10. Running queries on Athena
Run queries from your favorite tool
(SQL Workbench, Agility, Aqua Data Studio)
• Requires the JDBC driver
JDBC 4.1-compatible driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar
Or via AWS CLI
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar [loc
See the Athena documentation for full details
11. Available Athena File Formats
• Apache web logs
• CSV
• TSV
• JSON
• Parquet
• ORC
• Text files with custom delimiters
• AVRO (COMING SOON)
12. Tips on File Formats
Use compressed formats: Snappy, Zlib, GZIP (no LZO)
• Less I/O = better performance and more cost savings
Use Structured / Columnar Formats
• Apache Parquet
• Apache ORC (Optimized Row Columnar)
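Continuing the earlier sketch (table and bucket names are still illustrative), a columnar version of a table can be declared by swapping the row format for STORED AS PARQUET:

```sql
-- Same illustrative schema, backed by Parquet files instead of delimited text;
-- queries touching one column then scan only that column's data
CREATE EXTERNAL TABLE access_logs_parquet (
  request_time string,
  client_ip    string,
  status       int
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/logs-parquet/';
```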
13. The Difference between Hive and Presto
[Diagram: Hive map/reduce stages vs. Presto pipelined tasks]
HIVE: each map and reduce stage writes its data to disk, and each stage waits for the previous one to finish.
PRESTO: all stages are pipelined, with memory-to-memory data transfer:
• No disk I/O
• No wait time between stages
• Each data chunk must fit in memory
• No fault-tolerance
19. Athena Catalog Management
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
For more information, read the documentation page for Catalog Management.
21. Athena Tips and Tricks
You will need some understanding of the structure of your data, or a DDL metastore, before you can start querying with Athena.
When you find one of the allowed formats (CSV, TSV, Parquet, etc.) in your S3 buckets, you need to know what columns it contains so you can create the Athena table before you can start querying.
Avoid surprises when working with Athena by following these tips and tricks
22. Athena Tips and Tricks
Table names that begin with an underscore
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
  `_id` string,
  `_index` string,
  ...
23. Athena Tips and Tricks
For the location clause, use a trailing slash
In the LOCATION clause, use a trailing slash for your folder, NOT filenames or glob characters
Don't USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
USE
s3://path_to_bucket/
24. Athena Tips and Tricks cont.
Athena table names are case insensitive
If you are interacting with Apache Spark, your table column names must be lowercase: Athena is case insensitive, but Spark requires lowercase names.
Athena table names only allow the underscore character
Athena table names cannot contain any special characters besides the underscore (_)
Amazon Athena was one of a few new big data services announced at re:Invent 2016, along with AWS Glue and AWS Greengrass (which was more of an IoT offering).
Amazon Athena is an interactive query service that allows you to query data from Amazon S3, without the need for clusters or data warehouses.
Consider a table with 3 equally sized columns, stored as an uncompressed text file with a total size of 3 TB on Amazon S3. Running a query to get data from a single column of the table requires Amazon Athena to scan the entire file, because text formats can't be split.
This query would cost $15 (3 TB scanned at $5/TB = $15).
If you compress your file using GZIP, you might see 3:1 compression gains. In this case, you would have a compressed file with a size of 1 TB. The same query on this file would cost $5. Athena has to scan the entire file again, but because it's three times smaller in size, you pay one third of what you did before.
Remember that TSV is a tab-delimited format, and although there is no BSV format specifically defined, you could handle that as a text file with custom delimiters.
Regular and nested JSON are also accepted.
Because Athena's pricing model is based on queries and the amount of data scanned, using both compressed and columnar formats can lower costs: the engine can read only the columns it needs, over a compressed set of data, compounding the savings.
Your content is publicly available; however, it's not something that you want scraped and then posted on the internet as public content.
In order for Athena to read the data, you will have to build a temporary table.
If you just want to create the table with a pure SQL statement, you can do so with a command like the following, from either the Athena SQL editor or an editor connected to Athena with the JDBC driver.
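A minimal sketch of such a statement (the table name, columns, and bucket path are all hypothetical):

```sql
-- Illustrative CSV-backed table; replace names and the S3 path with your own
CREATE EXTERNAL TABLE my_csv_data (
  id   string,
  name string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/data/';
```

Once the statement succeeds, the table appears in the catalog and can be queried immediately with standard SELECT statements.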