3. Find me on LinkedIn
AWS Certifications
Presented by Adam Book
4. What is Athena?
Big Data Offerings Announced at re:Invent 2016
AWS Glue • Amazon Athena • AWS Greengrass
5. What is Athena?
Athena allows you to run interactive SQL queries on S3 data
WITHOUT the need to CREATE, MANAGE, or SCALE
infrastructure
6. About Athena
Athena is currently only in 2 regions:
• US-EAST-1 (Northern Virginia)
• US-WEST-2 (Oregon)
Pricing:
$5 per TB scanned
No charge for failed queries
No charge for Data Definition Language Statements (DDL)
• CREATE / ALTER / DROP TABLE
• Managing partitions
7. What is Athena?
Athena allows you to run interactive SQL queries on S3 data
• S3 data is never modified (data is loaded read-only into memory)
• Cross-region buckets are supported
• You can access Athena via JDBC
WITHOUT the need to CREATE, MANAGE, or SCALE
infrastructure
• You only pay for the queries you run
• Queries execute in parallel, so results are FAST, even with large datasets and complex
queries
8. Athena Queries
Service based on Presto (which is also available on Amazon EMR)
ANSI SQL operators and functions
Table creation uses Apache Hive DDL
• CREATE EXTERNAL TABLE only
• CREATE TABLE AS SELECT is not supported
Unsupported operations:
• User-defined functions (UDFs or UDAFs)
• Stored procedures
• Transactions (Hive or Presto)
• LZO compression (use Snappy instead)
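As a sketch of what this Hive-style DDL looks like in practice (the table name, column names, and S3 path below are illustrative, not from the deck):

```sql
-- Hive-style DDL as accepted by Athena; all names and the bucket path are hypothetical
CREATE EXTERNAL TABLE access_logs (
  request_time string,
  client_ip    string,
  status       int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/logs/';
```

Note the EXTERNAL keyword: Athena only registers metadata over the S3 objects, it never takes ownership of the data.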
9. Running queries on Athena
Run queries straight from the AWS Console
• Use the wizard for schema definition
• Save queries
• Run multiple queries in parallel
10. Running queries on Athena
Run queries from your favorite tool
(SQL Workbench, Agility, Aqua Data Studio)
• Requires the JDBC driver
JDBC 4.1-compatible driver: https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar
Or via AWS CLI
aws s3 cp s3://athena-downloads/drivers/AthenaJDBC41-1.0.0.jar [loc
See the Athena documentation for full details
11. Available Athena File Formats
• Apache web logs
• CSV
• TSV
• JSON
• Parquet
• ORC
• Text files with custom delimiters
• AVRO (COMING SOON)
12. Tips on File Formats
Use compressed formats: Snappy, Zlib, GZIP (no LZO)
• Less I/O = better performance and more cost savings
Use Structured / Columnar Formats
• Apache Parquet
• Apache ORC (Optimized Row Columnar)
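Continuing the earlier sketch (table and bucket names are still illustrative), a columnar version of a table can be declared by swapping the row format for STORED AS PARQUET:

```sql
-- Same illustrative schema, backed by Parquet files instead of delimited text;
-- queries touching one column then scan only that column's data
CREATE EXTERNAL TABLE access_logs_parquet (
  request_time string,
  client_ip    string,
  status       int
)
STORED AS PARQUET
LOCATION 's3://my-example-bucket/logs-parquet/';
```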
13. The Difference between Hive and Presto
[Diagram: Hive map/reduce stages vs. Presto pipelined tasks]
HIVE: each map and reduce stage writes its data to disk, and each stage waits for the previous one to finish.
PRESTO: all stages are pipelined, with memory-to-memory data transfer:
• No disk I/O
• No wait time between stages
• Each data chunk must fit in memory
• No fault-tolerance
19. Athena Catalog Management
Amazon Athena uses an internal data catalog to store information and schemas about the databases and tables that you create for your data stored in Amazon S3.
For more information, read the documentation page for Catalog Management.
21. Athena Tips and Tricks
You will need some understanding of the structure of your data, or a DDL metastore, before you can start querying with Athena.
When you find one of the allowed formats (CSV, TSV, Parquet, etc.) in your S3 buckets, you need to know what columns it contains so you can create the Athena table before you can start querying.
Avoid surprises when working with Athena by following these tips and tricks
22. Athena Tips and Tricks
Table names that begin with an underscore
Use backticks if table names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
  `_id` string,
  `_index` string,
  ...
23. Athena Tips and Tricks
For the location clause, use a trailing slash
In the LOCATION clause, use a trailing slash for your folder, NOT filenames or glob characters
Don't USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
USE
s3://path_to_bucket/
24. Athena Tips and Tricks cont.
Athena table names are case insensitive
If you are interacting with Apache Spark, your table column names must be lowercase: Athena is case insensitive, but Spark requires lowercase names.
Athena table names only allow the underscore character
Athena table names cannot contain any special characters besides the underscore (_)
Amazon Athena was one of a few new big data services announced at re:Invent 2016, along with AWS Glue and AWS Greengrass (which was more of an IoT offering).
Amazon Athena is an interactive query service that allows you to query data from Amazon S3, without the need for clusters or data warehouses.
Consider a table with 3 equally sized columns, stored as an uncompressed text file with a total size of 3 TB on Amazon S3. Running a query to get data from a single column of the table requires Amazon Athena to scan the entire file, because text formats can't be split.
This query would cost $15 (3 TB scanned at $5/TB = $15).
If you compress your file using GZIP, you might see 3:1 compression gains. In this case, you would have a compressed file with a size of 1 TB. The same query on this file would cost $5. Athena has to scan the entire file again, but because it's three times smaller in size, you pay one third of what you did before.
Remember that TSV is a tab-delimited format, and although there is no BSV format specifically defined, you could handle that as a text file with custom delimiters.
Regular and nested JSON are also accepted.
Because Athena's pricing model is based on queries and the amount of data scanned, using both compressed and columnar formats can lower costs: the engine can read only the columns it needs, over a compressed set of data, compounding the savings.
Your content is publicly available; however, it's not something that you want scraped and then posted on the internet as public content.
In order for Athena to read the data, you will have to build a temporary table.
If you just want to create the table with a pure SQL statement, you can do so with a command like the following, from either the Athena SQL editor or an editor connected to Athena with the JDBC driver.
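A minimal sketch of such a statement (the table name, columns, and bucket path are all hypothetical):

```sql
-- Illustrative CSV-backed table; replace names and the S3 path with your own
CREATE EXTERNAL TABLE my_csv_data (
  id   string,
  name string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/data/';
```

Once the statement succeeds, the table appears in the catalog and can be queried immediately with standard SELECT statements.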