Amazon Athena Workshop
26 January 2017
Agenda
Wi-Fi: DaHouseGuest, Pass: JustDoit!
Feedback Form: goo.gl/T9BZvy
Labs: github.com/doitintl/athena-workshop
Q & A
Breaks: 11:30 | 13:00 - 13:45 | 15:00
Facilities & Organization
DoIT International confidential │ Do not distribute
About us..
Vadim Solovey
CTO
Shahar Frank
Software Engineering Lead
Workshop Agenda
● Module 1
○ Introduction to AWS Athena
○ Demo
● Module 2
○ Interacting with AWS Athena
○ Lab 2
● Module 3
○ Supported Formats and SerDes
○ Lab 3
● Module 4
○ Partitioning Data
○ Lab 4
● Module 5
○ Converting to columnar formats
○ Lab 5
● Module 6
○ Athena Security
● Module 7
○ Service Limits
● Module 8
○ Comparison to Google BigQuery
○ Demo
[1] AWS Athena
[1] Introduction
Understanding Purpose & Use-Cases
[1] Challenges
Organizations struggle to analyze data without heavy investment and long deployment times
● Significant amount of effort required to analyze data on S3
● Users often have access to only aggregated data sets
● Managing Hadoop or data warehouse requires expertise
[1] Introducing AWS Athena
Athena is an interactive query service that makes it easy to
analyze data directly in Amazon S3 using standard SQL
[1] AWS Athena Overview
Easy to use
1. Log in to the console
2. Create a table (either by following a wizard or by typing a Hive DDL statement)
3. Start querying
[1] AWS Athena is Highly Available
High Availability Features
● You connect to a service endpoint or log into a console
● Athena uses warm compute pools across multiple availability zones
● Your data is in Amazon S3 which has 99.999999999% durability
[1] Querying Data Directly from Amazon S3
Direct access to your data without hassles
● No loading of data
● No ETL required
● No additional storage required
● Query of data in raw format
[1] Use ANSI SQL
Use skills you probably already have
● Start by writing standard ANSI SQL
● Support for complex joins, nested queries & window functions
● Support for complex data types (arrays, structs)
● Support for partitioning of data by any key:
○ e.g. date, time, custom keys
○ Or customer-year-month-day-hour
[1] AWS Athena Overview
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Features:
● Serverless, with zero spin-up time and transparent upgrades
● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format
○ AVRO (coming soon)
● Compression is supported out of the box
● Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query
Additional Information:
● Not a general purpose database
● Usually used by Data Analysts to run interactive queries over large datasets
● Currently available in us-east-1 (N. Virginia) and us-west-2 (Oregon)
[1] Underlying Technologies
Presto (originating from Facebook)
● Used for SQL queries
● In-memory distributed query engine, ANSI SQL compatible with extensions
Hive (originating from Hadoop project)
● Used for DDL functionality
● Complex data types
● Multitude of formats
● Supports data partitioning
[1] Presto vs. Hive Architecture
[1] Use Cases
Athena complements Amazon Redshift and Amazon EMR
AWS Athena
[2] Interacting with AWS Athena
Develop, Execute and Visualize Queries
[2] Interacting with AWS Athena
Amazon Athena is a serverless way to query your data that lives on S3 using SQL
Web User Interface:
● Run queries and examine results
● Manage databases and tables
● Save queries and share across organization for re-use
● Query History
JDBC Driver:
● Programmatic way to access AWS Athena
○ SQL Workbench, JetBrains DataGrip, sqlline
○ Your own app
AWS QuickSight:
● Visualize Athena data with charts, pivots and dashboards.
Hands On
Lab 2
Interacting with AWS Athena
Data Formats
[3] Supported Formats and SerDes
Efficient Data Storage
[3] Data and Compression Formats
The data formats presently supported are
● CSV
● TSV
● Parquet (Snappy is default compression)
● ORC (Zlib is default compression)
● JSON
● Apache Web Server logs (RegexSerDe)
● Custom Delimiters
Compression Formats
● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
● LZO is not supported as of today
[3] CSV Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://nyc-yellow-trips/csv/'
[3] Parquet Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS PARQUET
LOCATION 's3://nyc-yellow-trips/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
[3] ORC Example
CREATE EXTERNAL TABLE `mydb.yellow_trips`(
`vendor_id` string,
`pickup_datetime` timestamp,
`dropoff_datetime` timestamp,
`pickup_longitude` float,
`pickup_latitude` float,
`dropoff_longitude` float,
`dropoff_latitude` float,
`................` .....)
STORED AS ORC
LOCATION 's3://nyc-yellow-trips/orc/'
tblproperties ("orc.compress"="ZLIB");
[3] RegexSerDe (CloudFront Log Example)
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE, Time STRING, Location STRING,
Bytes INT, RequestIP STRING, Method STRING,
Host STRING, Uri STRING, Status INT, Referrer STRING,
os STRING, Browser STRING, BrowserVersion STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
LOCATION 's3://athena-examples/cloudfront/plaintext/';
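To see how the capture groups of a RegexSerDe pattern line up with the table columns, it can help to try the pattern on one log line outside Athena. The sketch below does this with Python's `re` module. The sample line is invented for illustration (real CloudFront logs are tab-separated and carry more fields), and the pattern is written with `\s`, which is what Hive sees after unescaping `\\s`.

```python
import re

# Same pattern as "input.regex", after Hive unescapes "\\s" to "\s".
INPUT_REGEX = re.compile(
    r"^(?!#)([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+"
    r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)

# Column names in table order; capture group N feeds column N.
COLUMNS = [
    "date", "time", "location", "bytes", "requestip", "method", "host",
    "uri", "status", "referrer", "os", "browser", "browserversion",
]

# Hypothetical log line, made up for this example.
SAMPLE_LINE = (
    "2014-07-05 20:00:00 LHR3 4260 10.0.0.15 GET "
    "example.cloudfront.net /test-image-1.jpeg 200 - "
    "Mozilla/5.0%20(MacOS;%20U;%20en-US)%20Gecko/2009040821%20IE/3.0.9"
)


def parse_line(line):
    """Return a column -> value dict, or None when the line does not match."""
    match = INPUT_REGEX.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


row = parse_line(SAMPLE_LINE)
print(row["os"], row["browser"], row["browserversion"])

# Header lines start with '#', which the (?!#) lookahead rejects.
print(parse_line("#Version: 1.0"))
```

The last three groups pick the operating system, browser and browser version out of the URL-encoded user agent, which is why the table has more columns than the log line has whitespace-separated fields.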
[3] Comparing Formats
PARQUET
● Columnar format
● Schema segregation into footer
● Column major format
● All data is pushed to the leaf
● Integrated compression and indexes
● Support for predicate pushdown
ORC
● Apache Top Level Project
● Schema segregation into footer
● Column major format with stripes
● Integrated compression, indexes and stats
● Support for predicate pushdown
[3] Converting to Parquet or ORC format
● You can use Hive CTAS to convert data:
CREATE TABLE new_key_value_store
STORED AS PARQUET
AS SELECT c1, c2, c3, .., cN FROM noncolumunartable
SORT BY key
● You can also use Spark to convert the files to Parquet or ORC
● 20 lines of PySpark code running on EMR [1]
○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
○ Approx. cost is $5
[1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
[3] Pay By the Query ($5 per TB scanned)
● You pay for the amount of data scanned
● Ways to reduce cost
○ Compress
○ Convert to columnar format
○ Use partitioning
● Free: DDL queries, failed queries
Dataset                   Size on S3   Query Runtime   Data Scanned   Cost
Logs stored as CSV        1TB          237s            1.15TB         $5.75
Logs stored as PARQUET    130GB        5.13s           2.69GB         $0.013
Savings                   87% less     34x faster      99% less       99.7% cheaper
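The cost column follows directly from the pricing rule stated in this deck: $5 per TB scanned, with a 10 MB minimum per query. A minimal sketch of that arithmetic is below. The function name is made up for illustration, and binary units (1 TB = 1024^4 bytes) are an assumption; it only models successful queries, since DDL and failed queries are free.

```python
# Price and minimum come from the slides; binary units are an assumption.
PRICE_PER_TB_USD = 5.0
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4
MIN_BILLED_BYTES = 10 * MB


def query_cost_usd(bytes_scanned):
    """Estimated charge for one successful query."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / TB * PRICE_PER_TB_USD


# Data scanned for the two rows of the table.
csv_cost = query_cost_usd(1.15 * TB)
parquet_cost = query_cost_usd(2.69 * GB)

print(f"CSV:     ${csv_cost:.2f}")
print(f"Parquet: ${parquet_cost:.3f}")

# A tiny query is still billed as if it scanned 10 MB.
print(query_cost_usd(1024) == query_cost_usd(10 * MB))
```

Because the bill depends only on bytes scanned, compression, columnar formats and partitioning all lower cost in the same way: they reduce the input to this function.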
Hands On
Lab 3
Formats & SerDes
AWS Athena
[4] Partitioning Data
To improve performance and reduce cost
[4] Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost
Benefits of Data Partitioning:
● Partitions limit the scope of data being scanned during the query
● Improves Performance
● Reduce query cost
● You can partition your data by any key
Common Practice:
● Based on time, often with a multi-level partitioning scheme
○ YEAR -> MONTH -> DAY -> HOUR
[4] Data already partitioned and stored on S3
$ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
CREATE EXTERNAL TABLE impressions (
... ...)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;
-- Load partitions into Athena
MSCK REPAIR TABLE impressions;
-- Run a sample query
SELECT dt, impressionid FROM impressions WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00';
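The sample query filters on `dt`, which is a string partition key, so the range is compared lexicographically. That works here because the values are zero-padded and fixed-width. The sketch below applies the same comparison to the prefixes listed above to show which partitions would need to be read; it illustrates the idea only and is not how Athena's planner is implemented.

```python
# Partition values as listed under the impressions/ prefix.
PARTITIONS = [
    "2009-04-12-13-00",
    "2009-04-12-13-05",
    "2009-04-12-13-10",
    "2009-04-12-13-15",
    "2009-04-12-13-20",
    "2009-04-12-14-00",
    "2009-04-12-14-05",
]


def prune(partitions, lower_inclusive, upper_exclusive):
    """Keep partitions with lower <= dt < upper, compared as strings."""
    return [dt for dt in partitions if lower_inclusive <= dt < upper_exclusive]


selected = prune(PARTITIONS, "2009-04-12-13-00", "2009-04-12-14-00")
print(f"{len(selected)} of {len(PARTITIONS)} partitions scanned")
for dt in selected:
    print(f"dt={dt}/")
```

Only the objects under the selected prefixes count toward the data scanned, which is where the cost and performance benefit of partitioning comes from.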
[4] Data is not partitioned in Hive format
aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';
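When the S3 layout is `YYYY/MM/DD/` rather than `key=value`, each partition has to be added explicitly, which quickly becomes repetitive. One way to handle this is to generate the statements. The sketch below is a hypothetical helper (not part of the workshop labs) that prints one `ALTER TABLE ... ADD PARTITION` statement per day, reusing the table name and bucket path from the example above.

```python
from datetime import date, timedelta

TABLE = "elb_logs_raw_native_part"
BASE_LOCATION = "s3://athena-examples/elb/plaintext/"


def add_partition_statements(start, end, table=TABLE, base=BASE_LOCATION):
    """Yield one ALTER TABLE statement per day in [start, end]."""
    day = start
    while day <= end:
        y, m, d = f"{day:%Y}", f"{day:%m}", f"{day:%d}"
        yield (
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{y}',month='{m}',day='{d}') "
            f"LOCATION '{base}{y}/{m}/{d}/';"
        )
        day += timedelta(days=1)


statements = list(add_partition_statements(date(2015, 1, 1), date(2015, 1, 2)))
for statement in statements:
    print(statement)
```

The generated statements can then be run one at a time from the console or over JDBC.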
[5] AWS Athena
[5] Converting to Columnar Formats
Apache Parquet & ORC
[5] Converting to Columnar Formats (batch data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Hive installed.
● In the step section of the cluster create statement, you can specify a script stored in Amazon S3,
which points to your input data and creates output data in the columnar format in an Amazon S3
location. In this example, the cluster auto-terminates.
[5] Converting to Columnar Formats (streaming data)
Your Amazon Athena query performance improves if you convert your data into open source columnar
formats such as Apache Parquet or ORC.
The process for converting to columnar formats using an EMR cluster is as follows:
● Create an EMR cluster with Spark installed.
● Run a Spark Streaming job that reads the data from a Kinesis stream and writes Parquet files to S3.
AWS Athena
[6] Athena Security
Authorization and Access
[6] Athena Security
Amazon offers three ways to control data access:
● AWS Identity and Access Management policies
● Access Control Lists
● Amazon S3 bucket policies
Users control who can access their data on S3. It's possible to fine-tune security so that different
people see different sets of data, and to grant access to other users' data.
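As an illustration of the bucket-policy option, the sketch below builds a policy document that lets one principal list and read a single prefix. The bucket name, prefix and principal ARN are placeholders, and the helper function is hypothetical. Treat it as a starting point only: running Athena queries may require further permissions (for example on the query results location) that are not shown here.

```python
import json


def read_only_prefix_policy(bucket, prefix, principal_arn):
    """Build an S3 bucket policy granting list and read access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {
                "Sid": "ReadPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder values, not taken from the workshop.
policy = read_only_prefix_policy(
    bucket="example-bucket",
    prefix="logs/",
    principal_arn="arn:aws:iam::111122223333:user/analyst",
)
print(json.dumps(policy, indent=2))
```

Scoping each analyst to a different prefix is one way to let different people see different sets of data in the same bucket.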
AWS Athena
[7] Service Limits
Know your limits and mitigate the risk
[7] Service Limits
You can request a limit increase by contacting AWS Support.
● Currently, you can submit only one query at a time and run at most 5 (five) concurrent queries per account.
● Query timeout: 30 minutes
● Number of databases: 100
● Number of tables: 100 per database
● Number of partitions: 20,000 per table
● You may encounter a limit for Amazon S3 buckets per account, which is 100.
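The partition limit is worth checking against the partitioning scheme before settling on it. The sketch below does the arithmetic for a table partitioned purely by time, assuming 365 days per year and no additional partition keys (a second key such as customer would multiply the count).

```python
PARTITION_LIMIT = 20_000  # partitions per table, from the list above


def partitions_per_year(granularity):
    """Partitions created per year for a time-only partitioning scheme."""
    per_day = {"day": 1, "hour": 24}[granularity]
    return 365 * per_day


def years_until_limit(granularity, limit=PARTITION_LIMIT):
    """How many years of data fit before the table reaches the limit."""
    return limit / partitions_per_year(granularity)


for granularity in ("day", "hour"):
    print(
        granularity,
        partitions_per_year(granularity),
        round(years_until_limit(granularity), 1),
    )
```

Under these assumptions, hourly partitions reach the limit after a little over two years of data, while daily partitions leave decades of headroom.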
[7] Known Limitations
The following are known limitations in Amazon Athena
● User-defined functions (UDF or UDAFs) are not supported.
● Stored procedures are not supported.
● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of
keywords not supported, see Unsupported DDL.
● LZO is not supported. Use Snappy instead.
[7] Avoid Surprises
Use backticks if table or column names begin with an underscore. For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index` string,
...
For the LOCATION clause, use a trailing slash
USE
s3://path_to_bucket/
DO NOT USE
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
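The LOCATION rules above are easy to check before submitting a DDL statement. The sketch below is a hypothetical local lint, not Athena's own validation: it flags a missing `s3://` scheme, wildcards, and a missing trailing slash (which also catches paths that point at a single file).

```python
def location_problems(location):
    """Return a list of problems with a LOCATION value; empty means OK."""
    problems = []
    if not location.startswith("s3://"):
        problems.append("must start with s3://")
    if "*" in location:
        problems.append("wildcards are not allowed")
    if not location.endswith("/"):
        problems.append("must end with a trailing slash")
    return problems


EXAMPLES = [
    "s3://path_to_bucket/",
    "s3://path_to_bucket",
    "s3://path_to_bucket/*",
    "s3://path_to_bucket/mySpecialFile.dat",
]

for location in EXAMPLES:
    problems = location_problems(location)
    print(location, "->", "; ".join(problems) if problems else "OK")
```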
AWS Athena
[8] Comparing to Google BigQuery
Know your limits and mitigate the risk
Google BigQuery
• Serverless Analytical Columnar Database based on Google Dremel
• Data:
• Native Tables
• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
• Ingestion:
• File Imports
• Streaming API (up to 100K records/sec per table)
• Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
• ANSI SQL 2011
• Priced at $5/TB of scanned data + storage + streaming (if used)
• Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
Summary
Feature                  AWS Athena                       Google BigQuery
Data Formats             *SV, JSON, PARQUET/z, ORC/z      External (*SV, JSON, AVRO) / Native
ANSI SQL Support         Yes*                             Yes*
DDL/DML Support          Only CREATE/ALTER/DROP           CREATE/UPDATE/DELETE (w/ quotas)
Underlying Technology    FB Presto                        Google Dremel
Caching                  No                               Yes
Cold Data Pricing        S3 Lifecycle Policy              50% discount after 90 days of inactivity
User Defined Functions   No                               Yes
Data Partitioning        On Any Key                       By DAY
Pricing                  $5/TB (scanned) plus S3 ops      $5/TB (scanned) less cached data
Test Drive Summary
Query Type            AWS Athena (scanned / time)   Google BigQuery (scanned / time)   Time diff %
[1] LOOKUP            48MB (4.1s)                   130GB (2.0s)                       -51%
[2] LOOKUP & AGGR     331MB (4.35s)                 13.4GB (2.7s)                      -48%
[3] GROUP/ORDER BY    5.74GB (8.85s)                8.26GB (5.4s)                      -27%
[4] TEXT FUNCTIONS    606MB (11.3s)                 13.6GB (2.4s)                      -470%
[5] JSON FUNCTIONS    29MB (17.8s)                  63.9GB (8.9s)                      -100%
[6] REGEX FUNCTIONS   (1.3s)                        5.45GB (1.9s)                      +31%
[7] FEDERATED DATA    133GB (19.4s)                 133GB (36.4s)                      +47%
What does Athena do better than BigQuery?
Advantages:
• Can be faster than BigQuery, especially with federated/external tables
• Ability to use regex to define a schema (query files without needing to change the format)
• Can be faster and cheaper than BigQuery when using a partitioned/columnar format
• Tables can be partitioned on any column
Issues:
• It’s not easy to convert data between formats
• Doesn’t support DML, i.e. no insert/update/delete
• No built-in ingestion
What does BigQuery do better than Athena?
• It has native table support giving it better performance and more features
• It’s easy to manipulate data, insert/update records and write query results back to a table
• Querying native tables is very fast
• Easy to convert non-columnar formats into a native table for columnar queries
• Supports UDFs, although they will be available in the future for Athena
• Supports nested tables (nested and repeated fields)
Remember to complete
your evaluations ;-)
https://goo.gl/T9BZvy

Weitere ähnliche Inhalte

Was ist angesagt?

마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
Amazon Web Services Korea
 
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
Simplilearn
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Daniel Zivkovic
 
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
Amazon Web Services Korea
 

Was ist angesagt? (20)

Azure API Management
Azure API ManagementAzure API Management
Azure API Management
 
02 api gateway
02 api gateway02 api gateway
02 api gateway
 
05. 마이크로서비스 아키텍처 환경에서의 SSO 구축방안
05. 마이크로서비스 아키텍처 환경에서의 SSO 구축방안05. 마이크로서비스 아키텍처 환경에서의 SSO 구축방안
05. 마이크로서비스 아키텍처 환경에서의 SSO 구축방안
 
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
마이크로서비스를 위한 AWS 아키텍처 패턴 및 모범 사례 - AWS Summit Seoul 2017
 
AWS July Webinar Series: Overview: Build and Manage your APIs with Amazon API...
AWS July Webinar Series: Overview: Build and Manage your APIs with Amazon API...AWS July Webinar Series: Overview: Build and Manage your APIs with Amazon API...
AWS July Webinar Series: Overview: Build and Manage your APIs with Amazon API...
 
AWS API Gateway
AWS API GatewayAWS API Gateway
AWS API Gateway
 
How To Run Your Containers on AWS with ECS & Fargate: Collision 2018
How To Run Your Containers on AWS with ECS & Fargate: Collision 2018How To Run Your Containers on AWS with ECS & Fargate: Collision 2018
How To Run Your Containers on AWS with ECS & Fargate: Collision 2018
 
Grafana vs Kibana
Grafana vs KibanaGrafana vs Kibana
Grafana vs Kibana
 
Build and Manage Your APIs with Amazon API Gateway
Build and Manage Your APIs with Amazon API GatewayBuild and Manage Your APIs with Amazon API Gateway
Build and Manage Your APIs with Amazon API Gateway
 
AWS를 통한 데이터 분석 및 처리의 새로운 혁신 기법 - 김윤건, AWS사업개발 담당:: AWS Summit Online Korea 2020
AWS를 통한 데이터 분석 및 처리의 새로운 혁신 기법 - 김윤건, AWS사업개발 담당::  AWS Summit Online Korea 2020AWS를 통한 데이터 분석 및 처리의 새로운 혁신 기법 - 김윤건, AWS사업개발 담당::  AWS Summit Online Korea 2020
AWS를 통한 데이터 분석 및 처리의 새로운 혁신 기법 - 김윤건, AWS사업개발 담당:: AWS Summit Online Korea 2020
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Azure vidyapeeth -Introduction to Azure Container Service & Registry Service
Azure vidyapeeth -Introduction to Azure Container Service & Registry ServiceAzure vidyapeeth -Introduction to Azure Container Service & Registry Service
Azure vidyapeeth -Introduction to Azure Container Service & Registry Service
 
Microservices & API Gateways
Microservices & API Gateways Microservices & API Gateways
Microservices & API Gateways
 
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
Chef vs Puppet vs Ansible vs Saltstack | Configuration Management Tools | Dev...
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
Developer Experience on AWS
Developer Experience on AWSDeveloper Experience on AWS
Developer Experience on AWS
 
Prometheus Storage
Prometheus StoragePrometheus Storage
Prometheus Storage
 
Robot framework Gowthami Goli
Robot framework Gowthami GoliRobot framework Gowthami Goli
Robot framework Gowthami Goli
 
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
Amazon SNS로 지속적 관리가 가능한 대용량 푸쉬 시스템 구축 여정 - AWS Summit Seoul 2017
 
How to migrate an application in IBM APIc, and preserve its client credential
How to migrate an application in IBM APIc, and preserve its client credentialHow to migrate an application in IBM APIc, and preserve its client credential
How to migrate an application in IBM APIc, and preserve its client credential
 

Andere mochten auch

Ensayo blogger def 2
Ensayo blogger def 2Ensayo blogger def 2
Ensayo blogger def 2
AldoMaGe
 
Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"
sveta7940
 

Andere mochten auch (16)

Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
Webinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph DatabasesWebinar: Fighting Fraud with Graph Databases
Webinar: Fighting Fraud with Graph Databases
 
2015 Internet Trends Report
2015 Internet Trends Report2015 Internet Trends Report
2015 Internet Trends Report
 
K8S in prod
K8S in prodK8S in prod
K8S in prod
 
AWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon AthenaAWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon Athena
 
AWS Cyber Security Best Practices
AWS Cyber Security Best PracticesAWS Cyber Security Best Practices
AWS Cyber Security Best Practices
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستيرالفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
الفيلم أداة للتدريس - التجربة الشخصية أثناء دراسة الماجستير
 
Superfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the worldSuperfunds Magazine - Ready to take on the world
Superfunds Magazine - Ready to take on the world
 
Ensayo blogger def 2
Ensayo blogger def 2Ensayo blogger def 2
Ensayo blogger def 2
 
Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"Посібник "Конспекти уроків у 1 семестрі"
Посібник "Конспекти уроків у 1 семестрі"
 

Ähnlich wie Amazon Athena Hands-On Workshop

(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 

Ähnlich wie Amazon Athena Hands-On Workshop (20)

NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
 
Cloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure toolsCloud architectural patterns and Microsoft Azure tools
Cloud architectural patterns and Microsoft Azure tools
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Taking SharePoint to the Cloud
Taking SharePoint to the CloudTaking SharePoint to the Cloud
Taking SharePoint to the Cloud
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Map Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
Map Services on Amazon AWS, Microsoft Azure and Google Cloud PlatformMap Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
Map Services on Amazon AWS, Microsoft Azure and Google Cloud Platform
 

Mehr von DoiT International

Mehr von DoiT International (15)

Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules Restructured
 
GAN training with Tensorflow and Tensor Cores
GAN training with Tensorflow and Tensor CoresGAN training with Tensorflow and Tensor Cores
GAN training with Tensorflow and Tensor Cores
 
Orchestrating Redis & K8s Operators
Orchestrating Redis & K8s OperatorsOrchestrating Redis & K8s Operators
Orchestrating Redis & K8s Operators
 
K8s best practices from the field!
K8s best practices from the field!K8s best practices from the field!
K8s best practices from the field!
 
An Open-Source Platform to Connect, Manage, and Secure Microservices
An Open-Source Platform to Connect, Manage, and Secure MicroservicesAn Open-Source Platform to Connect, Manage, and Secure Microservices
An Open-Source Platform to Connect, Manage, and Secure Microservices
 
Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?Is your Elastic Cluster Stable and Production Ready?
Is your Elastic Cluster Stable and Production Ready?
 
Applying ML for Log Analysis
Applying ML for Log AnalysisApplying ML for Log Analysis
Applying ML for Log Analysis
 
GCP for AWS Professionals
GCP for AWS ProfessionalsGCP for AWS Professionals
GCP for AWS Professionals
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Running Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWSRunning Production-Grade Kubernetes on AWS
Running Production-Grade Kubernetes on AWS
 
Scaling Jenkins with Kubernetes by Ami Mahloof
Scaling Jenkins with Kubernetes by Ami MahloofScaling Jenkins with Kubernetes by Ami Mahloof
Scaling Jenkins with Kubernetes by Ami Mahloof
 
CI Implementation with Kubernetes at LivePerson by Saar Demri
CI Implementation with Kubernetes at LivePerson by Saar DemriCI Implementation with Kubernetes at LivePerson by Saar Demri
CI Implementation with Kubernetes at LivePerson by Saar Demri
 
Kubernetes @ Nanit by Chen Fisher
Kubernetes @ Nanit by Chen FisherKubernetes @ Nanit by Chen Fisher
Kubernetes @ Nanit by Chen Fisher
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)Kubernetes - State of the Union (Q1-2016)
Kubernetes - State of the Union (Q1-2016)
 

Amazon Athena Hands-On Workshop

  • 2. Agenda: Facilities & Organization
      ● Wi-Fi: DaHouseGuest, Pass: JustDoit!
      ● Feedback Form: goo.gl/T9BZvy
      ● Labs: github.com/doitintl/athena-workshop
      ● Q & A
      ● Breaks: 11:30 | 13:00 - 13:45 | 15:00
  • 3. About us (DoIT International confidential │ Do not distribute)
      ● Vadim Solovey, CTO
      ● Shahar Frank, Software Engineering Lead
  • 7. Workshop Agenda
      ● Module 1
          ○ Introduction to AWS Athena
          ○ Demo
      ● Module 2
          ○ Interacting with AWS Athena
          ○ Lab 2
      ● Module 3
          ○ Supported Formats and SerDes
          ○ Lab 3
      ● Module 4
          ○ Partitioning Data
          ○ Lab 4
      ● Module 5
          ○ Converting to columnar formats
          ○ Lab 5
      ● Module 6
          ○ Athena Security
      ● Module 7
          ○ Service Limits
      ● Module 8
          ○ Comparison to Google BigQuery
          ○ Demo
  • 8. [1] AWS Athena [1] Introduction Understanding Purpose & Use-Cases
  • 9. [1] Challenges
      Organizations struggle to analyze data without heavy investment and long deployment times
      ● Significant effort is required to analyze data on S3
      ● Users often have access to only aggregated data sets
      ● Managing Hadoop or a data warehouse requires expertise
  • 10. [1] Introducing AWS Athena Athena is an interactive query service that makes it easy to analyze data directly from AWS S3 using Standard SQL
  • 11. [1] AWS Athena Overview
      Easy to use
      1. Log in to the console
      2. Create a table (either by following a wizard or by typing a Hive DDL statement)
      3. Start querying
  • 12. [1] AWS Athena is Highly Available
      High Availability Features
      ● You connect to a service endpoint or log into a console
      ● Athena uses warm compute pools across multiple availability zones
      ● Your data is in Amazon S3, which has 99.999999999% durability
  • 13. [1] Querying Data Directly from Amazon S3
      Direct access to your data without hassles
      ● No loading of data
      ● No ETL required
      ● No additional storage required
      ● Query of data in raw format
  • 14. [1] Use ANSI SQL
      Use of skills you probably already have
      ● Start with writing Standard ANSI SQL syntax
      ● Support for complex joins, nested queries & window functions
      ● Support for complex data types (arrays, structs)
      ● Support for partitioning of data by any key:
          ○ e.g. date, time, custom keys
          ○ Or customer-year-month-day-hour
  • 15. [1] AWS Athena Overview
      Amazon Athena is a serverless way to query data that lives on S3 using SQL
      Features:
      ● Serverless, with zero spin-up time and transparent upgrades
      ● Data can be stored in CSV, JSON, ORC, Parquet and even Apache web log format
          ○ AVRO (coming soon)
      ● Compression is supported out of the box
      ● Queries cost $5 per terabyte of data scanned, with a 10 MB minimum per query
      Additional Information:
      ● Not a general purpose database
      ● Usually used by Data Analysts to run interactive queries over large datasets
      ● Currently available in us-east-1 (North Virginia) and us-west-2 (Oregon)
  • 16. [1] Underlying Technologies
      Presto (originating from Facebook)
      ● Used for SQL queries
      ● In-memory distributed querying engine
      ● ANSI SQL compatible with extensions
      Hive (originating from Hadoop project)
      ● Used for DDL functionality
      ● Complex data types
      ● Multitude of formats
      ● Supports data partitioning
  • 17. [1] Presto vs. Hive Architecture
  • 18. [1] Use Cases Athena complements Amazon Redshift and Amazon EMR
  • 19. AWS Athena [2] Interacting with AWS Athena Develop, Execute and Visualize Queries
  • 20. [2] Interacting with AWS Athena
      Amazon Athena is a serverless way to query data that lives on S3 using SQL
      Web User Interface:
      ● Run queries and examine results
      ● Manage databases and tables
      ● Save queries and share them across the organization for re-use
      ● Query History
      JDBC Driver:
      ● Programmatic way to access AWS Athena
          ○ SQL Workbench, JetBrains DataGrip, sqlline
          ○ Your own app
      AWS QuickSight:
      ● Visualize Athena data with charts, pivots and dashboards.
  • 21. Hands On Lab 2 Interacting with AWS Athena
  • 22. Data Formats [3] Supported Formats and SerDes Efficient Data Storage
  • 23. [3] Data and Compression Formats
      The data formats presently supported are
      ● CSV
      ● TSV
      ● Parquet (Snappy is default compression)
      ● ORC (Zlib is default compression)
      ● JSON
      ● Apache Web Server logs (RegexSerDe)
      ● Custom Delimiters
      Compression Formats
      ● Currently, Snappy, Zlib, and GZIP are the supported compression formats.
      ● LZO is not supported as of today
  • 24. [3] CSV Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        ESCAPED BY '\\'
        LINES TERMINATED BY '\n'
      LOCATION 's3://nyc-yellow-trips/csv/'
  • 25. [3] Parquet Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      STORED AS PARQUET
      LOCATION 's3://nyc-yellow-trips/parquet/'
      tblproperties ("parquet.compress"="SNAPPY");
  • 26. [3] ORC Example
      CREATE EXTERNAL TABLE `mydb.yellow_trips`(
        `vendor_id` string,
        `pickup_datetime` timestamp,
        `dropoff_datetime` timestamp,
        `pickup_longitude` float,
        `pickup_latitude` float,
        `dropoff_longitude` float,
        `dropoff_latitude` float,
        `................` .....)
      STORED AS ORC
      LOCATION 's3://nyc-yellow-trips/orc/'
      tblproperties ("orc.compress"="ZLIB");
  • 27. [3] RegEx SerDe (Apache Log Example)
      CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
        Date DATE,
        Time STRING,
        Location STRING,
        Bytes INT,
        RequestIP STRING,
        Method STRING,
        Host STRING,
        Uri STRING,
        Status INT,
        Referrer STRING,
        os STRING,
        Browser STRING,
        BrowserVersion STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$")
      LOCATION 's3://athena-examples/cloudfront/plaintext/';
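To see how the RegexSerDe capture groups line up with the 13 table columns, the same pattern can be tried locally. This is a minimal sketch, not part of the workshop labs: it uses Python's `re` module rather than Hive, the helper `parse_line` is hypothetical, and the sample log line is made up for illustration.

```python
import re

# Same pattern as the slide's "input.regex"; the ten leading
# whitespace-separated fields are written as one repeated unit.
PATTERN = re.compile(
    r"^(?!#)" + r"([^ ]+)\s+" * 10 + r"[^(]+[(]([^;]+).*%20([^/]+)[/](.*)$"
)

# Column names in the order of the CREATE TABLE statement (lowercased).
COLUMNS = [
    "date", "time", "location", "bytes", "requestip", "method", "host",
    "uri", "status", "referrer", "os", "browser", "browserversion",
]


def parse_line(line):
    """Return a column -> value dict, or None when the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical tab-separated log line, invented for this example.
SAMPLE = "\t".join([
    "2014-07-05", "20:00:00", "LHR3", "4260", "10.0.0.15", "GET",
    "example.cloudfront.net", "/test-image-1.jpeg", "200", "-",
    "Mozilla/5.0%20(MacOS;%20U;%20rv:1.9)%20Gecko/2009040821%20IE/3.0.9",
])

row = parse_line(SAMPLE)
print(row["os"], row["browser"], row["browserversion"])

# Lines starting with '#' (log headers) are rejected by the (?!#) lookahead.
print(parse_line("#Version: 1.0"))
```

Each capture group becomes one column, in order, so the number of groups in the regex has to equal the number of columns in the table definition.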
  • 28. [3] Comparing Formats
      PARQUET
      ● Columnar format
      ● Schema segregation into footer
      ● Column major format
      ● All data is pushed to the leaf
      ● Integrated compression and indexes
      ● Support for predicate pushdown
      ORC
      ● Apache Top Level Project
      ● Schema segregation into footer
      ● Column major format with stripes
      ● Integrated compression and indexes and stats
      ● Support for predicate pushdown
  • 30. [3] Converting to Parquet or ORC format
      ● You can use Hive CTAS to convert data:
          CREATE TABLE new_key_value_store
          STORED AS PARQUET
          AS SELECT c1, c2, c3, .., cN
          FROM noncolumunartable
          SORT BY key
      ● You can also use Spark to convert the files to Parquet or ORC
      ● 20 lines of PySpark code running on EMR [1]
          ○ Converts 1TB of text data into 130GB of Parquet with Snappy compression
          ○ Approx. cost is $5
      [1] https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 31. [3] Pay By the Query ($5 per TB scanned)
      ● You are paying by the amount of scanned data
      ● Means to save on cost
          ○ Compress
          ○ Convert to columnar format
          ○ Use partitioning
      ● Free: DDL queries, failed queries

      Dataset                | Size on S3 | Query Runtime | Data Scanned | Cost
      Logs stored as CSV     | 1TB        | 237s          | 1.15TB       | $5.75
      Logs stored as PARQUET | 130GB      | 5.13s         | 2.69GB       | $0.013
      Savings                | 87% less   | 34x faster    | 99% less     | 99.7% cheaper
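The cost column in the table above follows directly from the $5 per TB price and the 10 MB per-query minimum mentioned earlier in the deck. The sketch below reproduces the arithmetic; the function name `query_cost_usd` is hypothetical, and binary units (1 TB = 1024^4 bytes) are an assumption, since the slides do not say which units are billed.

```python
PRICE_PER_TB_USD = 5.0               # from the slides: $5 per TB scanned
MIN_BYTES_PER_QUERY = 10 * 1024**2   # 10 MB minimum per query (binary units assumed)
BYTES_PER_TB = 1024**4               # assumption: 1 TB = 1024^4 bytes


def query_cost_usd(bytes_scanned):
    """Estimate the cost of one successful query from the bytes it scanned."""
    billed = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD


csv_cost = query_cost_usd(1.15 * 1024**4)      # 1.15 TB scanned (CSV row)
parquet_cost = query_cost_usd(2.69 * 1024**3)  # 2.69 GB scanned (Parquet row)

print(f"CSV:     ${csv_cost:.2f}")
print(f"Parquet: ${parquet_cost:.3f}")
print(f"Saving:  {100 * (1 - parquet_cost / csv_cost):.1f}%")
```

Because only scanned bytes are billed, anything that shrinks the scan (compression, columnar layout, partition pruning) lowers the bill in the same proportion.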
  • 33. AWS Athena [4] Partitioning Data To improve performance and reduce cost
  • 34. [4] Partitioning Data
      By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost
      Benefits of Data Partitioning:
      ● Partitions limit the scope of data being scanned during the query
      ● Improves performance
      ● Reduces query cost
      ● You can partition your data by any key
      Common Practice:
      ● Based on time, often leading to a multi-level partitioning scheme
          ○ YEAR -> MONTH -> DAY -> HOUR
  • 35. [4] Data already partitioned and stored on S3
      $ aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
          PRE dt=2009-04-12-13-00/
          PRE dt=2009-04-12-13-05/
          PRE dt=2009-04-12-13-10/
          PRE dt=2009-04-12-13-15/
          PRE dt=2009-04-12-13-20/
          PRE dt=2009-04-12-14-00/
          PRE dt=2009-04-12-14-05/

      CREATE EXTERNAL TABLE impressions (
        ...
        ...)
      PARTITIONED BY (dt string)
      ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
      LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/';

      -- Load partitions into Athena
      MSCK REPAIR TABLE impressions;

      -- Run sample query
      SELECT dt, impressionid
      FROM impressions
      WHERE dt < '2009-04-12-14-00' AND dt >= '2009-04-12-13-00'
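The WHERE clause in the sample query only touches partitions whose `dt` value falls in the requested range. The following sketch imitates that pruning step on the prefixes from the listing, assuming plain string comparison on the `dt` values (which works here because the timestamps are zero-padded). The function `prune` is a hypothetical helper, not an Athena API.

```python
# Partition prefixes copied from the S3 listing on the slide.
PARTITIONS = [
    "dt=2009-04-12-13-00/",
    "dt=2009-04-12-13-05/",
    "dt=2009-04-12-13-10/",
    "dt=2009-04-12-13-15/",
    "dt=2009-04-12-13-20/",
    "dt=2009-04-12-14-00/",
    "dt=2009-04-12-14-05/",
]


def prune(partitions, lower, upper):
    """Keep partitions with lower <= dt < upper, compared as strings."""
    selected = []
    for prefix in partitions:
        key, _, value = prefix.rstrip("/").partition("=")
        if key == "dt" and lower <= value < upper:
            selected.append(prefix)
    return selected


scanned = prune(PARTITIONS, "2009-04-12-13-00", "2009-04-12-14-00")
print(f"{len(scanned)} of {len(PARTITIONS)} partitions scanned:")
for prefix in scanned:
    print("  " + prefix)
```

Data under the skipped prefixes is never read, which is where both the speed-up and the cost reduction come from.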
  • 36. [4] Data is not partitioned
      $ aws s3 ls s3://athena-examples/elb/plaintext/ --recursive
      2016-11-23 17:54:46 11789573 elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 8776899 elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 9309800 elb/plaintext/2015/01/01/part-r-00002-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 9412570 elb/plaintext/2015/01/01/part-r-00003-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 10725938 elb/plaintext/2015/01/01/part-r-00004-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:46 9439710 elb/plaintext/2015/01/01/part-r-00005-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 0 elb/plaintext/2015/01/01_$folder$
      2016-11-23 17:54:47 9012723 elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 7571816 elb/plaintext/2015/01/02/part-r-00007-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:47 9673393 elb/plaintext/2015/01/02/part-r-00008-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:48 11979218 elb/plaintext/2015/01/02/part-r-00009-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt
      2016-11-23 17:54:48 9546833 elb/plaintext/2015/01/02/part-r-00010-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt

      ALTER TABLE elb_logs_raw_native_part ADD PARTITION (year='2015',month='01',day='01')
      location 's3://athena-examples/elb/plaintext/2015/01/01/'
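When the keys follow a `YYYY/MM/DD/` layout instead of Hive-style `key=value` prefixes, each day needs its own ALTER TABLE statement. A small script can derive those statements from the object keys. The sketch below is an illustration only: `add_partition_statements` is a hypothetical helper, and the key list is a shortened copy of the listing above rather than a live S3 call.

```python
import re

# Matches keys laid out as elb/plaintext/YYYY/MM/DD/<file>.
KEY_PATTERN = re.compile(r"^elb/plaintext/(\d{4})/(\d{2})/(\d{2})/")

# A few keys copied from the listing, including the "_$folder$" marker object.
KEYS = [
    "elb/plaintext/2015/01/01/part-r-00000-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt",
    "elb/plaintext/2015/01/01/part-r-00001-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt",
    "elb/plaintext/2015/01/01_$folder$",
    "elb/plaintext/2015/01/02/part-r-00006-ce65fca5-d6c6-40e6-b1f9-190cc4f93814.txt",
]


def add_partition_statements(keys, table="elb_logs_raw_native_part",
                             bucket="athena-examples"):
    """Build one ALTER TABLE ... ADD PARTITION statement per distinct day."""
    seen = set()
    statements = []
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            continue  # skips marker objects such as "01_$folder$"
        ymd = match.groups()
        if ymd in seen:
            continue
        seen.add(ymd)
        year, month, day = ymd
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(year='{year}',month='{month}',day='{day}') "
            f"location 's3://{bucket}/elb/plaintext/{year}/{month}/{day}/'"
        )
    return statements


statements = add_partition_statements(KEYS)
for statement in statements:
    print(statement)
```

The generated statements can then be run one by one in the Athena console or through the JDBC driver.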
  • 37. [5] AWS Athena [5] Converting to Columnar Formats Apache Parquet & ORC
  • 38. [5] Converting to Columnar Formats (batch data)
      Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.
      The process for converting to columnar formats using an EMR cluster is as follows:
      ● Create an EMR cluster with Hive installed.
      ● In the step section of the cluster create statement, you can specify a script stored in Amazon S3, which points to your input data and creates output data in the columnar format in an Amazon S3 location. In this example, the cluster auto-terminates.
  • 39. [5] Converting to Columnar Formats (streaming data)
      Your Amazon Athena query performance improves if you convert your data into open source columnar formats such as Apache Parquet or ORC.
      The process for converting to columnar formats using an EMR cluster is as follows:
      ● Create an EMR cluster with Spark
      ● Run a Spark Streaming job that reads the data from a Kinesis Stream and writes Parquet files to S3
  • 40. AWS Athena [6] Athena Security Authorization and Access
  • 41. [6] Athena Security
      Amazon offers three ways to control data access:
      ● AWS Identity and Access Management policies
      ● Access Control Lists
      ● Amazon S3 bucket policies
      Users control who can access their data on S3. It's possible to fine-tune security so that different people see different sets of data, and to grant access to other users' data.
  • 42. AWS Athena [7] Service Limits Know your limits and mitigate the risk
  • 43. [7] Service Limits
      You can request a limit increase by contacting AWS Support.
      ● Currently, you can only submit one query at a time and you can only have 5 (five) concurrent queries at one time per account.
      ● Query timeout: 30 minutes
      ● Number of databases: 100
      ● Table: 100 per database
      ● Number of partitions: 20k per table
      ● You may encounter a limit for Amazon S3 buckets per account, which is 100.
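The 20k partitions-per-table limit interacts with the partitioning scheme chosen in Module 4. As a rough illustration, the arithmetic below shows how long a table can grow before hitting that limit. The helper `days_until_limit` is hypothetical, and it assumes exactly one partition per time bucket with no other partition keys.

```python
MAX_PARTITIONS_PER_TABLE = 20_000  # from the slide: 20k partitions per table


def days_until_limit(partitions_per_day, limit=MAX_PARTITIONS_PER_TABLE):
    """Whole days of data that fit under the partition limit."""
    return limit // partitions_per_day


hourly_days = days_until_limit(24)  # YEAR -> MONTH -> DAY -> HOUR scheme
daily_days = days_until_limit(1)    # YEAR -> MONTH -> DAY scheme

print(f"Hourly partitions: {hourly_days} days (~{hourly_days / 365:.1f} years)")
print(f"Daily partitions:  {daily_days} days (~{daily_days / 365:.1f} years)")
```

Adding a second partition key, such as a customer ID, multiplies the partition count and shortens these horizons accordingly.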
  • 44. [7] Known Limitations
      The following are known limitations in Amazon Athena
      ● User-defined functions (UDF or UDAFs) are not supported.
      ● Stored procedures are not supported.
      ● Currently, Athena does not support any transactions found in Hive or Presto. For a full list of keywords not supported, see Unsupported DDL.
      ● LZO is not supported. Use Snappy instead.
  • 45. [7] Avoid Surprises
      Use backticks if table or column names begin with an underscore. For example:
          CREATE TABLE myUnderScoreTable (
            `_id` string,
            `_index` string,
            ...
      For the LOCATION clause, use a trailing slash
      USE
          s3://path_to_bucket/
      DO NOT USE
          s3://path_to_bucket
          s3://path_to_bucket/*
          s3://path_to_bucket/mySpecialFile.dat
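The LOCATION rule above is easy to check before a DDL statement is submitted. The sketch below encodes only the cases listed on the slide (prefix must end with a slash, no wildcards, no single-file paths); `is_valid_location` is a hypothetical helper, not part of any Athena client.

```python
def is_valid_location(location):
    """Return True when the S3 path follows the trailing-slash rule."""
    scheme = "s3://"
    if not location.startswith(scheme):
        return False
    path = location[len(scheme):]
    if not path or "*" in path:
        return False  # empty paths and wildcards are rejected
    return path.endswith("/")  # a file path or bare bucket name fails here


for candidate in [
    "s3://path_to_bucket/",
    "s3://path_to_bucket",
    "s3://path_to_bucket/*",
    "s3://path_to_bucket/mySpecialFile.dat",
]:
    verdict = "USE" if is_valid_location(candidate) else "DO NOT USE"
    print(f"{verdict:<10} {candidate}")
```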
  • 46. AWS Athena [8] Comparing to Google BigQuery Know your limits and mitigate the risk
  • 47. Google BigQuery
      ● Serverless Analytical Columnar Database based on Google Dremel
      ● Data:
          ○ Native Tables
          ○ External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)
      ● Ingestion:
          ○ File Imports
          ○ Streaming API (up to 100K records/sec per table)
          ○ Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)
      ● ANSI SQL 2011
      ● Priced at $5/TB of scanned data + storage + streaming (if used)
      ● Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.
  • 48. Summary

      Feature                | AWS Athena                   | Google BigQuery
      Data Formats           | *SV, JSON, PARQUET/z, ORC/z  | External (*SV, JSON, AVRO) / Native
      ANSI SQL Support       | Yes*                         | Yes*
      DDL Support            | Only CREATE/ALTER/DROP       | CREATE/UPDATE/DELETE (w/ quotas)
      Underlying Technology  | FB Presto                    | Google Dremel
      Caching                | No                           | Yes
      Cold Data Pricing      | S3 Lifecycle Policy          | 50% discount after 90 days of inactivity
      User Defined Functions | No                           | Yes
      Data Partitioning      | On Any Key                   | By DAY
      Pricing                | $5/TB (scanned) plus S3 ops  | $5/TB (scanned) less cached data
  • 49. Test Drive Summary

      Query Type          | AWS Athena (GB/time) | Google BigQuery (GB/time) | t.diff %
      [1] LOOKUP          | 48MB (4.1s)          | 130GB (2.0s)              | -51%
      [2] LOOKUP & AGGR   | 331MB (4.35s)        | 13.4GB (2.7s)             | -48%
      [3] GROUP/ORDER BY  | 5.74GB (8.85s)       | 8.26GB (5.4s)             | -27%
      [4] TEXT FUNCTIONS  | 606MB (11.3s)        | 13.6GB (2.4s)             | -470%
      [5] JSON FUNCTIONS  | 29MB (17.8s)         | 63.9GB (8.9s)             | -100%
      [6] REGEX FUNCTIONS | (1.3s)               | 5.45GB (1.9s)             | +31%
      [7] FEDERATED DATA  | 133GB (19.4s)        | 133GB (36.4s)             | +47%
  • 50. What does Athena do better than BigQuery?
      Advantages:
      ● Can be faster than BigQuery, especially with federated/external tables
      ● Ability to use regex to define a schema (query files without needing to change the format)
      ● Can be faster and cheaper than BigQuery when using a partitioned/columnar format
      ● Tables can be partitioned on any column
      Issues:
      ● It's not easy to convert data between formats
      ● Doesn't support DML, i.e. no insert/update/delete
      ● No built-in ingestion
  • 51. What does BigQuery do better than Athena?
      ● It has native table support, giving it better performance and more features
      ● It's easy to manipulate data, insert/update records and write query results back to a table
      ● Querying native tables is very fast
      ● Easy to convert non-columnar formats into a native table for columnar queries
      ● Supports UDFs, although they will be available in the future for Athena
      ● Supports nested tables (nested and repeated fields)
  • 52. Remember to complete your evaluations ;-) https://goo.gl/T9BZvy