SlideShare ist ein Scribd-Unternehmen logo
1 von 30
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daniel Haviv, Big Data Specialist Solutions Architect
June 2017
Interactive Analytics on AWS
Challenges Customers Faced
• Significant amount of work required to analyze data in
Amazon S3
• Users often only have access to aggregated data sets
• Managing a Hadoop cluster or data warehouse requires
expertise
Introducing Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
A Sample Pipeline
A Sample Pipeline
Ad-hoc access to raw data using SQL
A Sample Pipeline
Ad-hoc access to data using Athena
Athena can query
aggregated datasets as well
Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades
• Highly Available
• You connect to a service
endpoint or log into the
console
Amazon Athena is Easy To Use
• Log into the Console
• Create a table
• Type in a Hive DDL Statement
• Use the console Add Table wizard
• Start querying
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best
performance and lowest cost
• No ETL required
• Stream data from directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested
queries & window functions
• Support for complex data types
(arrays, structs)
• Support for partitioning of data by
any key
• (date, time, custom keys)
• e.g., Year, Month, Day, Hour or
Customer Key, Date
Familiar Technologies Under the Covers
Used for SQL Queries
In-memory distributed query engine
ANSI-SQL compatible with extensions
Used for DDL functionality
Complex data types
Multitude of formats
Supports data partitioning
Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
• Anyone looking to process data stored in Amazon S3
• Data coming IOT Devices, Apache Logs, Omniture logs, CF logs,
Application Logs
• Anyone who knows SQL
• Both developers or Analysts
• Ad-hoc exploration of data and data discovery
• Customers looking to build a data lake on Amazon S3
Who is Athena for?
Accessing Amazon Athena
• Through the console
• Via the AWS API
• Using a JDBC client (either plain SQL client or BI tools)
Creating Tables and Querying Data
Example
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
External = creates a view of this data.
When you delete the table, the data is
not deleted
Example
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
Location = where data is stored.
In Athena this is mandated to be
in Amazon S3
Partitioning in Athena
- Follows Hive Semantics – i.e. Partitioned by any column
- Partition Column is not part of the data itself, but it still
queriable.
- Creates subdirectories/prefixes for each partition
- (alter table add partition date=XXX year =XXX)
- MSCK Repair Table
Example
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
Partitioning allows you to limit what your
query runs on
Example
CREATE EXTERNAL TABLE access_logs
(
ip_address String,
request_time Timestamp,
request_method String,
request_path String,
request_protocol String,
response_code String,
response_size String,
referrer_host String,
user_agent String
)
PARTITIONED BY (year STRING,month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS TEXTFILE
LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/'
Defines how your files are stored
Columnar Formats
Flat File Sample Layout
First Name Last Name Age Gender
Tootsie Label 34 Female
Miriam Le	Fleming 25 Female
Blakeley Lisciandro 45 Female
Ernst Minghi 63 Male
Brew Jime 22 Male
Tootsie
Miriam
Blakeley
Ernst
Brew
Label
Le Fleming
Lisciandro
Minghi
Jime
34
25
45
63
22
Female
Female
Female
Male
Male
First Name Last Name Age Gender
MIN: Blakeley
MAX: Tootsie
MIN: Jime
MAX: Minghi
MIN: 22
MAX: 63
MIN: Female
MAX: Male
• Projection pushdown
• Predicate pushdown
• Improved compression ratios
Columnar Formats Layout (Parquet & ORC)
Converting to ORC and PARQUET
• You can use Hive CTAS to convert data
• CREATE TABLE new_key_value_store
• STORED AS PARQUET
• AS
• SELECT col_1, col2, col3 FROM noncolumartable
• SORT BY new_key, key_value_pair;
• You can also use Spark to convert the file into PARQUET / ORC
• 20 lines of Pyspark code, running on EMR
• Converts 1TB of text data into 130 GB of Parquet with snappy conversion
• Total cost $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Pay By the Query - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text
files
1 TB 237 seconds 1.15TB $5.75
Logs stored in
Apache Parquet
format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
Amazon QuickSight is a Business Analytics Service that lets business users
quickly and easily visualize, explore, and share insights from their data.
QuickSight is designed to make it
easy and cost effective to enable
users across your organization to
consume data and perform ad-hoc
analytics.
Who Is QuickSight For?
Business Professionals
Data Consumers
Data
Professionals
QuickSight is deeply integrated
with AWS data sources like
Redshift, RDS, S3, Athena, EMR
(Spark & Presto) as well as third-
party sources like Excel,
Salesforce and more.
Deep Integration with
AWS Data Sources
Amazon RDS,
Aurora
Amazon
Redshift
Amazon
Athena
Amazon S3
Flat Files
Amazon
EMR
Demo
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
High Performance Computing Implementation on AWS
High Performance Computing Implementation on AWSHigh Performance Computing Implementation on AWS
High Performance Computing Implementation on AWS
 
Log Analytics with Amazon Elasticsearch Service & Kibana
Log Analytics with Amazon Elasticsearch Service & KibanaLog Analytics with Amazon Elasticsearch Service & Kibana
Log Analytics with Amazon Elasticsearch Service & Kibana
 
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech TalksTackle Your Dark Data  Challenge with AWS Glue - AWS Online Tech Talks
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech Talks
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Amazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto MeetupAmazon EMR Facebook Presto Meetup
Amazon EMR Facebook Presto Meetup
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Introduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis AnalyticsIntroduction to Amazon Kinesis Analytics
Introduction to Amazon Kinesis Analytics
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Data Warehousing in the Era of Big Data: Intro to Amazon Redshift
Data Warehousing in the Era of Big Data: Intro to Amazon RedshiftData Warehousing in the Era of Big Data: Intro to Amazon Redshift
Data Warehousing in the Era of Big Data: Intro to Amazon Redshift
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
AWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache StormAWS Webcast - Amazon Kinesis and Apache Storm
AWS Webcast - Amazon Kinesis and Apache Storm
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
BDA305 NEW LAUNCH! Intro to Amazon Redshift Spectrum: Now query exabytes of d...
 
(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data
 
(WRK302) Event-Driven Programming
(WRK302) Event-Driven Programming(WRK302) Event-Driven Programming
(WRK302) Event-Driven Programming
 

Ähnlich wie Interactive Analytics on AWS - AWS Summit Tel Aviv 2017

AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
Amazon Web Services
 

Ähnlich wie Interactive Analytics on AWS - AWS Summit Tel Aviv 2017 (20)

NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQLNEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
NEW LAUNCH! Intro to Amazon Athena. Analyze data in S3, using SQL
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
 
Serverless Big Data Analytics with Amazon Athena and QuickSight
Serverless Big Data Analytics with Amazon Athena and QuickSightServerless Big Data Analytics with Amazon Athena and QuickSight
Serverless Big Data Analytics with Amazon Athena and QuickSight
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3Supercharging the Value of Your Data with Amazon S3
Supercharging the Value of Your Data with Amazon S3
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017
 
Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2Scalable Data Analytics - DevDay Austin 2017 Day 2
Scalable Data Analytics - DevDay Austin 2017 Day 2
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight - May...
Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight - May...Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight - May...
Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight - May...
 

Mehr von Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Interactive Analytics on AWS - AWS Summit Tel Aviv 2017

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Daniel Haviv, Big Data Specialist Solutions Architect June 2017 Interactive Analytics on AWS
  • 2. Challenges Customers Faced • Significant amount of work required to analyze data in Amazon S3 • Users often only have access to aggregated data sets • Managing a Hadoop cluster or data warehouse requires expertise
  • 3. Introducing Amazon Athena Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  • 5. A Sample Pipeline Ad-hoc access to raw data using SQL
  • 6. A Sample Pipeline Ad-hoc access to data using Athena Athena can query aggregated datasets as well
  • 7. Athena is Serverless • No Infrastructure or administration • Zero Spin up time • Transparent upgrades • Highly Available • You connect to a service endpoint or log into the console
  • 8. Amazon Athena is Easy To Use • Log into the Console • Create a table • Type in a Hive DDL Statement • Use the console Add Table wizard • Start querying
  • 9. Query Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data from directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  • 10. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key • (date, time, custom keys) • e.g., Year, Month, Day, Hour or Customer Key, Date
  • 11. Familiar Technologies Under the Covers Used for SQL Queries In-memory distributed query engine ANSI-SQL compatible with extensions Used for DDL functionality Complex data types Multitude of formats Supports data partitioning
  • 12. Amazon Athena is Cost Effective • Pay per query • $5 per TB scanned from S3 • DDL Queries and failed queries are free • Save by using compression, columnar formats, partitions
  • 13. • Anyone looking to process data stored in Amazon S3 • Data coming IOT Devices, Apache Logs, Omniture logs, CF logs, Application Logs • Anyone who knows SQL • Both developers or Analysts • Ad-hoc exploration of data and data discovery • Customers looking to build a data lake on Amazon S3 Who is Athena for?
  • 14. Accessing Amazon Athena • Through the console • Via the AWS API • Using a JDBC client (either plain SQL client or BI tools)
  • 15. Creating Tables and Querying Data
  • 16. Example CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, request_method String, request_path String, request_protocol String, response_code String, response_size String, referrer_host String, user_agent String ) PARTITIONED BY (year STRING,month STRING, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/' External = creates a view of this data. When you delete the table, the data is not deleted
  • 17. Example CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, request_method String, request_path String, request_protocol String, response_code String, response_size String, referrer_host String, user_agent String ) PARTITIONED BY (year STRING,month STRING, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/' Location = where data is stored. In Athena this is mandated to be in Amazon S3
  • 18. Partitioning in Athena - Follows Hive Semantics – i.e. Partitioned by any column - Partition Column is not part of the data itself, but it still queriable. - Creates subdirectories/prefixes for each partition - (alter table add partition date=XXX year =XXX) - MSCK Repair Table
  • 19. Example CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, request_method String, request_path String, request_protocol String, response_code String, response_size String, referrer_host String, user_agent String ) PARTITIONED BY (year STRING,month STRING, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/' Partitioning allows you to limit what your query runs on
  • 20. Example CREATE EXTERNAL TABLE access_logs ( ip_address String, request_time Timestamp, request_method String, request_path String, request_protocol String, response_code String, response_size String, referrer_host String, user_agent String ) PARTITIONED BY (year STRING,month STRING, day STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' STORED AS TEXTFILE LOCATION 's3://YOUR-S3-BUCKET-NAME/access-log-processed/' Defines how your files are stored
  • 22. Flat File Sample Layout First Name Last Name Age Gender Tootsie Label 34 Female Miriam Le Fleming 25 Female Blakeley Lisciandro 45 Female Ernst Minghi 63 Male Brew Jime 22 Male
  • 23. Tootsie Miriam Blakeley Ernst Brew Label Le Fleming Lisciandro Minghi Jime 34 25 45 63 22 Female Female Female Male Male First Name Last Name Age Gender MIN: Blakeley MAX: Tootsie MIN: Jime MAX: Minghi MIN: 22 MAX: 63 MIN: Female MAX: Male • Projection pushdown • Predicate pushdown • Improved compression ratios Columnar Formats Layout (Parquet & ORC)
  • 24. Converting to ORC and PARQUET • You can use Hive CTAS to convert data • CREATE TABLE new_key_value_store • STORED AS PARQUET • AS • SELECT col_1, col2, col3 FROM noncolumartable • SORT BY new_key, key_value_pair; • You can also use Spark to convert the file into PARQUET / ORC • 20 lines of Pyspark code, running on EMR • Converts 1TB of text data into 130 GB of Parquet with snappy conversion • Total cost $5 https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 25. Pay By the Query - $5/TB Scanned • Pay by the amount of data scanned per query • Ways to save costs • Compress • Convert to Columnar format • Use partitioning Dataset Size on Amazon S3 Query Run time Data Scanned Cost Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75 Logs stored in Apache Parquet format* 130 GB 5.13 seconds 2.69 GB $0.013 Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
  • 26. Amazon QuickSight is a Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data.
  • 27. QuickSight is designed to make it easy and cost effective to enable users across your organization to consume data and perform ad-hoc analytics. Who Is QuickSight For? Business Professionals Data Consumers Data Professionals
  • 28. QuickSight is deeply integrated with AWS data sources like Redshift, RDS, S3, Athena, EMR (Spark & Presto) as well as third- party sources like Excel, Salesforce and more. Deep Integration with AWS Data Sources Amazon RDS, Aurora Amazon Redshift Amazon Athena Amazon S3 Flat Files Amazon EMR
  • 29. Demo