SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Š 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Abhishek Sinha (@abysinha, sinhaar@)
Senior PM, Amazon Athena
December 14, 2016
Introduction to Amazon Athena
Interactive, Serverless, Pay-per-use, Query Service
What to Expect from the Session
• Overview of Amazon Athena
• Key Features
• Customer Examples
• Troubleshooting Query errors
• Q&A
Challenges Customers Faced
• Significant amount of work required to analyze data in
Amazon S3
• Users often only have access to aggregated data sets
• Managing a Hadoop cluster or data warehouse requires
expertise
Introducing Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades
Amazon Athena is Easy To Use
• Log into the Console
• Create a table
• Type in a Hive DDL Statement
• Use the console Add Table wizard
• Start querying
Amazon Athena is Highly Available
• You connect to a service endpoint or log into the console
• Athena uses warm compute pools across multiple
Availability Zones
• Your data is in Amazon S3, which is also highly available
and designed for 99.999999999% durability
Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best
performance and lowest cost
• No ETL required
• Stream data from directly from Amazon S3
• Take advantage of Amazon S3 durability and availability
Use ANSI SQL
• Start writing ANSI SQL
• Support for complex joins, nested
queries & window functions
• Support for complex data types
(arrays, structs)
• Support for partitioning of data by
any key
• (date, time, custom keys)
• e.g., Year, Month, Day, Hour or
Customer Key, Date
Familiar Technologies Under the Covers
Used for SQL Queries
In-memory distributed query engine
ANSI-SQL compatible with extensions
Used for DDL functionality
Complex data types
Multitude of formats
Supports data partitioning
Amazon Athena Supports Multiple Data Formats
• Text files, e.g., CSV, raw logs
• Apache Web Logs, TSV files
• JSON (simple, nested)
• Compressed files
• Columnar formats such as Apache Parquet & Apache ORC
• AVRO support – coming soon
Amazon Athena is Fast
• Tuned for performance
• Automatically parallelizes
queries
• Results are streamed to console
• Results also stored in S3
• Improve Query performance
• Compress your data
• Use columnar formats
Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
• Any one looking to process data stored in Amazon S3
• Data coming IOT Devices, Apache Logs, Omniture logs, CF logs,
Application Logs
• Anyone who knows SQL
• Both developers or Analysts
• Ad-hoc exploration of data and data discovery
• Customers looking to build a data lake on Amazon S3
Who is Athena for ?
A Sample Pipeline
A Sample Pipeline
Ad-hoc access to raw data using SQL
A Sample Pipeline
Ad-hoc access to data using Athena
Athena can query
aggregated datasets as well
Re-visiting Challenges
Significant amount of work required to analyze data in Amazon S3
No ETL required. No loading of data. Query data where it lives
Users often only have access to aggregated data sets
Query data at whatever granularity you want
Managing a Hadoop cluster or data warehouse requires expertise
No infrastructure to manage
Accessing Amazon Athena
Use the JDBC Driver
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-
premises sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena
Using Amazon Athena with Amazon QuickSight
JDBC also Provides Programmatic Access
/* Setup the driver */
Properties info = new Properties();
info.put("user", "AWSAccessKey");
info.put("password", "AWSSecretAccessKey");
info.put("s3_staging_dir", "s3://S3 Bucket Location/");
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
Connection connection = DriverManager.getConnection("jdbc:awsathena://athena.us-east-
1.amazonaws.com:443/", info);
Creating a Table and Executing a Query
/* Create a table */
Statement statement = connection.createStatement();
ResultSet queryResults = statement.executeQuery("CREATE EXTERNAL TABLE tableName ( Col1
String ) LOCATION ‘s3://bucket/tableLocation");
/* Execute a Query */
Statement statement = connection.createStatement();
ResultSet queryResults = statement.executeQuery("SELECT * FROM cloudfront_logs");
Creating Tables and Querying Data
Creating Tables - Concepts
• Create Table Statements (or DDL) are written in Hive
• High degree of flexibility
• Schema on Read
• Hive is SQL like but allows other concepts such “external
tables” and partitioning of data
• Data formats supported – JSON, TXT, CSV, TSV, Parquet
and ORC (via Serdes)
• Data in stored in Amazon S3
• Metadata is stored in an a metadata store
Athena’s Internal Metadata Store
• Stores Metadata
• Table definition, column names, partitions
• Highly available and durable
• Requires no management
• Access via DDL statements
• Similar to a Hive Metastore
Running Queries is Simple
Run time
and data
scanned
PARQUET
• Columnar format
• Schema segregated into footer
• Column major format
• All data is pushed to the leaf
• Integrated compression and
indexes
• Support for predicate
pushdown
Apache Parquet and Apache ORC – Columnar Formats
ORC
• Apache Top level project
• Schema segregated into footer
• Column major with stripes
• Integrated compression,
indexes, and stats
• Support for Predicate
Pushdown
Converting to ORC and PARQUET
• You can use Hive CTAS to convert data
• CREATE TABLE new_key_value_store
• STORED AS PARQUET
• AS
• SELECT col_1, col2, col3 FROM noncolumartable
• SORT BY new_key, key_value_pair;
• You can also use Spark to convert the file into PARQUET / ORC
• 20 lines of Pyspark code, running on EMR
• Converts 1TB of text data into 130 GB of Parquet with snappy conversion
• Total cost $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Pay By the Query - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text
files
1 TB 237 seconds 1.15TB $5.75
Logs stored in
Apache Parquet
format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
Use Cases
Athena Complements Amazon Redshift & Amazon EMR
Amazon S3
EMR Athena
QuickSight
Redshift
Customers Using Athena
DataXu – 180TB of Log Data per Day
CDN
Real Time
Bidding
Retargeting
Platform
Kinesis Attribution & ML
S3
Reporting
Data Visualization
Data
Pipeline
ETL(Spark SQL)
Ecosystem of tools and services
Amazon Athena
Tips and Tricks
Created a Table, but do not see data
• Verify that the input LOCATION is correct on S3
• Buckets should be specified as s3://name/ or s3://name/subfolder/
• s3://us-east-1.amazonaws.com/bucket/path will throw an error s3://bucket/path/
• Did the table have partitions ?
• MSCK Repair Table for
• s3://mybucket/athena/inputdata/year=2016/data.csv
• s3://mybucket/athena/inputdata/year=2015/data.csv
• s3://mybucket/athena/inputdata/year=2014/data.csv
• Alter Table Add Partition for
• s3://mybucket/athena/inputdata/2016/data.csv
• s3://mybucket/athena/inputdata/2015/data.csv
• s3://mybucket/athena/inputdata/2014/data.csv
• ALTER TABLE Employee ADD
• PARTITION (year=2016) LOCATION s3://mybucket/athena/inputdata/2016/'
• PARTITION (year=2015) LOCATION s3://mybucket/athena/inputdata/2015/'
• PARTITION (year=2014) LOCATION s3://mybucket/athena/inputdata/2014/'
How did you define your partitions
CREATE EXTERNAL TABLE Employee (
Id INT,
Name STRING,
Address STRING
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION
‘s3://mybucket/athena/inputdata/’;
CREATE EXTERNAL TABLE Employee (
Id INT,
Name STRING,
Address STRING,
INT Year
) PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY ','
LOCATION
‘s3://mybucket/athena/inputdata/’;
Reading JSON Data
• Make sure you are using the right Serde
• Native JSON Serde org.apache.hive.hcatalog.data.JsonSerDe
• OpenX SerDe (org.openx.data.jsonserde.JsonSerDe)
• Make sure JSON record is a single line
• Generate your data in case-insensitive columns
• Provide an option to ignore malformed records
CREATE EXTERNAL TABLE json (
a string,
b int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://bucket/path/';
Access Denied Issues
Check the IAM Policy
Refer to the Getting Started Documentation
Check the bucket ACL
Both the read bucket and write bucket
Table names
• Use backticks if table names begin with an underscore.
For example:
CREATE TABLE myUnderScoreTable (
`_id` string,
`_index`string,
• Table name can only have underscores as special characters
Thank you!
sinhaar@amazon.com
@abysinha

Weitere ähnliche Inhalte

Was ist angesagt?

민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
Amazon Web Services Korea
 
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
Amazon Web Services Korea
 

Was ist angesagt? (20)

What is AWS Glue
What is AWS GlueWhat is AWS Glue
What is AWS Glue
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
AWS for Backup and Recovery
AWS for Backup and RecoveryAWS for Backup and Recovery
AWS for Backup and Recovery
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive AWS Data Transfer Services Deep Dive
AWS Data Transfer Services Deep Dive
 
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
 
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
 
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage OverviewAmazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage Overview
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
AWS Cloud 환경으로​ DB Migration 전략 수립하기
AWS Cloud 환경으로​ DB Migration 전략 수립하기AWS Cloud 환경으로​ DB Migration 전략 수립하기
AWS Cloud 환경으로​ DB Migration 전략 수립하기
 
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
AWS DMS를 통한 오라클 DB 마이그레이션 방법 - AWS Summit Seoul 2017
 
Oracle DB를 AWS로 이관하는 방법들 - 서호석 클라우드 사업부/컨설팅팀 이사, 영우디지탈 :: AWS Summit Seoul 2021
Oracle DB를 AWS로 이관하는 방법들 - 서호석 클라우드 사업부/컨설팅팀 이사, 영우디지탈 :: AWS Summit Seoul 2021Oracle DB를 AWS로 이관하는 방법들 - 서호석 클라우드 사업부/컨설팅팀 이사, 영우디지탈 :: AWS Summit Seoul 2021
Oracle DB를 AWS로 이관하는 방법들 - 서호석 클라우드 사업부/컨설팅팀 이사, 영우디지탈 :: AWS Summit Seoul 2021
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Intro to AWS Lambda
Intro to AWS Lambda Intro to AWS Lambda
Intro to AWS Lambda
 
Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost Management
 
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
 
AWS EC2
AWS EC2AWS EC2
AWS EC2
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 

Andere mochten auch

ServerlessPresentation
ServerlessPresentationServerlessPresentation
ServerlessPresentation
Rohit Kumar
 

Andere mochten auch (20)

Amazon Athena Hands-On Workshop
Amazon Athena Hands-On WorkshopAmazon Athena Hands-On Workshop
Amazon Athena Hands-On Workshop
 
An overview of Amazon Athena
An overview of Amazon AthenaAn overview of Amazon Athena
An overview of Amazon Athena
 
AWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon AthenaAWS Black Belt Online Seminar 2017 Amazon Athena
AWS Black Belt Online Seminar 2017 Amazon Athena
 
Aws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon AthenaAws Atlanta meetup Amazon Athena
Aws Atlanta meetup Amazon Athena
 
Big Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon AthenaBig Data answers in seconds with Amazon Athena
Big Data answers in seconds with Amazon Athena
 
Aws meetup aws_waf
Aws meetup aws_wafAws meetup aws_waf
Aws meetup aws_waf
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
Google Cloud Spanner Preview
Google Cloud Spanner PreviewGoogle Cloud Spanner Preview
Google Cloud Spanner Preview
 
Towards Full Stack Security
Towards Full Stack SecurityTowards Full Stack Security
Towards Full Stack Security
 
Digital Transformation through Product and Service Innovation
Digital Transformation through Product and Service InnovationDigital Transformation through Product and Service Innovation
Digital Transformation through Product and Service Innovation
 
Getting started with Amazon Redshift
Getting started with Amazon RedshiftGetting started with Amazon Redshift
Getting started with Amazon Redshift
 
Getting started with aws io t.compressed.compressed
Getting started with aws io t.compressed.compressedGetting started with aws io t.compressed.compressed
Getting started with aws io t.compressed.compressed
 
Insider
InsiderInsider
Insider
 
如何快速開發與測試App
如何快速開發與測試App如何快速開發與測試App
如何快速開發與測試App
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
ServerlessPresentation
ServerlessPresentationServerlessPresentation
ServerlessPresentation
 
Combining OpenWhisk (serverless), Open API (swagger) and API Connect to build...
Combining OpenWhisk (serverless), Open API (swagger) and API Connect to build...Combining OpenWhisk (serverless), Open API (swagger) and API Connect to build...
Combining OpenWhisk (serverless), Open API (swagger) and API Connect to build...
 
Openstack taskflow 簡介
Openstack taskflow 簡介Openstack taskflow 簡介
Openstack taskflow 簡介
 
The Connected Home: Managing and Innovating with Offline Devices
The Connected Home: Managing and Innovating with Offline DevicesThe Connected Home: Managing and Innovating with Offline Devices
The Connected Home: Managing and Innovating with Offline Devices
 
The Serverless Paradigm, OpenWhisk and FIWARE
The Serverless Paradigm, OpenWhisk and FIWAREThe Serverless Paradigm, OpenWhisk and FIWARE
The Serverless Paradigm, OpenWhisk and FIWARE
 

Ähnlich wie Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL

AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
Amazon Web Services Korea
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
Amazon Web Services
 

Ähnlich wie Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL (20)

NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
NEW LAUNCH! Intro to Amazon Athena. Easily analyze data in S3, using SQL.
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
使用 Amazon Athena 直接分析儲存於 S3 的巨量資料
 
Los Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep DiveLos Angeles AWS Users Group - Athena Deep Dive
Los Angeles AWS Users Group - Athena Deep Dive
 
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
Interactive Analytics on AWS - AWS Summit Tel Aviv 2017
 
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and QuicksightServerlesss Big Data Analytics with Amazon Athena and Quicksight
Serverlesss Big Data Analytics with Amazon Athena and Quicksight
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
What is Amazon Athena
What is Amazon AthenaWhat is Amazon Athena
What is Amazon Athena
 
Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017Denver AWS Users' Group meeting - September 2017
Denver AWS Users' Group meeting - September 2017
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
What's New with Big Data Analytics
What's New with Big Data AnalyticsWhat's New with Big Data Analytics
What's New with Big Data Analytics
 
Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3Querying and Analyzing Data in Amazon S3
Querying and Analyzing Data in Amazon S3
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3Interactively Querying Large-scale Datasets on Amazon S3
Interactively Querying Large-scale Datasets on Amazon S3
 
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
Replicate & Manage Data Using Managed Databases & Serverless Technologies (DA...
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 

Mehr von Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalitĂ  Server...
Big Data per le Startup: come creare applicazioni Big Data in modalitĂ  Server...Big Data per le Startup: come creare applicazioni Big Data in modalitĂ  Server...
Big Data per le Startup: come creare applicazioni Big Data in modalitĂ  Server...
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 
Come costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWSCome costruire un'architettura Serverless nel Cloud AWS
Come costruire un'architettura Serverless nel Cloud AWS
 

KĂźrzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

KĂźrzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL

  • 1. Š 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Abhishek Sinha (@abysinha, sinhaar@) Senior PM, Amazon Athena December 14, 2016 Introduction to Amazon Athena Interactive, Serverless, Pay-per-use, Query Service
  • 2. What to Expect from the Session • Overview of Amazon Athena • Key Features • Customer Examples • Troubleshooting Query errors • Q&A
  • 3. Challenges Customers Faced • Significant amount of work required to analyze data in Amazon S3 • Users often only have access to aggregated data sets • Managing a Hadoop cluster or data warehouse requires expertise
  • 4. Introducing Amazon Athena Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  • 5. Athena is Serverless • No Infrastructure or administration • Zero Spin up time • Transparent upgrades
  • 6. Amazon Athena is Easy To Use • Log into the Console • Create a table • Type in a Hive DDL Statement • Use the console Add Table wizard • Start querying
  • 7. Amazon Athena is Highly Available • You connect to a service endpoint or log into the console • Athena uses warm compute pools across multiple Availability Zones • Your data is in Amazon S3, which is also highly available and designed for 99.999999999% durability
  • 8. Query Data Directly from Amazon S3 • No loading of data • Query data in its raw format • Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No ETL required • Stream data from directly from Amazon S3 • Take advantage of Amazon S3 durability and availability
  • 9. Use ANSI SQL • Start writing ANSI SQL • Support for complex joins, nested queries & window functions • Support for complex data types (arrays, structs) • Support for partitioning of data by any key • (date, time, custom keys) • e.g., Year, Month, Day, Hour or Customer Key, Date
  • 10. Familiar Technologies Under the Covers Used for SQL Queries In-memory distributed query engine ANSI-SQL compatible with extensions Used for DDL functionality Complex data types Multitude of formats Supports data partitioning
  • 11. Amazon Athena Supports Multiple Data Formats • Text files, e.g., CSV, raw logs • Apache Web Logs, TSV files • JSON (simple, nested) • Compressed files • Columnar formats such as Apache Parquet & Apache ORC • AVRO support – coming soon
  • 12. Amazon Athena is Fast • Tuned for performance • Automatically parallelizes queries • Results are streamed to console • Results also stored in S3 • Improve Query performance • Compress your data • Use columnar formats
  • 13. Amazon Athena is Cost Effective • Pay per query • $5 per TB scanned from S3 • DDL Queries and failed queries are free • Save by using compression, columnar formats, partitions
  • 14. • Any one looking to process data stored in Amazon S3 • Data coming IOT Devices, Apache Logs, Omniture logs, CF logs, Application Logs • Anyone who knows SQL • Both developers or Analysts • Ad-hoc exploration of data and data discovery • Customers looking to build a data lake on Amazon S3 Who is Athena for ?
  • 16. A Sample Pipeline Ad-hoc access to raw data using SQL
  • 17. A Sample Pipeline Ad-hoc access to data using Athena Athena can query aggregated datasets as well
  • 18. Re-visiting Challenges Significant amount of work required to analyze data in Amazon S3 No ETL required. No loading of data. Query data where it lives Users often only have access to aggregated data sets Query data at whatever granularity you want Managing a Hadoop cluster or data warehouse requires expertise No infrastructure to manage
  • 20. Use the JDBC Driver
  • 21. QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on- premises sources including Amazon Athena Amazon RDS Amazon S3 Amazon Redshift Amazon Athena Using Amazon Athena with Amazon QuickSight
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27. JDBC also Provides Programmatic Access /* Setup the driver */ Properties info = new Properties(); info.put("user", "AWSAccessKey"); info.put("password", "AWSSecretAccessKey"); info.put("s3_staging_dir", "s3://S3 Bucket Location/"); Class.forName("com.amazonaws.athena.jdbc.AthenaDriver"); Connection connection = DriverManager.getConnection("jdbc:awsathena://athena.us-east- 1.amazonaws.com:443/", info);
  • 28. Creating a Table and Executing a Query /* Create a table */ Statement statement = connection.createStatement(); ResultSet queryResults = statement.executeQuery("CREATE EXTERNAL TABLE tableName ( Col1 String ) LOCATION ‘s3://bucket/tableLocation"); /* Execute a Query */ Statement statement = connection.createStatement(); ResultSet queryResults = statement.executeQuery("SELECT * FROM cloudfront_logs");
  • 29. Creating Tables and Querying Data
  • 30. Creating Tables - Concepts • Create Table Statements (or DDL) are written in Hive • High degree of flexibility • Schema on Read • Hive is SQL like but allows other concepts such “external tables” and partitioning of data • Data formats supported – JSON, TXT, CSV, TSV, Parquet and ORC (via Serdes) • Data in stored in Amazon S3 • Metadata is stored in an a metadata store
  • 31. Athena’s Internal Metadata Store • Stores Metadata • Table definition, column names, partitions • Highly available and durable • Requires no management • Access via DDL statements • Similar to a Hive Metastore
  • 32. Running Queries is Simple Run time and data scanned
  • 33. PARQUET • Columnar format • Schema segregated into footer • Column major format • All data is pushed to the leaf • Integrated compression and indexes • Support for predicate pushdown Apache Parquet and Apache ORC – Columnar Formats ORC • Apache Top level project • Schema segregated into footer • Column major with stripes • Integrated compression, indexes, and stats • Support for Predicate Pushdown
  • 34. Converting to ORC and PARQUET • You can use Hive CTAS to convert data • CREATE TABLE new_key_value_store • STORED AS PARQUET • AS • SELECT col_1, col2, col3 FROM noncolumartable • SORT BY new_key, key_value_pair; • You can also use Spark to convert the file into PARQUET / ORC • 20 lines of Pyspark code, running on EMR • Converts 1TB of text data into 130 GB of Parquet with snappy conversion • Total cost $5 https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 35. Pay By the Query - $5/TB Scanned • Pay by the amount of data scanned per query • Ways to save costs • Compress • Convert to Columnar format • Use partitioning • Free: DDL Queries, Failed Queries Dataset Size on Amazon S3 Query Run time Data Scanned Cost Logs stored as Text files 1 TB 237 seconds 1.15TB $5.75 Logs stored in Apache Parquet format* 130 GB 5.13 seconds 2.69 GB $0.013 Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper
  • 37. Athena Complements Amazon Redshift & Amazon EMR Amazon S3 EMR Athena QuickSight Redshift
  • 39. DataXu – 180TB of Log Data per Day CDN Real Time Bidding Retargeting Platform Kinesis Attribution & ML S3 Reporting Data Visualization Data Pipeline ETL(Spark SQL) Ecosystem of tools and services Amazon Athena
  • 40.
  • 42. Created a Table, but do not see data • Verify that the input LOCATION is correct on S3 • Buckets should be specified as s3://name/ or s3://name/subfolder/ • s3://us-east-1.amazonaws.com/bucket/path will throw an error s3://bucket/path/ • Did the table have partitions ? • MSCK Repair Table for • s3://mybucket/athena/inputdata/year=2016/data.csv • s3://mybucket/athena/inputdata/year=2015/data.csv • s3://mybucket/athena/inputdata/year=2014/data.csv • Alter Table Add Partition for • s3://mybucket/athena/inputdata/2016/data.csv • s3://mybucket/athena/inputdata/2015/data.csv • s3://mybucket/athena/inputdata/2014/data.csv • ALTER TABLE Employee ADD • PARTITION (year=2016) LOCATION s3://mybucket/athena/inputdata/2016/' • PARTITION (year=2015) LOCATION s3://mybucket/athena/inputdata/2015/' • PARTITION (year=2014) LOCATION s3://mybucket/athena/inputdata/2014/'
  • 43. How did you define your partitions CREATE EXTERNAL TABLE Employee ( Id INT, Name STRING, Address STRING ) PARTITIONED BY (year INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION ‘s3://mybucket/athena/inputdata/’; CREATE EXTERNAL TABLE Employee ( Id INT, Name STRING, Address STRING, INT Year ) PARTITIONED BY (year INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION ‘s3://mybucket/athena/inputdata/’;
  • 44. Reading JSON Data • Make sure you are using the right Serde • Native JSON Serde org.apache.hive.hcatalog.data.JsonSerDe • OpenX SerDe (org.openx.data.jsonserde.JsonSerDe) • Make sure JSON record is a single line • Generate your data in case-insensitive columns • Provide an option to ignore malformed records CREATE EXTERNAL TABLE json ( a string, b int ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true') LOCATION 's3://bucket/path/';
  • 45. Access Denied Issues Check the IAM Policy Refer to the Getting Started Documentation Check the bucket ACL Both the read bucket and write bucket
  • 46. Table names • Use backticks if table names begin with an underscore. For example: CREATE TABLE myUnderScoreTable ( `_id` string, `_index`string, • Table name can only have underscores as special characters