Serverless Big Data Analytics with Amazon Athena and QuickSight

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Robinson, Specialist SA, Analytics, EMEA
10 April 2018

Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.

Serverless and Easy to Use
No infrastructure or administration
• Warm compute pools across multiple AZs
• Data in S3, for HA and high durability
Zero spin-up time
• Connect to a service endpoint
• Start querying
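
As a rough sketch of what "connect to a service endpoint, start querying" can look like from Python: the helper below only assembles the keyword arguments for the boto3 Athena `start_query_execution` call. The helper name, database, table, and bucket are hypothetical placeholders and do not come from the talk.

```python
def build_query_request(sql, database, output_location):
    """Assemble keyword arguments for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_query_request(
    sql="SELECT COUNT(*) FROM trips",                        # hypothetical table
    database="nyc_transport",                                # hypothetical database
    output_location="s3://example-bucket/athena-results/",   # hypothetical bucket
)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   athena = boto3.client("athena")
#   execution_id = athena.start_query_execution(**request)["QueryExecutionId"]
print(request["QueryString"])
```

No cluster is created or sized anywhere in this flow; the only inputs are the SQL text, a database name, and a results location.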

Familiar OSS Technology
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
• EXTERNAL tables – no impact on underlying data
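
To make the DDL bullets concrete, here is a minimal sketch that renders a Hive-style `CREATE EXTERNAL TABLE` statement as a string. The function name, table, columns, and S3 location are invented for illustration; only the general statement shape is assumed to match what Athena accepts.

```python
def external_table_ddl(table, columns, partitions, location):
    """Render a Hive-style CREATE EXTERNAL TABLE statement as a string."""
    cols = ",\n  ".join(f"{name} {type_}" for name, type_ in columns)
    parts = ", ".join(f"{name} {type_}" for name, type_ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    table="trips",                                       # hypothetical table
    columns=[
        ("pickup_time", "timestamp"),
        ("fare", "double"),
        ("tags", "array<string>"),                       # complex type: array
        ("pickup", "struct<lat:double,lon:double>"),     # complex type: struct
    ],
    partitions=[("year", "int"), ("month", "int")],
    location="s3://example-bucket/canonical/trips/",     # hypothetical bucket
)
print(ddl)
```

Because the table is EXTERNAL, dropping it removes only the table definition; the objects under the S3 location stay untouched.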

ANSI SQL
• Complex joins, nested queries, window functions
• Complex data types (arrays, structs)
• Partitioning of data by any key (date, time, custom keys), e.g., Year, Month, Day, Hour, Customer Key, Date
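
One common way to lay out partitioned data in S3 is the Hive-style `key=value` prefix convention. The sketch below builds such a prefix from a timestamp and a customer key; the function name, bucket, and key names are hypothetical and chosen only to mirror the partition keys listed above.

```python
from datetime import datetime


def partition_prefix(base, ts, customer_key):
    """Build a Hive-style key=value S3 prefix for one partition."""
    return (
        f"{base}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"
        f"/hour={ts.hour:02d}/customer={customer_key}/"
    )


prefix = partition_prefix(
    "s3://example-bucket/logs",        # hypothetical bucket and base path
    datetime(2018, 4, 10, 9),
    "acme",                            # hypothetical customer key
)
print(prefix)
```

A query that filters on these keys only needs to read objects under the matching prefixes, which is where the cost saving from partitioning comes from.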

Open Data Formats
• Text files, e.g. CSV, TSV, custom delimiter
• Apache Web Logs, CloudTrail logs
• JSON (simple, nested), Avro
• Columnar formats, e.g. Apache Parquet & Apache ORC
• Logstash Grok for unstructured text files
• Compressed files (Snappy, Zlib, GZIP, and LZO)
• Encrypted data (SSE-S3, SSE-KMS, CSE-KMS)
• Use large (128MB – 1GB) compressed files

Pay Per Query – $5 Per TB of Data Scanned
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset                                  Size on Amazon S3        Query run time   Data scanned             Cost
Logs stored as text files                1 TB                     237 seconds      1.15 TB                  $5.75
Logs stored in Apache Parquet format*    130 GB                   5.13 seconds     2.69 GB                  $0.013
Savings                                  87% less with Parquet    34x faster       99% less data scanned    99.7% cheaper
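
The cost column follows directly from the per-TB price. A minimal sketch of that arithmetic, assuming 1 TB = 1024 GB and ignoring any billing minimums or rounding that apply in practice:

```python
PRICE_PER_TB_USD = 5.0   # price quoted on the slide
GB_PER_TB = 1024         # assumption: binary terabytes


def query_cost_usd(scanned_gb):
    """Cost of one query, given the amount of data scanned in GB."""
    return scanned_gb / GB_PER_TB * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15 * GB_PER_TB)   # 1.15 TB scanned (text files)
parquet_cost = query_cost_usd(2.69)            # 2.69 GB scanned (Parquet)
saving = 1 - parquet_cost / text_cost

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}, saving: {saving:.1%}")
```

The same function can be used to estimate a query before running it, once the size of the columns and partitions it touches is known.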

Apache Parquet and Apache ORC
• Columnar formats
• Store data in columns, not rows
• Support for predicate pushdown
• Filter data where it lives
• Schema segregated into footer
• Integrated compression and indexes
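
The idea behind "filter data where it lives" can be illustrated with a toy column store. This is not how Parquet or ORC are implemented; it only mimics per-chunk min/max statistics so that a filter can skip chunks without reading their values. All names and data are made up.

```python
def make_chunks(values, chunk_size):
    """Split one column into chunks, each carrying min/max statistics."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        chunks.append({"min": min(part), "max": max(part), "values": part})
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks were actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # statistics prove no value can match: skip the chunk
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v > threshold)
    return matches, chunks_read


fares = list(range(1, 101))        # made-up, sorted "fare" column
chunks = make_chunks(fares, 10)    # 10 chunks of 10 values each
matches, chunks_read = scan_greater_than(chunks, 85)
print(f"read {chunks_read} of {len(chunks)} chunks, {len(matches)} matches")
```

Only the chunks whose statistics overlap the predicate are read, and only for the one column the query needs; both effects reduce the data scanned.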

Analytics Value Stream
• Data → Collect → Store → Process/Analyze → Consume → Answers
• Time to first answer

Agile Analytics
• Experiment
• Invest in promising experiments
• Fail fast
• React quickly

Serverless Analytics
• Amazon S3: highly durable object storage
• AWS Glue: data catalog and managed ETL
• Amazon Athena: serverless interactive SQL queries
• Amazon QuickSight: business analytics service

Example: NYC Transportation
[Architecture diagram. Components: Raw S3 Data, ETL Job, Canonical Data, Data Catalog, Amazon Athena, Amazon QuickSight. Edge labels: "describes" (twice), "uses".]

Use Glue to Crawl and ETL the Source Data
• Taxi ETL Job (Taxi CSV): 1.6 GB → 94.8 MB
• Limo ETL Job (Limo CSV): 220.3 MB → 18 MB
[Diagram. Components: Canonical Data (Parquet), Data Catalog. Edge label: "use".]

Start Querying with Amazon Athena
• Run Glue crawler to create canonical table definition
• Run some simple queries
[Diagram. Components: Canonical Data, Amazon Athena, Data Catalog. Edge labels: "describes", "uses".]

Visualise Your Data Lake with Amazon QuickSight
[Diagram. Components: Canonical Data, Amazon QuickSight, Data Catalog.]
Collaborate, Share, and Publish
