This session walks through best-practice architectures for big data analytics on AWS and introduces the features and latest capabilities of Amazon Athena, an interactive query service that makes it easy to analyze data stored in Amazon S3 using standard SQL, along with customer case studies.
Speaker: Greg Khairallah, Business Development Manager, Amazon Big Data and Athena, Amazon Web Services
3. Amazon S3 as your persistent data store
• Amazon S3
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down clusters with no data loss
• Match use case to Analytics services
• Easily evolve your analytic infrastructure as technology evolves
(Diagram: Amazon S3 as the central data store, with Amazon EMR, Amazon Athena, and Amazon Redshift clusters reading from it)
4. Data Ingestion into S3
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
AWS Storage Gateway
S3 Transfer Acceleration
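Of these options, Amazon Kinesis Firehose is the simplest way to stream records into S3 without managing any servers. A minimal boto3 sketch, assuming a delivery stream named `events-to-s3` (a hypothetical name) already exists and delivers to an S3 bucket:

```python
import json
import boto3

# Firehose buffers incoming records and delivers them to S3 in batches.
firehose = boto3.client("firehose", region_name="us-east-1")

def send_event(event: dict) -> None:
    # Each record must be bytes; newline-delimited JSON keeps the resulting
    # S3 objects easy to query later with Athena.
    firehose.put_record(
        DeliveryStreamName="events-to-s3",  # hypothetical stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 42, "action": "page_view"})
```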
5. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: multiple clusters access the same data
Query data in place using open file formats
Full Amazon Redshift SQL support
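To make "query data in place" concrete: Spectrum queries go through a normal Redshift connection, with external tables in the catalog pointing at files in S3. A hedged sketch using the psycopg2 driver; the endpoint, schema, and table names below are illustrative assumptions, not from the talk:

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol, so psycopg2 works as a driver.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439, dbname="analytics", user="admin", password="...",
)

with conn.cursor() as cur:
    # The external table points at files in S3; Redshift Spectrum scans
    # those files directly, with no loading step required.
    cur.execute("""
        SELECT event_date, COUNT(*)
        FROM spectrum_schema.page_views   -- hypothetical external table over S3 data
        WHERE event_date >= '2017-01-01'
        GROUP BY event_date
        ORDER BY event_date;
    """)
    for row in cur.fetchall():
        print(row)
```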
6. Amazon EMR – Hadoop, Spark, Presto in the Cloud
• Managed platform
• Launch a cluster in minutes
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
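To make "launch a cluster in minutes" and "save with Spot" concrete, here is a minimal boto3 sketch that starts a small transient Spark cluster with core nodes on Spot instances; the instance types, counts, bid price, and log bucket are illustrative assumptions:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient Spark cluster; because the data lives in S3,
# the cluster can be resized or terminated without losing anything.
response = emr.run_job_flow(
    Name="adhoc-spark",
    ReleaseLabel="emr-5.4.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-log-bucket/emr/",        # hypothetical bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m4.large", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m4.large", "InstanceCount": 4,
             "Market": "SPOT", "BidPrice": "0.10"},  # pay by the hour, save with Spot
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
)
print(response["JobFlowId"])
```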
7. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL
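A minimal sketch of driving Athena from code with boto3: submit a query, poll until it reaches a terminal state, then read the results. The database, table, and results bucket names are assumptions for illustration:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Athena reads the files in S3 directly; query results are also written to S3.
query = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) FROM events.page_views GROUP BY action",
    QueryExecutionContext={"Database": "events"},            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query succeeds, fails, or is cancelled.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```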
8. AWS Glue – Coming Soon
Data Catalog
Hive metastore compatible metadata repository of data sources.
Crawls data source to infer table, data type, partition format.
Job Execution
Runs jobs in Spark containers – automatic scaling based on SLA.
Glue is serverless - only pay for the resources you consume.
Job Authoring
Generates Python code to move data from source to destination.
Edit with your favorite IDE; share code snippets using Git.
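Glue had not launched at the time of this talk, so purely as a forward-looking sketch using the boto3 Glue API as it eventually shipped: a crawler that scans an S3 path and populates the Data Catalog might look like this (crawler, role, database, and bucket names are hypothetical):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# A crawler scans the S3 path, infers schema and partitions, and registers
# tables in the Hive-compatible Data Catalog.
glue.create_crawler(
    Name="events-crawler",                       # hypothetical crawler name
    Role="AWSGlueServiceRole-events",            # IAM role Glue assumes
    DatabaseName="events",
    Targets={"S3Targets": [{"Path": "s3://my-data-bucket/events/"}]},
)
glue.start_crawler(Name="events-crawler")
```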
9. Security
Identity and Access Management (IAM) policies
Bucket policies
Access Control Lists (ACLs)
Private VPC endpoints to Amazon S3
Pre-signed S3 URLs
Encryption
SSL endpoints
Server-Side Encryption (SSE-S3)
S3 Server-Side Encryption with provided keys (SSE-C, SSE-KMS)
Client-side Encryption
Audit & Compliance
Bucket access logs
Lifecycle Management Policies
Versioning & MFA deletes
Certifications – HIPAA, PCI, SOC 1/2/3, etc.
Implement the right cloud security controls
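Two of these controls are easy to show in code. A boto3 sketch that generates a pre-signed S3 URL and writes an object with server-side encryption; the bucket and key names are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Pre-signed URL: grants time-limited access to a private object
# without sharing credentials.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-private-bucket", "Key": "reports/2017-04.csv"},
    ExpiresIn=3600,  # valid for one hour
)
print(url)

# Server-side encryption: S3 encrypts the object at rest (SSE-S3 here;
# SSE-KMS would pass ServerSideEncryption="aws:kms" plus a key id).
s3.put_object(
    Bucket="my-private-bucket",
    Key="reports/2017-04.csv",
    Body=b"user_id,action\n42,page_view\n",
    ServerSideEncryption="AES256",
)
```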
10. How we’ve created and tamed (or been tamed by) the monster called big data
11. 이상현 Kurt Lee
Vingle Inc
https://www.vingle.net
iOS / Frontend / Backend
Technical Leader
kurt@vingle.net
https://github.com/breath103
12.
13. 1. How to collect
2. Where to store
3. How to query
From millions of clients, with 3 developers, without losing data, cheaply
Really stably, at TB scale
With plain SQL, hopefully
27. a. When Fluentd fails, the “service” goes down
b. Why are we spending EC2 on just a simple proxy API call? (see the sketch after this list)
c. Do you even know how many records are coming in?
d. Are we losing data or not?
e. Using Ruby just to make a network call? Really?
f. I want microservices
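These pain points are what pushed the collection path toward Lambda + Kinesis Firehose: no EC2 proxy to babysit, and delivery handling comes with the service. A minimal sketch of a Lambda handler behind API Gateway that simply forwards each event to Firehose; the stream and field names are assumptions:

```python
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # API Gateway proxy integration passes the request body as a string.
    body = json.loads(event["body"])

    # Forward the record to Firehose; it handles buffering, retries, and
    # delivery to S3, so there is no long-running "service" to keep alive.
    firehose.put_record(
        DeliveryStreamName="client-events",   # hypothetical stream name
        Record={"Data": (json.dumps(body) + "\n").encode("utf-8")},
    )
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```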
34. 1. Too much old data (older than 3 months)
2. We don’t use it “THAT” intensively
3. “Managing” a “cluster”
4. So many “simple aggregation” jobs
(count - group by - 1 day / 1 hour / 5 minutes)
5. Missing the Hadoop ecosystem
Even Redshift can get slow
44. If I need to rebuild “EVERYTHING”
1. Use Lambda / Kinesis Firehose to Collect Data
2. Put “EVERYTHING” at S3
3. Use Athena
1. Intensive / Frequent Query -> Redshift
2. Hadoop ecosystem -> Amazon EMR
3. Complicated data pipeline -> AWS Data Pipeline
1. Luigi / Oozie / Airflow..
48. Serverless characteristics
No servers to provision or manage
Scales with usage
Never pay for idle
Availability and fault tolerance built in
49. Simple Pricing - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset                               | Size on Amazon S3     | Query Run Time | Data Scanned          | Cost
Logs stored as text files             | 1.15 TB               | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
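The three cost levers above (compression, columnar formats, partitioning) all come down to how the table is laid out in S3. A hedged sketch of Athena DDL for a partitioned Parquet table, submitted through boto3; the bucket, database, table, and column names are illustrative:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# DDL statements are free in Athena; only the data a SELECT scans is billed.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS events.page_views_parquet (
    user_id BIGINT,
    action  STRING
)
PARTITIONED BY (dt STRING)            -- e.g. dt=2017-04-01
STORED AS PARQUET
LOCATION 's3://my-data-bucket/page_views_parquet/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# After new partitions land in S3, register them so queries can prune by dt
# and scan only the partitions (and columns) they actually need.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE events.page_views_parquet",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```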
50. Familiar Technologies Under the Covers
• Presto – used for SQL queries
  • In-memory distributed query engine
  • ANSI-SQL compatible with extensions
• Hive – used for DDL functionality
  • Complex data types
  • Multitude of formats
  • Supports data partitioning