In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
3. Data Is Changing → Analytics Are Adapting
Capture and store new data at PB–EB scale
Do new types of analytics in a cost-effective way
New types of analytics:
• Machine learning
• Big data processing
• Real-time analytics
• Full-text search
4. Most Important: Driving Value from Data
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
5. Traditionally, Analytics Used to Look Like This
OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
6. Data Lakes Extend the Traditional Approach
OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social
Data warehouse: business intelligence
Data lake: big data processing, real-time, machine learning
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
7. Data Lakes from AWS
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grained access
• Fastest performance by retrieving subsets of data
• The most ways to bring data in
• Analyze with the broadest set of analytics & ML services
Data Lake on AWS: Analytics | Machine learning | Real-time data movement | On-premises data movement
8. Data Lakes and Analytics Portfolio from AWS
Broadest, deepest set of analytic services
Machine learning: managed ML service, Deep Learning AMIs, video and image recognition, conversational interfaces, deep-learning video camera, natural language processing, language translation, speech recognition, text-to-speech
Analytics: interactive analysis, Hadoop & Spark, data warehousing, full-text search, real-time analytics, dashboards & visualizations
On-premises data movement: dedicated network connection, secure appliances, ruggedized shipping container, database migration
Real-time data movement: connect devices to AWS, real-time data streams, real-time video streams
Data Lake on AWS: Storage & Archival Storage | Data Catalog
9. Data Lakes and Analytics Portfolio from AWS
Broadest, deepest set of analytic services
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Lake on AWS: Amazon S3 | AWS Glue
12. AWS Glue
Data Catalog (Discover)
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
13. What can crawlers discover?
AWS Glue Crawler (IAM role)
• JDBC connection: databases, Amazon Redshift
• Object connection: Amazon S3
• NoSQL connection: Amazon DynamoDB
Built-in classifiers
• MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift
• Avro, Parquet, ORC, XML, JSON & JSONPaths, AWS CloudTrail, BSON
• Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• < ALWAYS GROWING… >
Create additional custom classifiers
14. Data Lake on Amazon S3 with AWS Glue
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
Amazon SageMaker
15. Other Ways of Populating the Catalog
• Call the AWS Glue CreateTable API
• Create table manually
• Run Hive DDL statement
• Import from Apache Hive Metastore (Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog)
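As a rough illustration of the CreateTable route, the sketch below builds the kind of table definition that call expects for a Parquet dataset in Amazon S3. The table name `clicks`, the bucket `example-bucket`, the database `example_db`, and the helper `build_table_input` are hypothetical names chosen for the example; only the dictionary is built here, and nothing is sent to AWS.

```python
def build_table_input(name, location, columns, partition_keys=()):
    """Build a TableInput-style dict for a Parquet table stored in S3."""
    def cols(pairs):
        return [{"Name": n, "Type": t} for n, t in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": cols(partition_keys),
        "StorageDescriptor": {
            "Columns": cols(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical table: names and S3 path are placeholders.
table_input = build_table_input(
    "clicks",
    "s3://example-bucket/curated/clicks/",
    columns=[("user_id", "string"), ("url", "string")],
    partition_keys=[("dt", "string")],
)

# Assumption: with boto3 installed and credentials configured, this could be sent as
#   boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table_input)
```

Keeping the definition as plain data makes it easy to review or version before any API call is made.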
16. But I have my own data formats …?
− There is a custom classifier for that …
Row-based: Grok classifier
A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
XML: XML classifier
XML tag that defines a table row in the XML document.
JSON: JSON classifier
JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators.
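To make the row-based idea concrete, here is a minimal sketch of grok-style matching in plain Python: named patterns are expanded into a regular expression, and the data is matched one line at a time. This is not AWS Glue's implementation; the tiny `PATTERNS` library and the helper names are assumptions made for the example.

```python
import re

# Hypothetical mini pattern library: pattern name -> regex fragment.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
    "URIPATH": r"/\S*",
}


def grok_to_regex(grok):
    """Expand %{PATTERN:field} references into named regex groups."""
    def expand(match):
        return "(?P<%s>%s)" % (match.group(2), PATTERNS[match.group(1)])

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, grok))


def classify_lines(grok, lines):
    """Match data one line at a time; each matching line becomes one row."""
    rx = grok_to_regex(grok)
    rows = []
    for line in lines:
        m = rx.fullmatch(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows


rows = classify_lines(
    "%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}",
    ["10.0.0.1 GET /index.html 200", "not a log line"],
)
```

Lines that do not match the pattern are simply skipped, so `rows` holds one dictionary per recognized log line, keyed by the field names given in the pattern.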
18. How do I drive value?
19. Ingest data based on the type of data
Open and comprehensive
• Data movement from on-premises datacenters
  • Dedicated network connection (AWS Direct Connect)
  • Secure appliances (AWS Snowball)
  • Ruggedized shipping container (AWS Snowmobile)
  • Database migration (AWS Database Migration Service)
  • Gateway that lets applications write to the cloud (AWS Storage Gateway)
• Data movement from real-time sources
  • Connect devices to AWS (AWS IoT Core)
  • Real-time data streams (Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams)
  • Real-time video streams (Amazon Kinesis Video Streams)
Amazon S3 | Amazon Glacier | AWS Glue
20. Real-time data movement and Data Lakes on AWS
Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Data Lake on AWS: Amazon S3 (data), AWS Glue Data Catalog (data definition)
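For the AWS SDK path into Amazon Kinesis Data Firehose, a producer mostly needs to serialize events and group them into batches. The sketch below shows one way to do that with newline-delimited JSON; the helper names and the stream name in the comment are hypothetical, and the batch size of 500 reflects the per-call record limit of the batch API as I understand it, so treat it as an assumption to verify.

```python
import json


def to_firehose_records(events):
    """Serialize each event as one newline-terminated JSON record."""
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def batches(records, size=500):
    """Yield chunks of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


records = to_firehose_records([{"b": 1, "a": 2}])

# Assumption: with boto3 installed, each batch could be delivered with
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="example-stream", Records=batch)
```

The trailing newline matters when the delivery stream writes to Amazon S3, because it keeps one JSON object per line in the resulting files.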
21. IMPORTANT: Ingest data in its raw form …
Open and comprehensive
• Store the data in its raw form BEFORE:
  • Transforming
  • Analyzing
  • Manipulating
  • Doing … anything … to it
• This becomes your source of record you can always go back to …
• Lifecycle policies allow you to shift it to warm and cold storage.
Formats: CSV, ORC, Grok, Avro, Parquet, JSON
Amazon S3 | Amazon Glacier | AWS Glue
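A lifecycle policy of the kind mentioned above can be expressed as a small configuration document. The sketch below builds one that moves objects under a raw-data prefix to an infrequent-access class and later to Amazon Glacier. The prefix, rule ID, day counts, and the helper name `raw_zone_lifecycle` are illustrative assumptions, not recommendations from the talk.

```python
def raw_zone_lifecycle(prefix, warm_after_days=30, cold_after_days=365):
    """Build an S3 lifecycle configuration that tiers raw data over time."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-raw-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},  # warm
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},      # cold
                ],
            }
        ]
    }


config = raw_zone_lifecycle("raw/")

# Assumption: with boto3 installed, this could be applied with
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Because the rule only transitions objects and never expires them, the raw copy remains available as the source of record.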
22. Datasets in the Lake
Raw datasets – immutable datasets that you can always go back to.
• Abstract out the complexities of how the data is stored through the catalog and SerDes
Optimizing Analytics and Machine Learning:
Curated datasets – query-optimized for consumption across a wide number of tools
25. Different tools for different users … solving different problems
Business
Reporting
Data Scientists
Data Engineer
IDE
Data Catalog
Data Lake
Central Storage
Amazon SageMaker
Machine Learning/Deep Learning
26. How Do I Drive Value?
28. Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
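The split between DDL and queries, and the role of partitioning, can be illustrated with a small helper that generates a Hive-style `CREATE EXTERNAL TABLE` statement of the kind Amazon Athena accepts. The table, columns, and S3 path are hypothetical, and the helper itself is only a sketch for building the statement text.

```python
def external_table_ddl(table, columns, partitions, location):
    """Build Hive-style DDL for a partitioned Parquet table in S3."""
    cols = ",\n  ".join("`%s` %s" % c for c in columns)
    parts = ", ".join("`%s` %s" % p for p in partitions)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n  %s\n)\n"
        "PARTITIONED BY (%s)\n"
        "STORED AS PARQUET\n"
        "LOCATION '%s'" % (table, cols, parts, location)
    )


ddl = external_table_ddl(
    "clicks",
    columns=[("user_id", "string"), ("url", "string")],
    partitions=[("dt", "string")],
    location="s3://example-bucket/curated/clicks/",
)

# A query that filters on the partition column only needs to read matching prefixes.
query = "SELECT url, count(*) AS hits FROM clicks WHERE dt = '2018-01-01' GROUP BY url"
```

Filtering on `dt` in the `WHERE` clause is what lets the engine skip data outside the requested partition.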
29. Exploring Data with Amazon Athena
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
32. Hadoop/Spark Analytics on AWS
Amazon EMR: Managed Hadoop/Spark
Batch | Script | Interactive | Real-time | Machine learning | NoSQL
YARN (Hadoop Resource Manager)
Data Lake on AWS: Amazon S3 (Object Storage)
33. Amazon S3 – Source of Truth, Multiple Clusters
Amazon S3: Source of Truth
Amazon EMR: Interactive Spark Cluster (EC2 instance memory; intermediates stored on local disk or HDFS)
Amazon EMR: Transient ETL Job (EC2 instance memory; intermediates stored on local disk or HDFS)
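Because Amazon S3 holds the source of truth, an ETL cluster can be created for a single job and terminated afterwards. The sketch below builds a request for such a transient cluster with one Spark step. The release label, instance types, role names, and S3 paths are placeholder assumptions to adjust for a real account; the code only builds the request dictionary.

```python
def transient_etl_cluster(name, script_uri, log_uri):
    """Build a request for an EMR cluster that runs one Spark step, then terminates."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # assumed release; pick one available to you
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate once all steps finish: the data lives in S3, not on the cluster.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_etl_cluster(
    "nightly-etl",
    "s3://example-bucket/jobs/etl.py",
    "s3://example-bucket/logs/",
)

# Assumption: with boto3 installed, this could be submitted with
#   boto3.client("emr").run_job_flow(**request)
```

A long-running interactive cluster would differ mainly in setting `KeepJobFlowAliveWhenNoSteps` to `True` and omitting the step.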
35. Data processing with Amazon EMR (Spark)
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
40. Machine Learning with Amazon SageMaker
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight
Amazon SageMaker
41. Agility and Innovation Are Key