© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Alex Coqueiro
Public Sector Solutions Architecture Team
Amazon Web Services
BDA305
Build Data Lakes and Analytics on AWS:
Patterns & Best Practices
Big Data Is Defined Many Different Ways
Volume, Velocity, Variety, Veracity, Value, Visualization, Variability
Data Is Changing → Analytics Are Adopting
Capture and store new data at PB–EB scale
Run new types of analytics in a cost-effective way:
• Machine learning
• Big data processing
• Real-time analytics
• Full-text search
Organizations that successfully generate business
value from their data will outperform their peers. An
Aberdeen survey saw organizations who implemented
a data lake outperforming similar companies by 9% in
organic revenue growth.*
Organic revenue growth: leaders 24%, followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Most Important: Driving Value from Data
Traditionally, Analytics Used to Look Like This
OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
Data Lakes Extend the Traditional Approach
OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence
Devices, Web, Sensors, Social → Data lake → Big data processing, real-time, machine learning
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Data Lakes from AWS
• Unmatched durability and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grained access
• Fastest performance by retrieving subsets of data
• The most ways to bring data in
• Analyze with the broadest set of analytics & ML services
[Diagram: Data Lake on AWS, surrounded by Analytics, Machine learning, On-premises data movement, and Real-time data movement]
Machine learning: Managed ML Service, Deep Learning AMIs, Video and Image Recognition, Conversational Interfaces, Deep-Learning Video Camera, Natural Language Processing, Language Translation, Speech Recognition, Text-to-Speech
Analytics: Interactive Analysis, Hadoop & Spark, Data Warehousing, Full-text search, Real-time analytics, Dashboards & Visualizations
On-premises data movement: Dedicated Network connection, Secure appliances, Ruggedized Shipping Container, Database migration
Real-time data movement: Connect Devices to AWS, Real-time Data Streams, Real-time Video Streams
Data Lake on AWS: Storage & Archival Storage | Data Catalog
Data Lakes and Analytics Portfolio from AWS
Broadest, deepest set of analytic services
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service
Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Lake on AWS: Amazon S3 | AWS Glue
What data do I have?
Gartner: "Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient."
What Data Do I Have?
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
AWS Glue
Data Catalog (Discover):
• Apache Hive Metastore compatible
• Integrated with AWS services
• Automatic crawling
Job Authoring (Develop):
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and share
Job Execution (Deploy):
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
What can crawlers discover?
AWS Glue crawler: IAM role, databases
Connections:
• JDBC connection (Amazon Redshift)
• Object connection (Amazon S3)
• NoSQL connection (Amazon DynamoDB)
Built-in classifiers:
• MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift
• Avro, Parquet, ORC, XML, JSON & JSONPaths, BSON
• AWS CloudTrail
• Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• Always growing …
Create additional custom classifiers
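A crawler like the one on this slide can be sketched with boto3. This is a minimal sketch, not the presenter's setup: the crawler name, role ARN, database name, and S3 path are hypothetical placeholders, and the boto3 call needs AWS credentials, so it is kept in a separate function that is not invoked here.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the AWS Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler, then start it (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="raw-clickstream-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="datalake_raw",
    s3_path="s3://example-datalake/raw/clickstream/",
)
print(request["Targets"])
```

Keeping the request as a plain dictionary makes it easy to review or version before anything is created in the account.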
Data Lake on Amazon S3 with AWS Glue
Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data
Amazon QuickSight, Amazon SageMaker
Other Ways of Populating the Catalog
• Call the AWS Glue CreateTable API
• Create table manually
• Run Hive DDL statement
• Import from Apache Hive Metastore (Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog)
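As an illustration of the CreateTable route, the sketch below builds a `TableInput` structure for a partitioned Parquet table and passes it to boto3's `create_table`. The table, database, columns, and S3 location are hypothetical; only the request-building function runs here, since the API call needs AWS credentials.

```python
def build_table_input(name, location, columns, partition_keys):
    """Build a Glue TableInput for an external Parquet table on Amazon S3."""
    def as_columns(pairs):
        return [{"Name": col, "Type": col_type} for col, col_type in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def create_table(database, table_input):
    """Register the table in the Data Catalog (requires boto3 and credentials)."""
    import boto3  # lazy import: the builder above has no AWS dependency

    boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical table definition.
table_input = build_table_input(
    name="clickstream",
    location="s3://example-datalake/curated/clickstream/",
    columns=[("client", "string"), ("path", "string"), ("status", "int")],
    partition_keys=[("year", "string"), ("month", "string")],
)
print([c["Name"] for c in table_input["StorageDescriptor"]["Columns"]])
```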
But I have my own data formats …?
− There is a custom classifier for that …
• Row-based: Grok classifier. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time.
• XML: XML classifier. Specify the XML tag that defines a table row in the XML document.
• JSON: JSON classifier. Specify the JSON path to the object, array, or value that defines a row of the table being created. Enter the path in either dot or bracket JSON syntax using AWS Glue supported operators.
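To make the grok idea concrete, here is a small stand-alone imitation in plain Python: named regex building blocks are expanded into one pattern and applied one line at a time. This is not the AWS Glue classifier itself; the pattern names, the `%{NAME:field}` expansion helper, and the log layout are assumptions chosen for illustration.

```python
import re

# Named building blocks, similar in spirit to grok's %{NAME:field} references.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "PATH": r"/\S*",
    "INT": r"\d+",
}


def compile_grok(expression):
    """Expand %{NAME:field} references into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, PATTERNS[name])

    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))


LINE = compile_grok(r"%{IP:client} %{WORD:method} %{PATH:path} %{INT:status}")


def parse_lines(lines):
    """Match each line independently; lines that do not match are skipped."""
    rows = []
    for line in lines:
        match = LINE.fullmatch(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


rows = parse_lines(["10.0.0.1 GET /index.html 200", "not a log line"])
print(rows)
```

Each matched line becomes one row, and the group names become the column names, which is the role a custom grok classifier plays when a crawler infers a table schema.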
How do I hydrate my Data Lake?
How do I drive value?
[Diagram: the AWS data lakes and analytics portfolio (machine learning, analytics, traditional data movement, real-time data movement) around the Data Lake on AWS (Storage | Archival Storage | Data Catalog)]
Ingest data based on the type of data
Open and comprehensive
• Data movement from your datacenters
  • Dedicated network connection: AWS Direct Connect
  • Secure appliances: AWS Snowball
  • Ruggedized shipping container: AWS Snowmobile
  • Database migration: AWS Database Migration Service
  • Gateway that lets applications write to the cloud: AWS Storage Gateway
• Data movement from real-time sources
  • Connect devices to AWS: AWS IoT Core
  • Real-time data streams: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams
  • Real-time video streams: Amazon Kinesis Video Streams
Data Lake on AWS: Amazon S3, Amazon Glacier, AWS Glue
Real-time data movement and Data Lakes on AWS
Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library
Streaming: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose
Data Lake on AWS: Amazon S3 data, with the data definition in the AWS Glue Data Catalog
Amazon S3, Amazon Glacier, AWS Glue
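One way a producer could push events toward the lake is through a Kinesis Data Firehose delivery stream that lands in Amazon S3. The sketch below assumes such a delivery stream already exists; its name and the event fields are hypothetical. Only the serialization helper runs here, because `put_record` needs AWS credentials.

```python
import json


def encode_record(event):
    """Serialize one event as newline-delimited JSON bytes.

    The trailing newline keeps records separable once Firehose concatenates
    them into a single S3 object.
    """
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def send_event(stream_name, event):
    """Send one record to a delivery stream (requires boto3 and credentials)."""
    import boto3  # lazy import so encode_record works without AWS access

    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )


# Hypothetical sensor reading.
payload = encode_record({"device": "sensor-1", "temp_c": 21.5})
print(payload)
```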
IMPORTANT: Ingest data in its raw form …
Open and comprehensive
• Store the data in its raw form BEFORE:
  • Transforming
  • Analyzing
  • Manipulating
  • Doing … anything … to it
• Formats: CSV, ORC, Grok, Avro, Parquet, JSON
• This becomes your source of record you can always go back to …
• Lifecycle policies allow you to shift it to warm and cold storage.
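A lifecycle policy of the kind mentioned above might look like the following sketch: raw objects under a prefix move to an infrequent-access class and later to Amazon Glacier. The bucket name, prefix, and day thresholds are assumptions for illustration, and the call that applies the policy is defined but not run.

```python
def build_lifecycle(prefix, warm_after_days, cold_after_days):
    """Build an S3 lifecycle configuration: warm storage first, then cold."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Attach the policy to a bucket (requires boto3 and credentials)."""
    import boto3  # lazy import: building the policy needs no AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


# Hypothetical thresholds: warm after 30 days, cold after 90 days.
policy = build_lifecycle("raw/", warm_after_days=30, cold_after_days=90)
print(policy["Rules"][0]["Transitions"])
```

The raw objects are never rewritten by this policy; only their storage class changes, so the source of record stays intact.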
Datasets in the Lake
Raw datasets – immutable datasets that you can always go back to.
• Abstract out the complexities of how the data is stored through the catalog and SerDes
Optimizing analytics and machine learning:
Curated datasets – query-optimized for consumption across a wide number of tools
Preparing raw data for consumption
Extract – Load – Transform (ELT)
Raw data stored in Data Lake → Preparation:
• Normalized
• Partitioned
• Compressed
• Storage optimized
Data Lake on AWS: Raw ingestion → ELT → Curated datasets, described in the Data Catalog
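The preparation steps above (partitioned, compressed) can be sketched locally with the standard library. This is only an illustration: a real pipeline would more likely write a columnar format such as Parquet or ORC to Amazon S3, whereas this sketch writes gzip-compressed JSON lines into Hive-style `year=/month=` directories. The record layout and file names are assumptions.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict


def write_curated(records, root):
    """Group records by (year, month) and write one gzip file per partition."""
    groups = defaultdict(list)
    for rec in records:
        year, month, _day = rec["date"].split("-")
        groups[(year, month)].append(rec)

    written = []
    for (year, month), recs in sorted(groups.items()):
        # Hive-style partition directories let query engines prune by path.
        part_dir = os.path.join(root, "year=" + year, "month=" + month)
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "part-0000.json.gz")
        with gzip.open(path, "wt", encoding="utf-8") as out:
            for rec in recs:
                out.write(json.dumps(rec) + "\n")
        written.append(path)
    return written


# Hypothetical raw records; the raw copy itself is left untouched.
raw = [
    {"id": 1, "date": "2018-05-30"},
    {"id": 2, "date": "2018-06-01"},
    {"id": 3, "date": "2018-06-02"},
]
root = tempfile.mkdtemp()
paths = write_curated(raw, root)
print(paths)
```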
Which tool should I use to analyze my
data?
Different tools for different users … solving different problems
Business
Reporting
Data Scientists
Data Engineer
IDE
Data
Catalog
Data Lake
Central Storage
SageMaker: Machine Learning/Deep Learning
How Do I Drive Value?
[Diagram: the AWS data lakes and analytics portfolio (machine learning, analytics, on-premises data movement, real-time data movement) around the Data Lake on AWS (Amazon S3 | AWS Glue)]
Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
• Query instantly: zero setup cost; just point to Amazon S3 and start querying
• Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression
• Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: serverless, with zero infrastructure and zero administration; integrated with Amazon QuickSight
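Submitting a query to Athena from code could look roughly like the sketch below. The table, database, partition columns (`year`, `month`), and results location are hypothetical; filtering on partition columns is shown because it limits the data a query has to scan. The boto3 call is defined but not executed, since it needs AWS credentials.

```python
def build_query(table, year, month):
    """Build a SQL statement that filters on assumed partition columns."""
    return (
        "SELECT status, COUNT(*) AS hits "
        "FROM {table} "
        "WHERE year = '{year:04d}' AND month = '{month:02d}' "
        "GROUP BY status"
    ).format(table=table, year=year, month=month)


def build_athena_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Start the query and return its execution ID (requires boto3)."""
    import boto3  # lazy import: the builders above need no AWS access

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names throughout.
sql = build_query("clickstream", year=2018, month=6)
request = build_athena_request(
    sql, database="datalake_curated", output_location="s3://example-athena-results/"
)
print(request["QueryString"])
```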
Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
Exploring Data with Amazon Athena
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight, Amazon SageMaker
Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
• Latest versions: updated with the latest open source frameworks within 30 days of release
• Low cost: flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%–80%
• Use Amazon S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector
• Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
EMR – Enterprise - Hadoop & Spark
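Deploy latest releases in Hadoop and Spark ecosystems
• emr-4.0.0 (July 2015): Hadoop 2.6.0, Hive & Catalog 1.0.0, Mahout 0.10.0, Pig 0.14.0, Spark 1.4.1
• emr-4.7.0 (June 2016): Hadoop 2.7.2, Ganglia 3.7.2, HBase 1.2.1, Hive & Catalog 1.0.0, Hue 3.7.1, Mahout 0.12.0, Oozie 4.2.0, Phoenix 4.7.0, Pig 0.14.0, Presto 0.147, Spark 1.6.1, Sqoop 1.4.6, Tez 0.8.3, Zeppelin 0.5.6, Zookeeper 3.4.8
• emr-5.3.0 (January 2017): Hadoop 2.7.3, Ganglia 3.7.2, HBase 1.2.3 + S3, Hive & Catalog 2.1.1, Hue 3.11.0, Mahout 0.12.2, Oozie 4.3.0, Phoenix 4.7.0, Pig 0.16.0, Presto 0.157.1, Spark 2.1.0, Sqoop 1.4.6, Tez 0.8.4, Zeppelin 0.6.2, Zookeeper 3.4.9, Flink 1.1.4
• emr-5.14.0 (June 2018): Hadoop 2.8.3, Ganglia 3.7.2, HBase 1.4.2 + S3, Hive & Catalog 2.3.2, Hue 4.1.0, Mahout 0.13.0, Oozie 4.3.0, Phoenix 4.13.0, Pig 0.17.0, Presto 0.194, Spark 2.3.0, Sqoop 1.4.7, Tez 0.8.4, Zeppelin 0.7.3, Zookeeper 3.4.10, Flink 1.4.2, Livy 0.4.0, MXNet 1.1.0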
EMR releases
• Nineteen open-source
projects: Apache Hadoop,
Spark, HBase, Presto, and
more
• Updated with the latest
open source frameworks
within 30 days of release
Hadoop/Spark Analytics on AWS
Workloads: Batch, Script, Interactive, Real-time, Machine learning, NoSQL
YARN (Hadoop Resource Manager)
Amazon EMR: Managed Hadoop/Spark
Amazon S3: Object Storage (Data Lake on AWS)
Amazon S3 – Source of Truth, Multiple Clusters
• Amazon S3: source of truth
• Interactive Spark cluster (Amazon EMR): EC2 instance memory and local HDFS; intermediates stored on local disk or HDFS
• Transient ETL job (Amazon EMR): EC2 instance memory and local HDFS; intermediates stored on local disk or HDFS
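A transient ETL cluster of this kind could be requested roughly as follows: the cluster runs one Spark step against data in Amazon S3 and terminates when the step finishes, because S3 rather than HDFS holds the source of truth. Instance types, counts, the script location, and the use of Spot for core nodes are assumptions; the role names are the EMR default role names. The `run_job_flow` call is defined but not executed.

```python
def build_transient_job_flow(name, release_label, script_uri, log_uri):
    """Build keyword arguments for the EMR RunJobFlow API (one Spark step)."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m4.large", "InstanceCount": 1},
                # Assumption: Spot capacity for core nodes to reduce cost.
                {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
                 "InstanceType": "m4.large", "InstanceCount": 2},
            ],
            # Transient cluster: shut down once all steps have completed.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(job_flow):
    """Start the cluster and return its ID (requires boto3 and credentials)."""
    import boto3  # lazy import so the builder runs without AWS access

    return boto3.client("emr").run_job_flow(**job_flow)["JobFlowId"]


# Hypothetical script and log locations.
job_flow = build_transient_job_flow(
    name="nightly-etl",
    release_label="emr-5.14.0",
    script_uri="s3://example-datalake/jobs/etl.py",
    log_uri="s3://example-datalake/logs/emr/",
)
print(job_flow["Steps"][0]["HadoopJarStep"]["Args"])
```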
Fitting this into the Common Data Catalog
• Describes the data: AWS Glue Data Catalog (unified data view)
• Stores the data: Amazon S3 (source of truth), MySQL DB instance, …
• Amazon EMR clusters (EMRFS, HDFS): interactive Spark cluster, transient ETL job
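For EMR clusters to share the AWS Glue Data Catalog as their metastore, the cluster configuration can point Hive and Spark at the Glue client factory. The sketch below shows that configuration fragment as a Python structure; it would be supplied as the `Configurations` argument when creating a cluster. Treat it as a sketch to check against the EMR release in use.

```python
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """EMR configuration entries that make Hive and Spark SQL use the Glue Data Catalog."""
    properties = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(properties)},
        {"Classification": "spark-hive-site", "Properties": dict(properties)},
    ]


configurations = glue_catalog_configurations()
print([c["Classification"] for c in configurations])
```

With this in place, an interactive cluster and a transient ETL cluster see the same table definitions, while each keeps its own local HDFS for intermediates.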
Data processing with Amazon EMR (Spark)
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight, Amazon SageMaker
What if I implement machine learning to
identify complex business insights?
Machine Learning on Your Data Lake
[Diagram: the AWS data lakes and analytics portfolio (machine learning, analytics, on-premises data movement, real-time data movement) around the Data Lake on AWS (Amazon S3 | AWS Glue)]
AWS Machine Learning
Application Services:
• Vision: Rekognition Image, Rekognition Video
• Speech: Polly, Transcribe
• Language: Translate, Comprehend, Lex
Platform Services: Amazon SageMaker
Frameworks & Infrastructure: TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras; GPU, CPU, Mobile, IoT (Greengrass)
Amazon SageMaker
1. Notebook Instances
2. Algorithms
3. ML Training Service
4. ML Hosting Service
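As a rough illustration of the training service reading directly from the data lake, the sketch below builds a request for the SageMaker CreateTrainingJob API with an S3 prefix as the training channel. The job name, container image URI, role ARN, bucket paths, and instance settings are all hypothetical placeholders, and the API call itself is defined but not run.

```python
def build_training_job(job_name, image_uri, role_arn, train_uri, output_uri):
    """Build keyword arguments for the SageMaker CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,       # algorithm container image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_uri,    # curated dataset in the lake
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m4.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def start_training(request):
    """Submit the training job (requires boto3 and AWS credentials)."""
    import boto3  # lazy import so the builder runs without AWS access

    return boto3.client("sagemaker").create_training_job(**request)


# Every value here is a hypothetical placeholder.
training_job = build_training_job(
    job_name="churn-model-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algorithm:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_uri="s3://example-datalake/curated/churn/train/",
    output_uri="s3://example-datalake/models/churn/",
)
print(training_job["InputDataConfig"][0]["ChannelName"])
```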
Machine Learning with Amazon SageMaker
On-premises data
Web app data
Amazon RDS
Other databases
Streaming data
Amazon QuickSight, Amazon SageMaker
Agility and Innovation Are Key
[Diagram: the AWS data lakes and analytics portfolio (machine learning, analytics, on-premises data movement, real-time data movement) around the Data Lake on AWS (Amazon S3 | AWS Glue)]
Session: Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Toronto AWS Summit
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSAmazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Amazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Amazon Web Services
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSAmazon Web Services
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
Serverless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsServerless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsKristana Kane
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudAmazon Web Services
 
BDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practicesBDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practicesAmazon Web Services
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 

Ähnlich wie Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Toronto AWS Summit (20)

BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
Build Data Lakes & Analytics on AWS: Patterns & Best Practices - BDA305 - Ana...
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
Building a Data Processing Pipeline on AWS - AWS Summit SG 2017
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Serverless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data AnalyticsServerless Big Data Architectures: Serverless Data Analytics
Serverless Big Data Architectures: Serverless Data Analytics
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the Cloud
 
BDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practicesBDA303 Serverless big data architectures: Design patterns and best practices
BDA303 Serverless big data architectures: Design patterns and best practices
 
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate TorontoDatabase and Analytics on the AWS Cloud - AWS Innovate Toronto
Database and Analytics on the AWS Cloud - AWS Innovate Toronto
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWSConstruindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Toronto AWS Summit

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Alex Coqueiro Public Sector Solutions Architecture Team Amazon Web Services BDA305 Build Data Lakes and Analytics on AWS: Patterns & Best Practices
  • 2. Big Data Is Defined Many Different Ways: Volume, Velocity, Variety, Veracity, Value, Visualization, Variability
  • 3. Data Is Changing → Analytics Are Adapting. Capture and store new data at PB–EB scale. Do new types of analytics in a cost-effective way. New types of analytics: • Machine learning • Big data processing • Real-time analytics • Full-text search
  • 4. Most Important: Driving Value from Data. Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.* Organic revenue growth: Leaders 24%, Followers 15%. *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
  • 5. Traditionally, Analytics Used to Look Like This OLTP ERP CRM LOB Data warehouse Business intelligence • Relational data • TBs–PBs scale • Schema defined prior to data load • Operational reporting and ad hoc
  • 6. Data Lakes Extend the Traditional Approach. Sources: OLTP, ERP, CRM, LOB, plus Devices, Web, Sensors, Social. Data warehouse: business intelligence. Data lake: big data processing, real-time, machine learning. • Relational and nonrelational data • TBs–EBs scale • Diverse analytical engines • Low-cost storage & analytics
  • 7. Data Lakes from AWS. • Unmatched durability and availability at EB scale • Best security, compliance, and audit capabilities • Object-level controls for fine-grain access • Fastest performance by retrieving subsets of data • The most ways to bring data in • Analyze with broadest set of analytics & ML services. Diagram: Data Lake on AWS, surrounded by Analytics, Machine learning, Real-time data movement, and On-premises data movement
  • 8. Data Lakes and Analytics Portfolio from AWS. Broadest, deepest set of analytic services. Machine learning: Managed ML Service, Deep Learning AMIs, Video and Image Recognition, Conversational Interfaces, Deep-Learning Video Camera, Natural Language Processing, Language Translation, Speech Recognition, Text-to-Speech. Analytics: Interactive Analysis, Hadoop & Spark, Data Warehousing, Full-text search, Real-time analytics, Dashboards & Visualizations. On-premises data movement: Dedicated Network connection, Secure appliances, Ruggedized Shipping Container, Database migration. Real-time data movement: Connect Devices to AWS, Real-time Data Streams, Real-time Video Streams. Data Lake on AWS: Storage & Archival Storage | Data Catalog
  • 9. Data Lakes and Analytics Portfolio from AWS. Broadest, deepest set of analytic services. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue
  • 10. What data do I have?
  • 11. What Data Do I Have? Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 12. AWS Glue. Data Catalog (Discover): Apache Hive Metastore compatible; integrated with AWS services; automatic crawling. Job Authoring (Develop): auto-generates ETL code; Python and Apache Spark; edit, debug, and share. Job Execution (Deploy): serverless execution; flexible scheduling; monitoring and alerting
  • 13. What can crawlers discover? An AWS Glue crawler (with an IAM role) connects to databases and Amazon Redshift (JDBC connection), Amazon S3 (object connection), and Amazon DynamoDB (NoSQL connection). Built-in classifiers: MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift, Avro, Parquet, ORC, XML, JSON & JSONPaths, AWS CloudTrail, BSON, Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others), Delimited (comma, pipe, tab, semicolon) <ALWAYS GROWING…>. Create additional custom classifiers
  • 14. Data Lake on Amazon S3 with AWS Glue. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data. Consumers: Amazon QuickSight, Amazon SageMaker
  • 15. Other Ways of Populating the Catalog: call the AWS Glue CreateTable API; create table manually; run Hive DDL statement; import from Apache Hive Metastore (Apache Hive Metastore → AWS Glue ETL → AWS Glue Data Catalog)
  • 16. But I have my own data formats …? There is a custom classifier for that … Row-based: GROK classifier. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. XML: XML classifier. The XML tag that defines a table row in the XML document. JSON: JSON classifier. The JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators
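
The grok idea on slide 16 (a named set of regular expressions applied one line at a time) can be sketched in plain Python. This is only an illustration of the concept, not AWS Glue's implementation: the pattern names in `PATTERNS`, the `%{NAME:field}` template, and the sample log layout are assumptions made for the example.

```python
import re

# Hypothetical grok-like building blocks: pattern name -> regular expression.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "INT": r"\d+",
}


def compile_grok(template):
    """Expand %{NAME:field} placeholders into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, PATTERNS[name])

    return re.compile("^" + re.sub(r"%\{(\w+):(\w+)\}", expand, template) + "$")


def classify(lines, template):
    """Match every line against the template; return rows, or None on a miss."""
    rx = compile_grok(template)
    rows = []
    for line in lines:
        m = rx.match(line)
        if m is None:
            return None  # this classifier does not recognize the format
        rows.append(m.groupdict())
    return rows


# Assumed sample layout: client IP, HTTP method, status code.
TEMPLATE = "%{IP:client} %{WORD:method} %{INT:status}"
print(classify(["10.0.0.1 GET 200"], TEMPLATE))
```

Each matched line becomes one row whose column names come from the field names in the template, which is the role a row-based classifier plays when a crawler infers a table schema.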
  • 17. How do I hydrate my Data Lake?
  • 18. How do I drive value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog
  • 19. Ingest data based on the type of data. Open and comprehensive. Data movement from your datacenters: • Dedicated network connection (AWS Direct Connect) • Secure appliances (AWS Snowball) • Ruggedized shipping container (AWS Snowmobile) • Database migration (AWS Database Migration Service) • Gateway that lets applications write to the cloud (AWS Storage Gateway). Data movement from real-time sources: • Connect devices to AWS (AWS IoT Core) • Real-time data streams (Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams) • Real-time video streams (Amazon Kinesis Video Streams). Data lake: Amazon S3, Amazon Glacier, AWS Glue
  • 20. Real-time data movement and Data Lakes on AWS. Producers: Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library. Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose deliver data to the Data Lake on AWS: data in Amazon S3, data definition in the AWS Glue Data Catalog
  • 21. IMPORTANT: Ingest data in its raw form … Open and comprehensive (Amazon S3, Amazon Glacier, AWS Glue). Formats: CSV, ORC, Grok, Avro, Parquet, JSON. • Store the data in its raw form BEFORE: • Transforming • Analyzing • Manipulating • Doing … anything … to it • This becomes your source of record you can always go back to … • Lifecycle policies allow you to shift it to warm and cold storage.
  • 22. Datasets in the Lake. Raw datasets – immutable datasets that you can always go back to. Optimizing Analytics and Machine Learning: Curated datasets – query-optimized for consumption across a wide range of tools. • Abstract out the complexities of how the data is stored through the catalog and SerDes
  • 23. Preparing raw data for consumption. Extract – Load – Transform (ELT). Raw data stored in Data Lake. Preparation: Normalized, Partitioned, Compressed, Storage Optimized. Data Lake on AWS: Raw Ingestion → ELT → Curated Datasets, with the Data Catalog
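
The preparation steps named on slide 23 (normalize, partition, compress) can be illustrated with a small standard-library sketch. It is an assumption-laden stand-in, not the deck's pipeline: the `date` and `status` columns, the `curate` function, and the file name `part-0000.csv.gz` are hypothetical, and gzip-compressed CSV is used only to keep the example dependency-free. A storage-optimized curated dataset would more likely use a columnar format such as Parquet or ORC, written by AWS Glue or Spark.

```python
import csv
import gzip
import io
from collections import defaultdict
from pathlib import Path
from typing import List


def curate(raw_csv: str, out_dir: Path) -> List[Path]:
    """Turn raw CSV text into normalized, partitioned, compressed files."""
    partitions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Normalize: trim whitespace and lower-case the column names.
        row = {k.strip().lower(): v.strip() for k, v in row.items()}
        # Partition key: assumes an ISO date column named "date" (YYYY-MM-DD).
        year, month, _day = row["date"].split("-")
        partitions[(year, month)].append(row)

    written = []
    for (year, month), rows in sorted(partitions.items()):
        # Hive-style key=value directories, a layout that query engines
        # can map to table partitions.
        part_dir = out_dir / "year={}".format(year) / "month={}".format(month)
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0000.csv.gz"
        # Compress: write the partition as gzip-compressed CSV.
        with gzip.open(path, "wt", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

The raw input is never modified, which matches the advice on slides 21 and 22 to keep raw datasets immutable and build curated datasets next to them.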
  • 24. Which tool should I use to analyze my data?
  • 25. Different tools for different users … solving different problems. Users: Business Reporting, Data Scientists, Data Engineer (IDE). Machine Learning/Deep Learning: SageMaker. Shared foundation: Data Catalog, Data Lake Central Storage
  • 26. How Do I Drive Value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue
  • 27. Amazon Athena – interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, zero infrastructure, zero administration; integrated with Amazon QuickSight
  • 28. Familiar Technologies Under the Covers. Used for SQL queries (Presto): in-memory distributed query engine; ANSI-SQL compatible with extensions. Used for DDL functionality (Apache Hive): complex data types; multitude of formats; supports data partitioning
  • 29. Exploring Data with Amazon Athena. On-premises data, web app data, Amazon RDS, other databases, streaming data. Amazon QuickSight, Amazon SageMaker
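
As a rough sketch of the Athena workflow on slides 27 to 29, the snippet below builds a request for the boto3 Athena client's `start_query_execution` call. The database, table, column, and bucket names are placeholders invented for the example, and the query assumes a table partitioned by `year` and `month`. Filtering on partition columns limits the data scanned, which matters under the pay-per-query model described on slide 27.

```python
def build_query_request(database, table, year, month, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    # Hypothetical schema: a "status" column plus year/month partition columns.
    sql = (
        "SELECT status, COUNT(*) AS hits FROM {} "
        "WHERE year = '{}' AND month = '{}' "
        "GROUP BY status"
    ).format(table, year, month)
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query; needs boto3 and AWS credentials, so import lazily."""
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Placeholder names; nothing here comes from the deck itself.
request = build_query_request(
    "weblogs_db", "curated_logs", "2018", "06",
    "s3://example-bucket/athena-results/",
)
```

Athena runs queries asynchronously, so a caller would poll `get_query_execution` with the returned ID until the query finishes, then read the results from the output location.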
  • 30. Amazon EMR – big data processing. Analytics and ML at scale: 19 open-source projects (Apache Hadoop, Spark, HBase, Presto, and more); enterprise-grade security. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%–80%. Use Amazon S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector. Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
  • 31. EMR – Enterprise – Hadoop & Spark. Deploy latest releases in the Hadoop and Spark ecosystems. EMR releases: emr-4.0.0 (July 2015): Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Spark 1.4.1; emr-4.7.0 (June 2016): Hadoop 2.7.2, Ganglia 3.7.2, HBase 1.2.1, Hive 1.0.0, Hue 3.7.1, Mahout 0.12.0, Oozie 4.2.0, Phoenix 4.7.0, Pig 0.14.0, Presto 0.147, Spark 1.6.1, Sqoop 1.4.6, Tez 0.8.3, Zeppelin 0.5.6, ZooKeeper 3.4.8; emr-5.3.0 (January 2017): Hadoop 2.7.3, Ganglia 3.7.2, HBase 1.2.3 + S3, Hive 2.1.1, Hue 3.11.0, Mahout 0.12.2, Oozie 4.3.0, Phoenix 4.7.0, Pig 0.16.0, Presto 0.157.1, Spark 2.1.0, Sqoop 1.4.6, Tez 0.8.4, Zeppelin 0.6.2, ZooKeeper 3.4.9, Flink 1.1.4; emr-5.14.0 (June 2018): Hadoop 2.8.3, Ganglia 3.7.2, HBase 1.4.2 + S3, Hive 2.3.2, Hue 4.1.0, Mahout 0.13.0, Oozie 4.3.0, Phoenix 4.13.0, Pig 0.17.0, Presto 0.194, Spark 2.3.0, Sqoop 1.4.7, Tez 0.8.4, Zeppelin 0.7.3, ZooKeeper 3.4.10, Flink 1.4.2, Livy 0.4.0, MXNet 1.1.0. • Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more • Updated with the latest open source frameworks within 30 days of release
  • 32. Hadoop/Spark Analytics on AWS. Amazon EMR (managed Hadoop/Spark): Batch, Script, Interactive, Real-time, Machine learning, and NoSQL workloads on YARN (Hadoop Resource Manager). Data Lake on AWS: Amazon S3 (object storage)
  • 33. Amazon S3 – Source of Truth, Multiple Clusters. Amazon S3 is the source of truth for multiple Amazon EMR clusters: an interactive Spark cluster and a transient ETL job. Each cluster has its own HDFS and EC2 instance memory; intermediates are stored on local disk or HDFS (local intermediate HDFS/storage)
  • 34. Fitting this into the Common Data Catalog. Amazon S3 (source of truth) stores the data … The AWS Glue Data Catalog describes the data and gives a unified data view to the Amazon EMR clusters (interactive Spark cluster and transient ETL job, each with EMRFS and HDFS). MySQL DB instance
  • 35. Data processing with Amazon EMR (Spark). On-premises data, web app data, Amazon RDS, other databases, streaming data. Amazon QuickSight, Amazon SageMaker
  • 36. What if I implement machine learning to identify complex business insights?
  • 37. Machine Learning on Your Data Lake. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue
  • 38. AWS Machine Learning. Application Services: Vision (Rekognition Image, Rekognition Video), Speech (Polly, Transcribe), Language (Lex, Translate, Comprehend). Platform Services: Amazon SageMaker. Frameworks & Infrastructure: TensorFlow, Apache MXNet, Gluon, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras; GPU, CPU, Mobile, IoT (Greengrass)
  • 39. Amazon SageMaker: 1. Notebook Instances 2. Algorithms 3. ML Training Service 4. ML Hosting Service
  • 40. Machine Learning with Amazon SageMaker. On-premises data, web app data, Amazon RDS, other databases, streaming data. Amazon QuickSight, Amazon SageMaker
  • 41. Agility and Innovation Are Key. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue
  • 42. BDA305 Thank You! Alex Coqueiro, Public Sector Solutions Architecture Team, Amazon Web Services
  • 43. Please complete the session survey in the summit mobile app.
  • 44. Submit Session Feedback 1. Tap the Schedule icon. 2. Select the session you attended. 3. Tap Session Evaluation to submit your feedback.