A Data Lake allows an organisation to store all of its data, structured and unstructured, in a single centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
Speakers:
Tom McMeekin, Associate Solutions Architect, Amazon Web Services
3. How to Unlock the Full Potential of Your Data
Customers are using data to enable new and innovative business capabilities.
[Chart: "Enterprise Data" vs. "Data in Warehouse" over time]
Growing gap between data in the Warehouse vs. what's across the Enterprise.
4. Evolution of Data Architectures Has Created Isolated Silos
[Diagram: isolated silos: Hadoop Cluster, SQL Database, Data Warehouse Appliance]
5. Is HDFS the Right Choice for a Data Lake?
Multiple layers of functionality all on a single cluster
[Diagram: a Hadoop Master Node coordinating worker nodes that each combine CPU, Memory and HDFS Storage]
6. Customer Challenges
Where is the single source of truth?
How can I collect data quickly from various sources and store it efficiently?
How can I scale up with the volume of data being generated?
Is there a way I can apply multiple analytics and processing frameworks to the same data?
7. What is a Data Lake?
"A single store for all of the raw data that anyone in an organisation might need to analyse." - Martin Fowler
"The promise of a data lake is to provide a place to store the data…so it will be available for analytics and data science." - Alex Gorelik
"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." - James Dixon
8. Characteristics of a Data Lake
Collect Everything
Dive in Anywhere
Flexible Access
9. Decouple Storage and Compute
[Diagram: compute nodes (CPU, Memory) scaled independently from a shared Storage layer, replacing the coupled Hadoop Master Node design]
19. Central Storage
Central Storage: secure, cost-effective storage in S3.
Catalog & Search: access and search metadata (DynamoDB, Amazon ES).
Access & User Interface: give your users easy and secure access (API Gateway, IAM, Cognito).
Protect & Secure: use entitlements to ensure data is secure and user identities are verified (Security Token Service, CloudWatch, CloudTrail, KMS).
Processing & Analytics: use predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Redshift).
Data Ingestion: get your data into S3 quickly and securely (Firehose, Direct Connect, Snowball, DMS).
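Ingested raw objects are easier to catalogue and query later if they land in S3 under a consistent, partitioned key layout. A minimal sketch in Python, assuming a hypothetical `dataset` name and a date-based prefix convention (the scheme below is an illustration, not part of the AWS reference architecture):

```python
from datetime import datetime, timezone

def s3_key(dataset: str, filename: str, ts: datetime) -> str:
    """Build a date-partitioned S3 object key, e.g.
    raw/clickstream/year=2017/month=03/day=15/events.json"""
    return (
        f"raw/{dataset}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{filename}"
    )

ts = datetime(2017, 3, 15, tzinfo=timezone.utc)
print(s3_key("clickstream", "events.json", ts))
# raw/clickstream/year=2017/month=03/day=15/events.json
```

Hive-style `key=value` prefixes let engines such as Athena and EMR prune partitions when the data is later queried in place.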
20. Benefits of a Data Lake – All Data in One Place
Store and analyse all of your data, from all of your sources, in one centralised location.
"My data is distributed in many locations. Where is the single source of truth?"
21. Benefits of a Data Lake – Quick Ingest
Quickly ingest data without needing to force it into a pre-defined schema.
"How can I collect data quickly from various sources and store it efficiently?"
22. Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component as required.
"How can I scale up with the volume of data being generated?"
23. Benefits of a Data Lake – Schema on Read
"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
A Data Lake enables ad-hoc analysis by applying schemas on read, not on write.
24. AWS Solution Builder – Data Lake on AWS
Reference Architecture deployment via CloudFormation.
Configures core services to tag, search and catalogue datasets.
Deploys a console to search and browse available datasets.
http://amzn.to/2nTVjcp
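The tag-and-search behaviour the solution configures can be sketched as a toy in-memory catalogue in plain Python (in the actual deployment, DynamoDB and Amazon ES play these roles; the class and dataset names below are purely illustrative):

```python
class DatasetCatalog:
    """Toy metadata catalogue: register datasets with tags, search by tag."""

    def __init__(self):
        self._items = {}

    def register(self, name, s3_prefix, tags):
        # Store the dataset's location and a tag set for later lookup.
        self._items[name] = {"s3_prefix": s3_prefix, "tags": set(tags)}

    def search(self, tag):
        # Return the names of all datasets carrying the given tag.
        return sorted(n for n, m in self._items.items() if tag in m["tags"])

catalog = DatasetCatalog()
catalog.register("clickstream", "s3://lake/raw/clickstream/", ["web", "raw"])
catalog.register("orders", "s3://lake/raw/orders/", ["sales", "raw"])
print(catalog.search("raw"))   # ['clickstream', 'orders']
print(catalog.search("web"))   # ['clickstream']
```

The same register-then-search flow is what the deployed console exposes over API Gateway against the real metadata stores.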