A Data Lake allows an organisation to store all of its data, structured and unstructured, in a single centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
Speakers:
Tom McMeekin, Associate Solutions Architect, Amazon Web Services
3. How to Unlock the Full Potential of Your Data
Customers are using data to enable new and innovative business capabilities.
[Chart: "Enterprise Data" vs. "Data in Warehouse" over time]
Growing gap between data in the Warehouse vs. what's across the Enterprise.
4. Evolution of Data Architectures Has Created Isolated Silos
[Diagram: isolated silos: Hadoop Cluster, SQL Database, Data Warehouse Appliance]
5. Is HDFS the Right Choice for a Data Lake?
Multiple layers of functionality all on a single cluster
[Diagram: a Hadoop Master Node coordinating worker nodes that each combine CPU, Memory and HDFS Storage]
6. Customer Challenges
Where is the single source of truth?
How can I collect data quickly from various sources and store it efficiently?
How can I scale up with the volume of data being generated?
Is there a way I can apply multiple analytics and processing frameworks to the same data?
7. What is a Data Lake?
"A single store for all of the raw data that anyone in an organisation might need to analyse." - Martin Fowler
"The promise of a data lake is to provide a place to store the data…so it will be available for analytics and data science." - Alex Gorelik
"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." - James Dixon
8. Characteristics of a Data Lake
Collect Everything
Dive in Anywhere
Flexible Access
9. Decouple Storage and Compute
[Diagram: compute nodes (CPU, Memory) scaled independently from a shared Storage layer, replacing the coupled Hadoop Master Node design]
19. Central Storage
Central Storage: secure, cost-effective storage in S3.
Catalog & Search: access and search metadata (DynamoDB, Amazon ES).
Access & User Interface: give your users easy and secure access (API Gateway, IAM, Cognito).
Protect & Secure: use entitlements to ensure data is secure and user identities are verified (Security Token Service, CloudWatch, CloudTrail, KMS).
Processing & Analytics: use predictive and prescriptive analytics to gain better understanding (Athena, QuickSight, EMR, Redshift).
Data Ingestion: get your data into S3 quickly and securely (Firehose, Direct Connect, Snowball, DMS).
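Ingested raw objects are easier to catalogue and query later if they land in S3 under a consistent, partitioned key layout. A minimal sketch in Python, assuming a hypothetical `dataset` name and a date-based prefix convention (the scheme below is an illustration, not part of the AWS reference architecture):

```python
from datetime import datetime, timezone

def s3_key(dataset: str, filename: str, ts: datetime) -> str:
    """Build a date-partitioned S3 object key, e.g.
    raw/clickstream/year=2017/month=03/day=15/events.json"""
    return (
        f"raw/{dataset}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{filename}"
    )

ts = datetime(2017, 3, 15, tzinfo=timezone.utc)
print(s3_key("clickstream", "events.json", ts))
# raw/clickstream/year=2017/month=03/day=15/events.json
```

Hive-style `key=value` prefixes let engines such as Athena and EMR prune partitions when the data is later queried in place.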
20. Benefits of a Data Lake – All Data in One Place
Store and analyse all of your data, from all of your sources, in one centralised location.
"My data is distributed in many locations. Where is the single source of truth?"
21. Benefits of a Data Lake – Quick Ingest
Quickly ingest data without needing to force it into a pre-defined schema.
"How can I collect data quickly from various sources and store it efficiently?"
22. Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute allows you to scale each component as required.
"How can I scale up with the volume of data being generated?"
23. Benefits of a Data Lake – Schema on Read
"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
A Data Lake enables ad-hoc analysis by applying schemas on read, not on write.
24. AWS Solution Builder – Data Lake on AWS
Reference Architecture deployment via CloudFormation.
Configures core services to tag, search and catalogue datasets.
Deploys a console to search and browse available datasets.
http://amzn.to/2nTVjcp
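The tag-and-search behaviour the solution configures can be sketched as a toy in-memory catalogue in plain Python (in the actual deployment, DynamoDB and Amazon ES play these roles; the class and dataset names below are purely illustrative):

```python
class DatasetCatalog:
    """Toy metadata catalogue: register datasets with tags, search by tag."""

    def __init__(self):
        self._items = {}

    def register(self, name, s3_prefix, tags):
        # Store the dataset's location and a tag set for later lookup.
        self._items[name] = {"s3_prefix": s3_prefix, "tags": set(tags)}

    def search(self, tag):
        # Return the names of all datasets carrying the given tag.
        return sorted(n for n, m in self._items.items() if tag in m["tags"])

catalog = DatasetCatalog()
catalog.register("clickstream", "s3://lake/raw/clickstream/", ["web", "raw"])
catalog.register("orders", "s3://lake/raw/orders/", ["sales", "raw"])
print(catalog.search("raw"))   # ['clickstream', 'orders']
print(catalog.search("web"))   # ['clickstream']
```

The same register-then-search flow is what the deployed console exposes over API Gateway against the real metadata stores.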