In this session, we will take you through getting started with a Data Lake by looking at how you can ingest data to Amazon S3, query it with Amazon Athena and perform ETL operations on it using AWS Glue. We will be using the Redshift cluster from the previous session to export data to S3 to query.
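The Redshift-to-S3 export mentioned above can be sketched with the Redshift Data API via boto3. This is a hypothetical example: the cluster, table, bucket, and IAM role names are placeholders, not values from the session.

```python
def build_unload_sql(table, s3_path, iam_role_arn):
    """Build a Redshift UNLOAD statement that exports a table to S3 as Parquet."""
    return (
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS PARQUET"
    )


def run_unload(cluster, database, db_user, sql):
    """Submit the UNLOAD through the Redshift Data API (needs AWS credentials)."""
    import boto3  # imported here so building the SQL stays dependency-free

    client = boto3.client("redshift-data")
    return client.execute_statement(
        ClusterIdentifier=cluster, Database=database, DbUser=db_user, Sql=sql
    )


# Example with placeholder names:
# sql = build_unload_sql("public.sales", "s3://my-data-lake/raw/sales/",
#                        "arn:aws:iam::123456789012:role/RedshiftUnloadRole")
# run_unload("demo-cluster", "dev", "awsuser", sql)
```

Once the Parquet files land in S3, a Glue crawler can catalog them and Athena can query them in place.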
AWS SSA Webinar 21 - Getting Started with Data lakes on AWS
1. Cobus Bernard
Sr Developer Advocate
Amazon Web Services
Getting Started with Data Lakes on
AWS
@cobusbernard
cobusbernard
cobusbernard
2. Agenda
What is a Data Lake
Storing data in S3
Steps to build a Data Lake
AWS Lake Formation
Demo
Q&A
3. A data lake is a
centralised repository
that allows you to store all your
structured and unstructured
data at any scale
4. Why data lakes?
Data Lakes provide:
Relational and non-relational data
Scale-out to EBs (1EB = 1,024 PB = 1,048,576 TB)
Diverse set of analytics and machine learning tools
Work on data without any data movement
Designed for low cost storage and analytics
[Diagram: data from OLTP, ERP, CRM, and LOB systems feeds a data warehouse for business intelligence, while devices, web, sensors, and social feeds flow into a data lake, whose catalog serves machine learning, DW queries, big data processing, and interactive and real-time analytics]
5. Build a secure data lake on Amazon S3
Amazon S3 Block Public Access
• Controls public access
• Across AWS accounts & individual S3 bucket levels
• Specify any type of public permissions via ACL or policy
Amazon S3 object lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance
Amazon S3 object tags
• Access control, lifecycle policies, analysis
• Classify data, filter objects
• Define replication policies
Amazon S3 access points
• Multi-tenant bucket
• Dedicated access points
• Custom permissions from a virtual private cloud (VPC)
FSx for Lustre
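As a sketch of the first control on this slide, the Block Public Access settings can be applied per bucket with boto3. The bucket name is a placeholder.

```python
def block_public_access_config():
    """All four S3 Block Public Access controls, enabled."""
    return {
        "BlockPublicAcls": True,        # reject new public ACLs
        "IgnorePublicAcls": True,       # ignore any existing public ACLs
        "BlockPublicPolicy": True,      # reject public bucket policies
        "RestrictPublicBuckets": True,  # restrict public/cross-account bucket access
    }


def apply_block_public_access(bucket):
    """Apply the controls to one bucket (needs AWS credentials)."""
    import boto3  # imported here so building the config stays dependency-free

    boto3.client("s3").put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=block_public_access_config(),
    )


# apply_block_public_access("my-data-lake")  # placeholder bucket name
```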
6.
7. Choosing the right data lake storage class
Select storage class by data pipeline stage:

Raw data: Amazon S3 Standard
• Small log files
• Overwrites if synced
• Short lived
• Moved & deleted
• Batched & archived

ETL: Amazon S3 Standard
• Data churn
• Small intermediates
• Multiple transforms
• Deletes <30 days
• Output to data lake

Production data lake: Amazon S3 Intelligent-Tiering
• Optimized sizes (MBs)
• Many users
• Unpredictable access
• Long-lived assets
• Hot to cool

Online cool data: Amazon S3 Standard-Infrequent Access and One Zone-IA
• Replicated DR data
• Infrequently accessed
• Infrequent queries
• ML model training

Historical data: Amazon S3 Glacier or S3 Glacier Deep Archive
• Historical assets
• ML model training
• Compliance/Audit
• Data protection
• Planned restores
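The stage-to-storage-class mapping above is typically automated with S3 lifecycle rules. A minimal boto3 sketch, assuming hypothetical `curated/` and `historical/` prefixes:

```python
def data_lake_lifecycle_rules():
    """Lifecycle rules moving cool and historical data to cheaper storage classes."""
    return [
        {
            "ID": "cool-data-to-ia",
            "Filter": {"Prefix": "curated/"},     # placeholder prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
        },
        {
            "ID": "historical-to-deep-archive",
            "Filter": {"Prefix": "historical/"},  # placeholder prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
        },
    ]


def apply_lifecycle(bucket):
    """Attach the rules to a bucket (needs AWS credentials)."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": data_lake_lifecycle_rules()},
    )
```

Note that S3 requires objects to be at least 30 days old before transitioning to Standard-IA, which the rule above respects.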
8. Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
9. Data preparation accounts for ~80% of the work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
10. Sample of steps required
Find sources
Create Amazon Simple Storage Service (Amazon S3) locations
Configure access policies
Map tables to Amazon S3 locations
ETL jobs to load and clean data
Create metadata access policies
Configure access from analytics services
Rinse and repeat for other:
data sets, users, and end-services
And more:
manage and monitor ETL jobs
update metadata catalog as data changes
update policies across services as users and permissions change
manually maintain cleansing scripts
create audit processes for compliance
…
Manual | Error-prone | Time consuming
11. Build a secure data lake in days
AWS Lake Formation
Identify, ingest, clean, and transform data
Enforce security policies across multiple services
Gain and manage new insights
13. Register existing data or import new
Amazon S3 forms the storage layer for
Lake Formation
Register existing S3 buckets that
contain your data
Ask Lake Formation to create required
S3 buckets and import data into them
Data is stored in your account. You have
direct access to it. No lock-in.
[Diagram: Lake Formation layers data import, crawlers, ML-based data prep, a data catalog, and access control on top of the data lake storage]
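Registering an existing bucket with Lake Formation can be sketched with boto3's `register_resource` call. The bucket and role ARNs are placeholders.

```python
def register_resource_request(bucket_arn, role_arn):
    """Request parameters for lakeformation.register_resource."""
    return {
        "ResourceArn": bucket_arn,      # e.g. arn:aws:s3:::my-data-lake
        "RoleArn": role_arn,            # role Lake Formation assumes to read the data
        "UseServiceLinkedRole": False,  # use the custom role, not the service-linked one
    }


def register_s3_location(bucket_arn, role_arn):
    """Register an existing S3 location with Lake Formation (needs AWS credentials)."""
    import boto3

    boto3.client("lakeformation").register_resource(
        **register_resource_request(bucket_arn, role_arn)
    )
```

Because the data stays in your own bucket, deregistering the location later leaves the data untouched, which is the "no lock-in" point made above.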
14. Easily load data to your data lake
[Diagram: blueprints load logs and databases into the data lake, either one-shot or incrementally, through Lake Formation's data import, crawlers, ML-based data prep, data catalog, and access control]
15. With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
That’s why many customers are moving to a data lake architecture.
A data lake is an architectural approach that helps you manage multiple data types from a wide variety of sources, both structured and unstructured, through a unified set of tools, so it's readily available to be categorized, processed, analyzed and consumed by diverse groups within an organization. Since data can be stored as-is, there is no need to convert it to a predefined schema and you no longer need to know what questions you want to ask of your data beforehand.
A data lake supports batch, interactive, online, search, in-memory, and other processing engines.
Data lakes allow you to break down data silos and bring data into a single central repository. You can store a wide variety of data formats, at any scale and at low cost. Data lakes provide you a single source of truth and allow you access to the same data using a variety of analytics and machine-learning tools.
Turns out there are a lot of steps involved in building data lakes
1/ Set up storage – Data lakes hold a massive amount of data. Before doing anything else, customers need to set up storage to hold all of that data. If they are using AWS, they configure S3 buckets and partitions. If they are doing this on-premises, they acquire hardware and set up large disk arrays to hold all of the data for their data lake.
2/ Move data -- Customers need to connect to different data sources on-premises, in the cloud, and on IoT devices. Then they need to collect and organize the relevant data sets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. Customers do this today with a collection of file transfer and ETL tools, like AWS Glue.
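The crawl-and-catalog step described above can be sketched with the Glue API in boto3. The crawler, role, database, and path names are placeholders.

```python
def crawler_definition(name, role_arn, database, s3_path):
    """Parameters for glue.create_crawler targeting one S3 prefix."""
    return {
        "Name": name,
        "Role": role_arn,                 # IAM role the crawler runs as
        "DatabaseName": database,         # Glue Data Catalog database for the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(name, role_arn, database, s3_path):
    """Create the crawler and kick off a first run (needs AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**crawler_definition(name, role_arn, database, s3_path))
    glue.start_crawler(Name=name)
```

After the run completes, the extracted schemas appear as tables in the named catalog database, ready for Athena or ETL jobs.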
3/ Clean and prepare data -- Next, that data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. Customers need to clean, de-duplicate, and match related records. Today this is done using rigid and complex SQL statements that only work so well and are difficult to maintain. This process of collecting, cleaning, and transforming the incoming data is complex and must be manually monitored in order to avoid errors.
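One common way to do the columnar conversion and de-duplication described above is an Athena CTAS query that rewrites a raw table as Parquet. The table and location names here are hypothetical.

```python
def ctas_to_parquet_sql(source_table, target_table, external_location):
    """Athena CTAS statement rewriting a table as de-duplicated Parquet."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', external_location = '{external_location}') "
        f"AS SELECT DISTINCT * FROM {source_table}"
    )


def run_athena_query(sql, database, output_location):
    """Submit the query to Athena (needs AWS credentials)."""
    import boto3

    return boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
```

For heavier transforms, matching of related records, or recurring runs, a Glue ETL job is the better fit; the CTAS approach suits one-off conversions.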
4/ Configure and enforce policies – Sensitive data must be secured according to compliance requirements. This means creating and applying data access, protection, and compliance policies to make sure you are meeting required standards. For example, restricting access to personally identifiable information (PII) at the table, column, or row level, encrypting all data, and keeping audit logs of who is accessing the data. Today customers use access control lists on S3 buckets or they use 3rd party encryption and access control software to secure the data. And for every analytics service that needs to access the data, customers need to create and maintain data access, protection and compliance policies for each one. For example, if you are running analysis against your data lake, using Redshift and Athena, you need to set up access control rules for each of these services.
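The column-level PII restriction mentioned above maps to a Lake Formation grant with a column wildcard that excludes the sensitive columns. A sketch with placeholder names:

```python
def pii_safe_select_grant(principal_arn, database, table, pii_columns):
    """Parameters for lakeformation.grant_permissions: SELECT on every
    column of the table except the listed PII columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnWildcard": {"ExcludedColumnNames": list(pii_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(principal_arn, database, table, pii_columns):
    """Apply the grant (needs AWS credentials and Lake Formation admin rights)."""
    import boto3

    boto3.client("lakeformation").grant_permissions(
        **pii_safe_select_grant(principal_arn, database, table, pii_columns)
    )
```

Because the grant lives in Lake Formation rather than in each service, Athena, Redshift Spectrum, and EMR all see the same filtered view of the table.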
5/ Make it easy to find data - Different people in your organization, like analysts and data scientists, may have trouble finding and trusting data sets in the data lake. You need to make it easy for those end users to find relevant and trusted data. To do this, you must clearly label the data in a catalog of the data lake and provide users with the ability to access and analyze this data without making requests to IT.
Each of these steps involve a lot of work because today a lot of it is done manually. Customers can spend months building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and machine learning. With all these steps, a fully productive data lake can take months to implement.
TRANSITION: We’ve learned from the tens of thousands of customers running analytics on AWS that most customers that want to do analytics want to build a data lake, and many of them want this to be easier and faster than it is today.
A recent study by CrowdFlower surveyed ~80 data scientists about their jobs. It found that data scientists spend 60% of their time cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis. The study also found data preparation to be the least enjoyable part of their work! https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6493d6c76f63
The question we have to ask ourselves is if we can make data preparation easier? Can we minimize the time that people are collecting data sets and cleaning/organizing their data?
Lake Formation automates many of the steps we discussed, allowing customers to get started with just a few clicks from a single, unified dashboard.
1/ Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster.
2/ Enforce security policies across multiple services: Once your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications.
3/ Gain and manage new insights: With Lake Formation, you build a data catalog that describes available data sets and their appropriate business uses. This makes your users more productive by helping them find the right data set to analyze.
By providing a catalog of your data and consistent security enforcement, Lake Formation makes it easier for your analysts and data scientists to combine multiple analytic tools, like Athena, Redshift, and EMR, across diverse data sets.
With just a few clicks, you can set up your data lake on Amazon S3 and start ingesting data that is readily queryable with Lake Formation.
To get started, you go to the Lake Formation dashboard in the AWS console, add your data sources, and then Lake Formation will crawl those sources and move the data into your new Amazon S3 data lake. Lake Formation uses machine learning to automatically lay out the data in Amazon S3 partitions, change it into formats for faster analytics, like Apache Parquet and ORC, and also de-duplicate and find matching records to increase data quality.
From a single screen, you set up all of the permissions for your data lake, and they will be implemented across all services accessing this data: analytics and machine learning services such as Amazon Redshift, Amazon Athena, and Amazon EMR. This reduces the hassle of re-defining policies across multiple services and provides consistent enforcement of, and compliance with, those policies.
Blueprints heavily leverage the functionality in AWS Glue:
We use Glue crawlers and connections to connect and discover the raw data that needs to be ingested.
We use Glue code-gen and jobs to generate the ingest code to bring that data into the data lake.
We leverage the data catalog for organizing the metadata.
We have added a workflow construct to stitch together crawlers, jobs, and allow for monitoring for individual workflows.
They’re a natural extension of the AWS Glue capabilities.
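The workflow construct described above can be sketched with the Glue API: an on-demand trigger starts a crawler, and a conditional trigger runs the ETL job once the crawler succeeds. All names here are placeholders.

```python
def workflow_triggers(workflow, crawler, job):
    """Two Glue triggers stitching a crawler and a job into one workflow."""
    return [
        {   # entry point: start the workflow by running the crawler
            "Name": f"{workflow}-start",
            "WorkflowName": workflow,
            "Type": "ON_DEMAND",
            "Actions": [{"CrawlerName": crawler}],
        },
        {   # run the ETL job only after the crawler finishes successfully
            "Name": f"{workflow}-run-etl",
            "WorkflowName": workflow,
            "Type": "CONDITIONAL",
            "StartOnCreation": True,
            "Predicate": {
                "Conditions": [{
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler,
                    "CrawlState": "SUCCEEDED",
                }]
            },
            "Actions": [{"JobName": job}],
        },
    ]


def create_workflow(workflow, crawler, job):
    """Create the workflow and its triggers (needs AWS credentials)."""
    import boto3

    glue = boto3.client("glue")
    glue.create_workflow(Name=workflow)
    for trigger in workflow_triggers(workflow, crawler, job):
        glue.create_trigger(**trigger)
```

This mirrors what a blueprint assembles for you: the pieces are plain Glue crawlers, jobs, and triggers that you can inspect and customize afterwards.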