Data Lake
The new, the old
End to end pipeline
Lake formation
Security
Glue Blueprints
Lambda
ML
Who doesn’t have an AWS account?
Dark data
Schema on read: the schema is defined in a catalog
Parquet preferred
Compute managed, but options
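To make the schema-on-read point concrete, here is a minimal boto3 sketch that registers a Parquet table in the Glue Data Catalog while the data itself stays in S3. The database, table, column, and bucket names are hypothetical placeholders, not values from this deck.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names: "sales_db", "orders", and the S3 path are placeholders.
glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-data-lake/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    },
)
```

The data never moves: any engine that reads the catalog (Athena, EMR, Redshift Spectrum) can now query the same Parquet files in place.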
Business decisions were made around enterprise data warehousing with BI tools.
Less relational, more diverse.
10x every 5 years
Who has access/what type of access
The data lake is the evolution of data warehousing.
1/ easy path to build a data lake and start running diverse analytics workloads,
2/ secure cloud storage, compute, and network infrastructure that meets the specific needs of analytic workloads,
3/ a fully integrated analytics stack with a mature set of analytics tools, covering all common use cases and leveraging open source and standard languages, engines, and platforms, and
4/ the best performance, the most scalability, and the lowest cost for analytics.
analyze in a variety of ways with different engines
Go beyond insights: from operational reporting on historical data to ML and real-time analytics => accurately predict future outcomes.
Use S3 to provide even more insight, without the delays and cost of moving or transforming your data.
From mature companies with dark data to startup companies running real-time applications built on what they learn
(How to Build a Data Lake.pptx)
Spend more time here, ask what people are using
Natively supported by big data frameworks (Spark, Hive, Presto, and others)
Decouple storage and compute
No need to run compute clusters for storage (unlike HDFS)
Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
Multiple & heterogeneous analysis clusters and services can use the same data
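Picking up the transient-EMR-with-Spot point above, a minimal boto3 sketch of a throwaway cluster that terminates when its work is done. The release label, instance types, roles, log bucket, and job script path are placeholder assumptions, not a definitive configuration.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical log bucket and job script; adjust to your environment.
response = emr.run_job_flow(
    Name="transient-analytics-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-data-lake/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core-spot", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-data-lake/jobs/job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Because storage and compute are decoupled, nothing is lost when the cluster goes away; the data stays in S3 for the next cluster or service.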
Designed for 99.999999999% durability
No need to pay for data replication within a region
Secure – SSL, client/server-side encryption at rest
Encryptable
Hive compatible
Mini-ETL with CREATE TABLE AS SELECT (CTAS), views, workgroups, querying JSON, catalog upgrades
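A hedged sketch of the CTAS "mini-ETL" idea, driven through boto3: rewrite a JSON table as partitioned Parquet. The database, table, bucket, and workgroup names are placeholders.

```python
import boto3

athena = boto3.client("athena")

# CTAS as a "mini-ETL": rewrite a JSON table as partitioned Parquet.
# Database, table, bucket, and workgroup names are placeholders.
ctas = """
CREATE TABLE sales_db.orders_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://example-data-lake/orders-parquet/',
  partitioned_by = ARRAY['dt']
) AS
SELECT order_id, amount, dt
FROM sales_db.orders_json
"""

athena.start_query_execution(
    QueryString=ctas,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-data-lake/athena-results/"},
)
```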
Who doesn’t love a good pie chart.
I think that “other” might be designing their data storage structure.
As someone helping solve issues in customer datasets, I can tell you some people need to spend more time defining their partition structure and data generation size
Mention security here: how governments and medical companies are using it.
There is a slide coming with detail; just mention security here.
Many exist, but it is still not simple enough.
Manual and time-consuming tasks such as loading data from diverse sources, monitoring these data flows, setting up partitions, turning on encryption and managing keys, re-organizing data into columnar format, and granting and auditing access.
days, not months.
enables secured self-service discovery and access for users
Aware of multiple analytics services,
easy on-demand access to specific resources that fit the processor and memory requirements of each analytics workload.
The data is curated and cataloged, already prepared for any flavor of analytics, and related records are matched and de-duplicated with machine learning.
Automation reduces the time it takes to get to answers when your data lake is built on top of AWS
Lake Formation simplifies this manual process and automates many of the steps, allowing customers to set up a data lake in just a few clicks from a single, unified dashboard. This reduces the time to set up a data lake from months to days.
To eliminate silos, you need to build a data lake
automates many of the complex steps required to set up a data lake, reducing the time required to build a secure data lake from months to days.
Security control at the object level for our object storage (data lake storage) layer. Other cloud vendors only provide bucket level security control.
Deep integration across services that are needed to get answers from your data, including storage, compute, networking, and data movement. For example, Amazon EMR makes it easy to use EC2 Spot instances to save up to 90% on analytic workloads. Amazon Redshift allows you to query your S3 objects directly from your data warehouse.
A single security model across all analytic services. AWS Lake Formation provides a single way to control access to your data whether you are accessing that data from a data warehouse, a Spark cluster, or a serverless query technology.
Mature analytics services. Amazon EMR was first released in 2009 and Amazon Redshift first launched in 2013. Amazon S3 was one of the first AWS products and has been available since 2006. Tens of thousands of customers have data lakes on AWS and X exabytes of data is analyzed every day.
A single object storage layer that is compatible with all AWS analytics and machine learning services. Amazon S3 is our only object storage service, we do not have different versions of S3 and we do not have separate “data lake storage.”
5 storage tiers and intelligent tiering in Amazon S3, so you are able to store more data at a lower cost and with less manual data lifecycle work than with any other cloud provider.
Amazon S3 and AWS managed services store customer data in independent data centers across three Availability Zones within a single AWS Region and automatically replicate data between them, providing a very high degree of fault tolerance and data durability out of the box.
AWS analytics services provide best of breed performance. Amazon Redshift is 2x faster than the next most popular competitor and Amazon EMR runs Apache Spark workloads over 10x faster than open source Spark. Speed helps get to answers quickly and also helps keep costs down for complex analytics.
AWS Lake Formation has an enhanced Data Catalog that enables users to record additional metadata and tags at the database, table, and column level. All of this metadata is searchable.
Good time to break?
Database, table, column
IAM is API based, but it isn't designed for fine-grained, data-level access control.
In the Glue Data Catalog we can grant resource-level permissions, but again it is API based and doesn't give granular enough access.
Register locations down to the file level.
Need to provide an IAM role that has access to that location and trusts Lake Formation.
We update the service-linked role (SLR) policy. By default it has permission to list the bucket, but as you register locations, we add additional permissions.
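A minimal sketch of registering a location, assuming the default service-linked role; the bucket ARN is a placeholder.

```python
import boto3

lf = boto3.client("lakeformation")

# Register an S3 location with Lake Formation. The bucket/prefix is a placeholder.
# UseServiceLinkedRole lets Lake Formation manage the SLR policy as locations are
# added; alternatively pass RoleArn for a custom role that has access to the
# location and trusts Lake Formation.
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-data-lake/orders/",
    UseServiceLinkedRole=True,
)
```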
The data lake admin is not the IAM admin by default. Keep them separate for security.
Lake Formation does not ignore IAM. IAM admins need to add themselves as data lake admins.
In Lake Formation you grant permissions on a table without having to give access to the bucket.
A Glue customer who is not yet a Lake Formation customer will have full access until the location is registered.
Tells Athena, Redshift and EMR to check LF when querying.
By default, no one has access. Grant permissions to any IAM users/roles.
All permissions are on catalog objects. After registering a location, Athena won't work until permissions are granted.
The root account is denied and shouldn't be used for access; grants are allowed on IAM users and roles.
Takes in a principal (user/role), then a resource (database/table/column).
Table permissions can be granted or denied. Similar to database permissions, with some differences.
Grantable permissions are permissions to give permissions to others, for example for managers. A best practice for permission design.
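A minimal boto3 sketch of a table-level grant with the grant option (so a manager can re-grant); the role ARN, database, and table names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a catalog table to an IAM role; ARN and names are placeholders.
# PermissionsWithGrantOption lets the grantee re-grant SELECT to others.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales_db", "Name": "orders"}},
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)
```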
Example: Athena query:
get-table
get-temporary-credentials
We don’t use
To grant permissions on a table, you must specify the database as well.
Athena and Redshift can do column-level filtering; Glue can't.
Column level support.
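Column-level support in practice, sketched with boto3: grant SELECT on specific columns only. The role ARN, database, table, and column names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Column-level grant: the principal can SELECT only the listed columns.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],  # all other columns are filtered out
        }
    },
    Permissions=["SELECT"],
)
```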
Encryption on catalog
Service side security
Relationship advice
Who here is building ETL with Glue?
Who uses Crawler?
Who uses Workflow?
Console only feature
Why no S3 to S3?
Endpoint is more secure, less latency.
Examples:
AWS Lake Formation includes specialized ML-based dataset transformation algorithms customers can use to create their own ML Transforms. These include record de-duplication and match finding.
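A hedged sketch of creating a FindMatches ML Transform with boto3; the database, table, key column, role, and tradeoff values are placeholder assumptions.

```python
import boto3

glue = boto3.client("glue")

# A FindMatches ML Transform for record de-duplication / match finding.
# Database, table, key column, and role are placeholders.
glue.create_ml_transform(
    Name="dedupe-customers",
    Role="arn:aws:iam::111122223333:role/GlueMLTransformRole",
    InputRecordTables=[{"DatabaseName": "sales_db", "TableName": "customers"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            "PrecisionRecallTradeoff": 0.5,  # balance precision vs. recall
            "AccuracyCostTradeoff": 0.5,     # balance accuracy vs. run cost
        },
    },
)
```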
1/ Less Hassle: AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.
2/ Cost Effective / Serverless: AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
3/ More Power: AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
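Roughly the shape of a generated Glue ETL script (PySpark with the awsglue library), shown here as a sketch rather than actual generated output; the catalog names and S3 path are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Placeholder catalog database/table and S3 output path.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Data Catalog (schema discovered by a crawler).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_json"
)

# Write back to the lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-data-lake/orders-parquet/"},
    format="parquet",
)
job.commit()
```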
Go read the docs!!
STORY BACKGROUND
Georgia-Pacific, owned by Koch Industries, is an American wood products, pulp, and paper company based in Atlanta, Georgia. The organization is one of the world’s largest manufacturers and distributors of pulp, towel and tissue paper and dispensers, packaging, and wood and gypsum building products.
They use an S3 data lake as part of an advanced analytics and ML solution to gain new insights, optimize processes, and maximize resources.
They now save millions annually by leveraging new insights to improve equipment failure predictions, run more production lines efficiently, and ensure high quality products.
https://aws.amazon.com/solutions/case-studies/georgia-pacific/
In the first six months, Georgia-Pacific transferred about 50 TB of production data—more than 500 billion records—from hundreds of large, complex manufacturing and converting-process machines. The company uses Amazon Kinesis to stream real-time data from manufacturing equipment to a central data lake based on Amazon Simple Storage Service (Amazon S3), allowing it to efficiently ingest and analyze structured and unstructured data at scale.
Georgia-Pacific knew it could learn from its structured and unstructured data, but the company lacked a cost-effective storage mechanism to ingest, transform, house, and analyze this data.
Georgia-Pacific uses Amazon Elastic MapReduce (Amazon EMR) to transform the data before delivering it in a structured fashion to data analysts through Amazon Redshift. The analysts use Amazon Athena on top of Amazon S3 to query the raw data, which includes information on pulping mechanisms, paper machines, converting lines, vibration trends, throughput, and paper quality.
Georgia-Pacific also uses Amazon SageMaker, an AWS machine-learning (ML) solution, to build, train, and deploy ML models at scale. Using ML models built with raw production data, Amazon SageMaker provides real-time feedback to machine operators regarding optimum machine speeds and other adjustable variables, enabling less experienced operators to detect breaks earlier and maintain quality.