AWS Big Data: Your Fast Track to Data Lakes
Presented By:
Jay Duff
jay@reluscloud.com
Big Data Practice Director
Big Data Model Maturity Index
Business Monitoring
Business Insights
Business Optimization
Data Monetization
Business Metamorphosis
Schmarzo, Bill. Big Data: Understanding How Data Powers Big Business (Kindle Locations 292-295). Wiley. Kindle Edition.
Inward Focus External Focus
Seizing the Big Data Opportunity
• Business Monitoring – How DID the business perform
• Business Insights – Discovering patterns, correlations, influences
• Business Optimization – Embedding algorithms to automatically adjust operations (or the customer experience)
• Data Monetization – Leveraging (or enriching) your data assets (or platform) for new revenue opportunities
• Business Metamorphosis – Ultimate goal of new products in new markets
Business Monitoring
Business Insights
Business Optimization
Data Monetization
Business Metamorphosis
Business Insights - Discovering
• Data Lake –
• Raw but accessible, unfiltered, largely unstructured
• Reduced bias
• More difficult to consume
• Narrow audience
• Data Warehouse
• Structured, Optimized
• Defined Measures & Metrics
• Easy to consume (e.g. Daily Dashboard)
• Large audience
Getting Started
S3
ETL
Extraction
Transformation
Loading
Database
Ingest
Access
• How to include data
• How to make the data valuable for discovery
• Accessible
• Secure
• Cataloged
• Durable
Cost Optimization
Performance
Security
Reliability
Operational Excellence
Ref: AWS Well-Architected Framework, https://aws.amazon.com/whitepapers/
Data is Only as Valuable as the Decisions it Enables – Ion Stoica, RISE Lab
AWS Athena
• Eliminate ETL
• Eliminate Database
• Query S3 Directly
• Auto Scale
S3
ETL
Extraction
Transformation
Loading
Database
Ingest
Access
Athena Service
AWS service, based on Presto, that queries data in many formats without requiring you to run a cluster
S3 Ingest
• AWS Console
• CLI - $ aws s3 cp …
• SDK – embed into your existing application
• Sqoop (EMR) – RDBMS to S3
• AWS Kinesis Firehose – Streaming data to S3
• AWS Snowball
• AWS Direct Connect
S3
Ingest
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Download Airline Data – CSV Format
AWS Console – S3, upload 6 data files
1. Download flight data for 2016: Jan, Feb, Mar, Apr, May, Jun
2. Unzip
3. Give each file a unique name (per year & month)
4. Copy to S3 (using CLI)
To ingest into S3:
$ aws s3 cp . s3://scalawag.data/fl/ --recursive
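The renaming and copy steps can be sketched in Python. This is a minimal sketch, assuming the unzipped downloads sit in the current directory; the `flights_YYYY_MM.csv` naming pattern and the helper names are hypothetical, and only the bucket path and the `aws s3 cp ... --recursive` call come from the slide. The command is built as an argument list and printed, not executed.

```python
import calendar

BUCKET_PREFIX = "s3://scalawag.data/fl/"  # destination from the slide


def unique_name(year: int, month: int) -> str:
    """Return a file name unique per year and month (hypothetical pattern)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"flights_{year}_{month:02d}.csv"


def copy_command(local_dir: str = ".") -> list[str]:
    """Build the CLI call from the slide as an argument list (not executed here)."""
    return ["aws", "s3", "cp", local_dir, BUCKET_PREFIX, "--recursive"]


# January through June 2016, matching the six downloaded files.
names = [unique_name(2016, m) for m in range(1, 7)]
for month, name in zip(range(1, 7), names):
    print(calendar.month_abbr[month], "->", name)
print(" ".join(copy_command()))
```

Unique names matter because all six files land under the same `fl/` prefix; identical names would overwrite each other.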
Athena: Creating a Table
• Create Database: AWS
• Table: flight
• Data location: s3://scalawag.data/fl/
Building Create Table Statement
Console will guide you through the process
• Select CSV format
• Add columns
• Skip Partitions
• Run Query
Additional Formats:
• Apache Web Log
• CSV, TSV, Delimited Text
• JSON
• Parquet & ORC
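The statement the console builds is ordinary Hive-style DDL. As a rough sketch of what it might look like for this table, the Python below assembles a `CREATE EXTERNAL TABLE` string. The column names and types are hypothetical placeholders (the slides do not list the schema), the database name is written in lowercase, and the plain comma-delimited row format is an assumption that ignores header rows and quoted fields.

```python
# Column names and types are hypothetical placeholders, not the real
# on-time data schema; substitute the fields you selected for download.
COLUMNS = [
    ("year", "int"),
    ("month", "int"),
    ("carrier", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("dep_delay", "double"),
]


def create_table_ddl(database, table, location, columns):
    """Assemble a Hive-style CREATE EXTERNAL TABLE statement for CSV data."""
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl("aws", "flight", "s3://scalawag.data/fl/", COLUMNS)
print(ddl)
```

The table is external: the statement only records the schema and the S3 location, and the CSV files stay where they are.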
Query
• Created the table in Hive Metastore
• CSV
• External
• S3
• Athena – Hosted Cluster
• There is no Database
• Autoscaling
• 2.44 sec
• You pay per bytes scanned
Pricing
• 3 sec
• 104 MB
• $5 per TB scanned, so 104 MB costs about $0.0005
• Partitioning by month – scans about 1/6 of the data
• Compressed columnar storage – scans about 1/8 of the data
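The arithmetic on this slide can be checked with a few lines of Python. This is a sketch under stated assumptions: decimal units (1 TB = 10^12 bytes), the $5 per TB rate quoted on the slide, and Athena's documented 10 MB per-query minimum, which the slide does not mention. The function name is hypothetical.

```python
TB = 10**12  # assumption: decimal terabyte
MB = 10**6


def athena_scan_cost(bytes_scanned, price_per_tb=5.0, min_bytes=10 * MB):
    """Estimate the cost of one query from the bytes it scans."""
    billed = max(bytes_scanned, min_bytes)  # per-query minimum applies
    return billed / TB * price_per_tb


full_scan = athena_scan_cost(104 * MB)            # the query on the slide
one_month = athena_scan_cost(104 * MB / 6)        # partitioned by month
columnar = athena_scan_cost(104 * MB / 8)         # compressed columnar storage
both = athena_scan_cost(104 * MB / (6 * 8))       # falls under the minimum

for label, cost in [("full scan", full_scan), ("one month", one_month),
                    ("columnar", columnar), ("both", both)]:
    print(f"{label}: ${cost:.6f}")
```

At this data size the savings are fractions of a cent; the same ratios matter far more once queries scan terabytes.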
Simplified Data Lake
• Retrieved External Data
• Uploaded to the Data Lake via Command Line Interface (CLI)
• (One time) Defined the Table
• Queried Data (directly from S3)
• No ETL
• No Cluster
• No Database
• What about more complex use cases?
Common Challenges
• Unstructured Data may lead to:
• Ungoverned Chaos
• Unusable Data
• Disparate & Complex Tools (that are quickly changing)
• Enterprise Wide Collaboration
• Security
• Unified, Consistent
• Common Toolset
• Storage & Compute costs
Complex Storage Requirements – Why S3
• Durable & Available
• High Performance & Scalable
• Easy To Use
• AWS SDKs
• Trigger Events – Notifications & Process Steps
• Integrated
• Encryption – SSE-S3 (S3-managed keys), SSE-C, SSE-KMS
• Policies – Lifecycle, Encryption, Access, Backup
• Native connectivity to EMR, Redshift, DynamoDB, Elasticsearch
• Low Cost & Storage Cost is Decoupled From Compute
• Widely Adopted
Adding ETL Complexity
• Storage Formats
• Parquet, ORC
• Avro
• Partitioning
• By Date
• By Geography
• Deduplication
• Streaming
S3
EMR
ETL
Extraction
Transformation
Loading
Database
Ingest
Access
Elastic Map Reduce Service
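Partitioning by date usually shows up as a naming convention for object keys. The sketch below builds Hive-style `year=/month=` prefixes, a layout that Hive, Spark, and Athena can map to partitions. The function name, the `fl/` prefix reuse, and the file name are assumptions for illustration; the slides do not prescribe a layout.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Return an object key under Hive-style year=/month= partitions."""
    return f"{prefix}year={day.year}/month={day.month:02d}/{filename}"


key = partition_key("fl/", date(2016, 3, 1), "flights_2016_03.csv")
print(key)
```

A query that filters on `year` and `month` then only needs to read the objects under the matching prefix.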
Amazon Elastic MapReduce (EMR)
• EMR is a managed cluster service running frameworks such as Apache Hadoop or Spark
• Hive – SQL based, Batch Oriented
• Query (not nearly as fast as Athena)
• Basic ETL operations
• Easy
• Spark
• Complex ETL
• Machine Learning
• Graph DB functionality
• SparkR
[Figure: example graph with people (Jay, Rebecca, Marissa, Alex) and the labels Friend, Employer, Neighbor, Marketing, Promotion]
Business Insights -> Discovery & Prediction
Additional Database Options
S3
EMR
ETL
Extraction
Transformation
Loading
Database
Ingest
Access
Amazon DynamoDB
Elasticsearch
Amazon Aurora
Amazon Redshift
Amazon Athena / Presto
Cataloging The Data Lake
Make Your Metadata Available
• Versions
• Content
• Schema
• Names
• Layout
• Enumeration
• Origins
Amazon Elasticsearch Service
Amazon Relational Database Service (RDS)
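One way to picture a catalog entry is as a small record that can be serialized to JSON and stored in a search index or a relational table. The sketch below is illustrative only: the class, its field names, and the sample values are hypothetical, and "Enumeration" from the list above is left out because the slides do not define it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in the lake."""
    name: str
    version: int
    origin: str
    layout: str
    content: str
    schema: dict = field(default_factory=dict)


entry = CatalogEntry(
    name="flight",
    version=1,
    origin="transtats.bts.gov on-time data",
    layout="s3://scalawag.data/fl/",
    content="2016 flights, Jan-Jun, CSV",
    schema={"year": "int", "month": "int"},  # placeholder columns
)

# A JSON document like this could be indexed for search or stored in a table.
document = json.dumps(asdict(entry), sort_keys=True)
print(document)
```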
End User Tools
• RDS, Athena & Redshift Connectivity
• JDBC
• ODBC
• Commercially Available Tools – AWS Marketplace
• AWS QuickSight
• Easy to Integrate
• Per User Per Month Pricing
• SPICE: Super-fast, Parallel, In-memory Calculation Engine
Expense
• S3 – 100 TB, 1 year, $27K
• Athena ($5/TB scanned)
• Parquet & ORC – provide compression
• And columnar retrieval (only the needed columns)
• Redshift (Storage or Compute Oriented Nodes), 5 TB
• Assuming 20% compression
• Storage Intensive - $15K/year
• Compute Intensive - $50-85K/year
• Spark Cluster (assume 6 hr/day, 6 r3.4xlarge worker nodes)
• $25K/year
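The annual figures above imply unit rates that are easier to compare against a current price list. The sketch below backs those rates out of the slide's own numbers; it assumes decimal units (1 TB = 1000 GB) and a 365-day year, and it does not use any published AWS price.

```python
GB_PER_TB = 1000  # assumption: decimal units

# S3: $27K for 100 TB stored for one year.
s3_rate = 27_000 / (100 * GB_PER_TB * 12)  # dollars per GB-month

# Spark: $25K per year for 6 worker nodes running 6 hours a day.
node_hours = 6 * 6 * 365
spark_rate = 25_000 / node_hours  # dollars per node-hour

print(f"Implied S3 rate: ${s3_rate:.4f} per GB-month")
print(f"Implied Spark rate: ${spark_rate:.2f} per node-hour over {node_hours} node-hours")
```

Working backwards like this makes it easy to rerun the estimate with a different data volume or cluster schedule.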
Managing Costs
• S3 Storage (Infrequent Access, Reduced Redundancy)
• Storage Formats: Compression, Columnar Storage, Partitioning
• Complementary Use
• Athena & Redshift
• Spark & Redshift
• Kinesis Analytics
• Lambda: Serverless Tasks
• Reserved and Spot Instances
• Automated Processes & Transitory Resources
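As an example of the first item, moving older objects to Infrequent Access is typically done with a bucket lifecycle rule. The sketch below builds such a rule as a Python dictionary and prints it as JSON in the shape that S3 lifecycle configurations use. The rule ID, the `fl/` prefix, and the 30-day threshold are assumptions chosen for illustration.

```python
import json

lifecycle = {
    "Rules": [
        {
            "ID": "fl-to-infrequent-access",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "fl/"},      # assumed prefix from the demo
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"}
            ],
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```

The printed JSON could be saved to a file and applied with the S3 API or CLI; whether the transition pays off depends on how often the older data is still queried.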
Amazon Builder’s Template
https://aws.amazon.com/answers/big-data/data-lake-solution/
• Better Access Control Policies
• Searchable Data Catalog
• API Access
• User Console
• Monitoring
Ready to deploy template
Summary
• S3
• Provides Excellent Storage Versatility
• Excellent For Data Lake Storage
• Athena Provides Quick Start
• Easy To Manage - No Server
• Cost Effective
• AWS Ecosystem Supports More Complex Solutions
• Integrated Authentication & Security
• Real Time Catalog Updates

Best Practices for Building a Data Lake on AWS
