SlideShare ist ein Scribd-Unternehmen logo
1 von 76
Downloaden Sie, um offline zu lesen
November 13, 2014 | Las Vegas | NV 
Roy Ben-Alta (Big Data Analytics Practice Lead, AWS) Thomas Barthelemy (Software Engineer, Coursera)
Amazon 
Redshift 
Amazon 
EMR 
Amazon 
EC2 
Process & Analyze 
Store 
AWS Direct 
Connect 
Amazon 
S3 
Amazon 
Kinesis 
Amazon 
Glacier 
AWS 
Import/Export 
DynamoDB 
Collect
Billing 
DB 
CSV files 
CRMDB 
Extract 
Transform 
Load 
DWH 
DB 
DB 
DB 
ETL
Amazon 
Redshift 
Amazon 
RDS 
Amazon S3 as 
Data Lake 
Amazon 
DynamoDB 
Amazon S3 as Amazon EMR 
Landing Zone 
AWS Data Pipeline 
Data 
Feeds 
Amazon EC2
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp- program-pipeline.html
Amazon S3 as Data 
Lake/Landing Zone 
Amazon EMR 
as ETL Grid 
Amazon Redshift – 
Production DWH 
Visualization 
Logs
Define 
aws datapipeline create-pipeline –name myETL--unique-id token 
--output: df-09222142C63VXJU0HC0A 
Import 
aws datapipeline put-pipeline-definition --pipeline-id df- 09222142C63VXJU0HC0A --pipeline-definition /home/repo/etl_reinvent.json 
Activate 
aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
copy weblogs 
from 's3://staging/' 
credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>' 
delimiter ','
Coursera Big Data Analytics Powered by AWS 
Thomas Barthelemy 
SWE, Data Infrastructure 
thomas@coursera.org
Overview 
●About Coursera 
●Phase 1: Consolidate data 
●Phase 2: Get users hooked 
●Phase 3: Increase reliability 
●Looking forward
Coursera at a Glance
About Coursera 
●Platform for Massive Open Online Courses 
●Universities create the content 
●Content free to the public
Coursera Stats 
●~10 million learners 
●+110 university partners 
●>200 courses open now 
●~170 employees
The value of data at Coursera 
●Making strategic and tactical decisions 
●Studying pedagogy
Becoming More Data-Driven 
●Since the early days, Coursera has understood the value of data 
oFounders came from machine learning 
oMany of the early employees researched with the founders 
●Cost of data access was high 
oeach analysis requires extraction, pre-processing 
odata only available to data scientists + engineers
Phase 1: 
Consolidate data
Sources 
●MySQL 
oSite data 
oCourse data sharded across multiple database 
●Cassandra increasingly used for course data 
●Logged event data 
●External APIs 
Classes 
Classes 
SQL 
Sharded SQL 
NoSQL 
Logs 
External APIs
(Obligatory meme)
What platforms to use? 
●Amazon Redshifthad glowing recommendations 
●AWS Data Pipelinehas native support for various Amazon services
ETL development was slow :( 
●Slow to do the following in the console: 
ocreate one pipeline 
ocreate similar pipelines 
oupdate existing pipelines
Solution: Programmatically create pipelines 
●Break ETL into reusable steps 
oExtract from variety of sources 
oTransform data with Amazon EMR, Amazon EC2, or within Amazon Redshift 
oLoad data principally into Amazon Redshift 
●Use Amazon S3 as intermediate state for Amazon EMR-or Amazon EC2-based transformations 
●Map stepsto set of AWS Data Pipeline objects
Example step: extract-rds 
●Parameters: hostname, database name, SQL 
●Creates pipeline objects: 
S3 Node can be used by many other step types
Example load definition 
steps: 
-type:extract-from-rds 
sql:| SELECT instructor_id, course_id, rank 
FROM courses_instructorincourse; 
hostname:maestro-read-replica 
database:maestro 
-type:load-into-staging-table 
table: staging.maestro_instructors_sessions 
-type:reload-prod-table 
source:staging.maestro_instructors_sessions 
destination:prod.instructors_sessions 
Extract data from Amazon RDS 
Load intermediate table in Amazon Redshift 
Reload target table with new data
ETL –Amazon RDS 
SQL 
AmazonS3 
AmazonRedshift 
Extract 
Load
ETL –Sharded RDS 
AmazonS3 
AmazonS3 
AmazonRedshift 
SQL 
AmazonS3 
AmazonEMR 
AmazonS3 
AmazonEMR 
AmazonEMR 
Transform 
Extract 
Load
ETL –Logs 
Logs 
Amazon EC2 
Amazon S3 
Amazon S3 
AmazonEMR 
AmazonRedshift 
Amazon S3 
Transform 
Load 
Extract
Reporting Model, Dec 2013
Reporting Model, Sep 2014
AWS Data Pipeline 
●Easily handles starting/stopping of resources 
●Handles permissions, roles 
●Integrates with other AWS services 
●Handles “flow” of data, data dependencies
Dealing with large pipelines 
●Monolithic pipelines hard to maintain 
●Moved to making pipelines smaller 
oHooray modularity! 
●If pipeline B depended on pipeline A, just schedule it later 
oAdd a time buffer just to be safe
Setting cross-pipeline dependencies 
●Dependencies accomplished using a script that would wait until dependencies finished 
oShellCommandActivityto the rescue
The beauty of ShellCommandActivity 
●You can use it anywhere 
oAccomplish tasks that have no corresponding activity type 
oOverride native AWS Data Pipeline support if it does not meet your needs
ETL library 
●Install on machine as first step of each pipeline 
oWith ShellCommandActivity 
●Allows for even more modularity
Phase 2: 
Get users hooked
We have data. Now what? 
●Simply collecting data will not make a company data-driven 
●First step: make data easier for the analysts to use
Certifying a version of the truth 
●Data Infrastructure team creates a 3NF model of the data 
●Model is designed to be as interpretable as possible
Example: Forums 
●Full-word names for database objects 
●Join keys made obvious through universally- unique column names 
oe.g. session_id joins to session_id 
●Auto-generated ER-diagram, so documentation is up-to-date
Some power users need data faster! 
●Added new schemas for developers to load data 
●Help us to remain data driven during early phases of… 
oProduct release 
oAnalyzing new data
Lightweight SQL interface 
●Access via browser 
oUsers don’t need database credentials or special software 
●Data exportable as CSV 
●Auto-complete for DB object names 
●Took ~2 days to make
Lightweight SQL interface
But what about people who don’t know SQL? 
●Reporting tools to the rescue! 
●In-house dashboard framework to support: 
oSupport 
oGrowth 
oData Infrastructure 
oMany other teams... 
●Tableau for interactive analysis
But what about people who don’t know SQL?
Phase 3: 
Increase reliability
If you build it, they will come 
●Built ecosystem of data consumers 
oAnalysts 
oBusiness users 
oProduct team
Maintaining the ecosystem 
●As we accumulate consumers, need to invest effort in... 
oModel stability 
oData quality 
oData availability
Persist Amazon Redshift logs to keep track of usage 
●Leverage database credentials to understand usage patterns 
●Know whom to contact when 
oNeed to change the schema 
oNeed to add a new set of tables similar to an existing set (e.g. assessments)
Data Quality 
●Quality of source systems (GIGO) 
oEncourage source system to fix issues 
●Quality of transformations 
oResponsibility of the data infrastructure team
Quality Transformations 
●Automated QA checks as part of ETL 
oAgain, use ShellCommandActivity 
●Key factors 
oCounts 
oValue comparisons 
oModeling rules (primary keys)
Ensuring High Reliability 
●Step retries catch most minor hiccups 
●On-call system to recover failed pipelines 
●Persist DB logs to keep track of load times 
oDelayed loads can also alert on-call devs 
oBonus: users know how fresh the data is 
oBonus: can keep track of success rate 
●AWS very helpful in debugging issues
Looking Forward
Current bottlenecks 
●Complex ETL 
oHow do we easily ETL complex, nested structures? 
●Developer bottleneck 
oHow do we allow users to access raw data more quickly?
Amazon S3 as principal data store 
●Amazon S3 to store raw data as soon as it’s produced (data lake) 
●Amazon EMR to process data 
●Amazon Redshift to remain as analytic platform
Thank you!
Want to learn more? 
●github.com/coursera/dataduct 
●http://tech.coursera.org 
●thomas@coursera.org
http://blogs.aws.amazon.com/bigdata/ http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html
http://bit.ly/awsevals

Weitere ähnliche Inhalte

Was ist angesagt?

Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationMatthew W. Bowers
 
Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Amazon Web Services
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best PracticesMatillion
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech Talks
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech TalksDeep Dive on Amazon QuickSight - January 2017 AWS Online Tech Talks
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech TalksAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...Cathrine Wilhelmsen
 
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...Amazon Web Services Korea
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Web Services
 
Heterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCTHeterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCTAmazon Web Services
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019Amazon Web Services Korea
 

Was ist angesagt? (20)

Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar Presentation
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...Designing security & governance via AWS Control Tower & Organizations - SEC30...
Designing security & governance via AWS Control Tower & Organizations - SEC30...
 
AWS Technical Essentials Day
AWS Technical Essentials DayAWS Technical Essentials Day
AWS Technical Essentials Day
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech Talks
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech TalksDeep Dive on Amazon QuickSight - January 2017 AWS Online Tech Talks
Deep Dive on Amazon QuickSight - January 2017 AWS Online Tech Talks
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Azure 101
Azure 101Azure 101
Azure 101
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...
Lessons Learned: Understanding Pipeline Pricing in Azure Data Factory and Azu...
 
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...
AWS 비용 효율화를 고려한 Reserved Instance + Savings Plan 옵션 - 박윤 어카운트 매니저 :: AWS Game...
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview Amazon Athena Capabilities and Use Cases Overview
Amazon Athena Capabilities and Use Cases Overview
 
Enterprise Applications on AWS
Enterprise Applications on AWSEnterprise Applications on AWS
Enterprise Applications on AWS
 
Heterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCTHeterogenous Migration with DMS & SCT
Heterogenous Migration with DMS & SCT
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019
AWS를 활용한 글로벌 오피스 업무 환경 구축하기 - 류한진, 이랜드시스템스 :: AWS Summit Seoul 2019
 

Andere mochten auch

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012Amazon Web Services
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Serverjoeharris76
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesAmazon Web Services
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)Amazon Web Services
 
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best PracticesAmazon Web Services
 
Ruby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesRuby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesNiranjan Sarade
 
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookCluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookRandy Zwitch
 
SQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISSQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISMarc Leinbach
 
Aws meetup ssm
Aws meetup ssmAws meetup ssm
Aws meetup ssmAdam Book
 

Andere mochten auch (20)

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012BDT201 AWS Data Pipeline - AWS re: Invent 2012
BDT201 AWS Data Pipeline - AWS re: Invent 2012
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Migration to Redshift from SQL Server
Migration to Redshift from SQL ServerMigration to Redshift from SQL Server
Migration to Redshift from SQL Server
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)
 
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Ruby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examplesRuby's Arrays and Hashes with examples
Ruby's Arrays and Hashes with examples
 
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython NotebookCluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
Cluster Computing for $0.27/hr using Amazon EC2 and IPython Notebook
 
Sei dmm-intro1
Sei dmm-intro1Sei dmm-intro1
Sei dmm-intro1
 
SQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSISSQL Server to Redshift Data Load Using SSIS
SQL Server to Redshift Data Load Using SSIS
 
Aws meetup ssm
Aws meetup ssmAws meetup ssm
Aws meetup ssm
 

Ähnlich wie (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722Amazon Web Services
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901WeCloudData
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)Amazon Web Services Korea
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase Türkiye
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platformDavid Walker
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - DatalakeLam Le
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWSDatavail
 
Amazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsAmazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsCan Abacıgil
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowDatabricks
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSjavier ramirez
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon MLJulien SIMON
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousingSneha Challa
 

Ähnlich wie (BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014 (20)

AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
Windows on AWS
Windows on AWSWindows on AWS
Windows on AWS
 
Amazon Redshift For Data Analysts
Amazon Redshift For Data AnalystsAmazon Redshift For Data Analysts
Amazon Redshift For Data Analysts
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Building prediction models with Amazon Redshift and Amazon ML
Building prediction models with  Amazon Redshift and Amazon MLBuilding prediction models with  Amazon Redshift and Amazon ML
Building prediction models with Amazon Redshift and Amazon ML
 
Agile data warehousing
Agile data warehousingAgile data warehousing
Agile data warehousing
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Kürzlich hochgeladen

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Kürzlich hochgeladen (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift | AWS re:Invent 2014

  • 1. November 13, 2014 | Las Vegas | NV Roy Ben-Alta (Big Data Analytics Practice Lead, AWS) Thomas Barthelemy (Software Engineer, Coursera)
  • 2. Amazon Redshift Amazon EMR Amazon EC2 Process & Analyze Store AWS Direct Connect Amazon S3 Amazon Kinesis Amazon Glacier AWS Import/Export DynamoDB Collect
  • 3. Billing DB CSV files CRMDB Extract Transform Load DWH DB DB DB ETL
  • 4.
  • 5. Amazon Redshift Amazon RDS Amazon S3 as Data Lake Amazon DynamoDB Amazon S3 as Amazon EMR Landing Zone AWS Data Pipeline Data Feeds Amazon EC2
  • 6.
  • 7.
  • 8.
  • 10. Amazon S3 as Data Lake/Landing Zone Amazon EMR as ETL Grid Amazon Redshift – Production DWH Visualization Logs
  • 11.
  • 12. Define aws datapipeline create-pipeline –name myETL--unique-id token --output: df-09222142C63VXJU0HC0A Import aws datapipeline put-pipeline-definition --pipeline-id df- 09222142C63VXJU0HC0A --pipeline-definition /home/repo/etl_reinvent.json Activate aws datapipeline activate-pipeline --pipeline-id df-09222142C63VXJU0HC0A
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19. copy weblogs from 's3://staging/' credentials 'aws_access_key_id=<my access key>;aws_secret_access_key=<my secret key>' delimiter ','
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. Coursera Big Data Analytics Powered by AWS Thomas Barthelemy SWE, Data Infrastructure thomas@coursera.org
  • 29. Overview ●About Coursera ●Phase 1: Consolidate data ●Phase 2: Get users hooked ●Phase 3: Increase reliability ●Looking forward
  • 30. Coursera at a Glance
  • 31. About Coursera ●Platform for Massive Open Online Courses ●Universities create the content ●Content free to the public
  • 32. Coursera Stats ●~10 million learners ●+110 university partners ●>200 courses open now ●~170 employees
  • 33. The value of data at Coursera ●Making strategic and tactical decisions ●Studying pedagogy
  • 34. Becoming More Data-Driven ●Since the early days, Coursera has understood the value of data oFounders came from machine learning oMany of the early employees researched with the founders ●Cost of data access was high oeach analysis requires extraction, pre-processing odata only available to data scientists + engineers
  • 36. Sources ●MySQL oSite data oCourse data sharded across multiple database ●Cassandra increasingly used for course data ●Logged event data ●External APIs Classes Classes SQL Sharded SQL NoSQL Logs External APIs
  • 38. What platforms to use? ●Amazon Redshifthad glowing recommendations ●AWS Data Pipelinehas native support for various Amazon services
  • 39. ETL development was slow :( ●Slow to do the following in the console: ocreate one pipeline ocreate similar pipelines oupdate existing pipelines
  • 40. Solution: Programmatically create pipelines ●Break ETL into reusable steps oExtract from variety of sources oTransform data with Amazon EMR, Amazon EC2, or within Amazon Redshift oLoad data principally into Amazon Redshift ●Use Amazon S3 as intermediate state for Amazon EMR-or Amazon EC2-based transformations ●Map stepsto set of AWS Data Pipeline objects
  • 41. Example step: extract-rds ●Parameters: hostname, database name, SQL ●Creates pipeline objects: S3 Node can be used by many other step types
  • 42. Example load definition steps: -type:extract-from-rds sql:| SELECT instructor_id, course_id, rank FROM courses_instructorincourse; hostname:maestro-read-replica database:maestro -type:load-into-staging-table table: staging.maestro_instructors_sessions -type:reload-prod-table source:staging.maestro_instructors_sessions destination:prod.instructors_sessions Extract data from Amazon RDS Load intermediate table in Amazon Redshift Reload target table with new data
  • 43. ETL –Amazon RDS SQL AmazonS3 AmazonRedshift Extract Load
  • 44. ETL –Sharded RDS AmazonS3 AmazonS3 AmazonRedshift SQL AmazonS3 AmazonEMR AmazonS3 AmazonEMR AmazonEMR Transform Extract Load
  • 45. ETL –Logs Logs Amazon EC2 Amazon S3 Amazon S3 AmazonEMR AmazonRedshift Amazon S3 Transform Load Extract
  • 48. AWS Data Pipeline ●Easily handles starting/stopping of resources ●Handles permissions, roles ●Integrates with other AWS services ●Handles “flow” of data, data dependencies
  • 49. Dealing with large pipelines ●Monolithic pipelines hard to maintain ●Moved to making pipelines smaller oHooray modularity! ●If pipeline B depended on pipeline A, just schedule it later oAdd a time buffer just to be safe
  • 50. Setting cross-pipeline dependencies ●Dependencies accomplished using a script that would wait until dependencies finished oShellCommandActivityto the rescue
  • 51. The beauty of ShellCommandActivity ●You can use it anywhere oAccomplish tasks that have no corresponding activity type oOverride native AWS Data Pipeline support if it does not meet your needs
  • 52. ETL library ●Install on machine as first step of each pipeline oWith ShellCommandActivity ●Allows for even more modularity
  • 53. Phase 2: Get users hooked
  • 54. We have data. Now what? ●Simply collecting data will not make a company data-driven ●First step: make data easier for the analysts to use
  • 55. Certifying a version of the truth ●Data Infrastructure team creates a 3NF model of the data ●Model is designed to be as interpretable as possible
  • 56. Example: Forums ●Full-word names for database objects ●Join keys made obvious through universally- unique column names oe.g. session_id joins to session_id ●Auto-generated ER-diagram, so documentation is up-to-date
  • 57. Some power users need data faster! ●Added new schemas for developers to load data ●Help us to remain data driven during early phases of… oProduct release oAnalyzing new data
  • 58. Lightweight SQL interface ●Access via browser oUsers don’t need database credentials or special software ●Data exportable as CSV ●Auto-complete for DB object names ●Took ~2 days to make
  • 60. But what about people who don’t know SQL? ●Reporting tools to the rescue! ●In-house dashboard framework to support: oSupport oGrowth oData Infrastructure oMany other teams... ●Tableau for interactive analysis
  • 61. But what about people who don’t know SQL?
  • 62. Phase 3: Increase reliability
  • 63. If you build it, they will come ●Built ecosystem of data consumers oAnalysts oBusiness users oProduct team
  • 64. Maintaining the ecosystem ●As we accumulate consumers, need to invest effort in... oModel stability oData quality oData availability
  • 65. Persist Amazon Redshift logs to keep track of usage ●Leverage database credentials to understand usage patterns ●Know whom to contact when oNeed to change the schema oNeed to add a new set of tables similar to an existing set (e.g. assessments)
  • 66. Data Quality ●Quality of source systems (GIGO) oEncourage source system to fix issues ●Quality of transformations oResponsibility of the data infrastructure team
  • 67. Quality Transformations ●Automated QA checks as part of ETL oAgain, use ShellCommandActivity ●Key factors oCounts oValue comparisons oModeling rules (primary keys)
  • 68. Ensuring High Reliability ●Step retries catch most minor hiccups ●On-call system to recover failed pipelines ●Persist DB logs to keep track of load times oDelayed loads can also alert on-call devs oBonus: users know how fresh the data is oBonus: can keep track of success rate ●AWS very helpful in debugging issues
  • 70. Current bottlenecks ●Complex ETL oHow do we easily ETL complex, nested structures? ●Developer bottleneck oHow do we allow users to access raw data more quickly?
  • 71. Amazon S3 as principal data store ●Amazon S3 to store raw data as soon as it’s produced (data lake) ●Amazon EMR to process data ●Amazon Redshift to remain as analytic platform
  • 73. Want to learn more? ●github.com/coursera/dataduct ●http://tech.coursera.org ●thomas@coursera.org
  • 74.