SlideShare ist ein Scribd-Unternehmen logo
1 von 67
Downloaden Sie, um offline zu lesen
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sourabh Bajaj, Software Engineer, Coursera
October 2015
BDT404
Large-Scale ETL Data Flows
With Data Pipeline and Dataduct
What to Expect from the Session
● Learn about:
• How we use AWS Data Pipeline to manage ETL at Coursera
• Why we built Dataduct, an open source framework from
Coursera for running pipelines
• How Dataduct enables developers to write their own ETL
pipelines for their services
• Best practices for managing pipelines
• How to start using Dataduct for your own pipelines
Coursera
Coursera
Coursera
120
partners
2.5 million
course completions
Education at Scale
15 million
learners worldwide
1300
courses
Data Warehousing at Coursera
Amazon Redshift
167 Amazon Redshift users
1200 EDW tables
22 source systems
6 dc1.8xlarge instances
30,000,000 queries run
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
Data Flow
Amazon
Redshift
Amazon
RDS
Amazon EMR Amazon S3
Event
Feeds
Amazon EC2
Amazon
RDS
Amazon S3
BI Applications
Third Party
Tools
Cassandra
Cassandra
AWS Data Pipeline
ETL at Coursera
150 Active pipelines 44 Dataduct developers
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Requirements for an ETL system
Fault Tolerance Scheduling Dependency
Management
Resource Management Monitoring Easy Development
Dataduct
● Open source wrapper around AWS Data Pipeline
Dataduct
Dataduct
● Open source wrapper around AWS Data Pipeline
● It provides:
• Code reuse
• Extensibility
• Command line interface
• Staging environment support
• Dynamic updates
Dataduct
● Repository
• https://github.com/coursera/dataduct
● Documentation
• http://dataduct.readthedocs.org/en/latest/
● Installation
• pip install dataduct
Let’s build some pipelines
Pipeline 1: Amazon RDS → Amazon Redshift
● Let’s start with a simple pipeline of pulling data from a
relational store to Amazon Redshift
Amazon
Redshift
Amazon
RDS
Amazon S3
Amazon EC2
AWS Data Pipeline
Pipeline 1: Amazon RDS → Amazon Redshift
● Definition in YAML
● Steps
● Shared Config
● Visualization
● Overrides
● Reusable code
Pipeline 1: Amazon RDS → Amazon Redshift
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Extract RDS
• Fetch data from Amazon RDS and output to Amazon S3
Pipeline 1: Amazon RDS → Amazon Redshift
(Steps)
● Create-Load-Redshift
• Create table if it doesn’t exist and load data using COPY.
Pipeline 1: Amazon RDS → Amazon Redshift
● Upsert
• Update and insert into the production table from staging.
Pipeline 1: Amazon RDS → Amazon Redshift
(Tasks)
● Bootstrap
• Fully automated
• Fetch latest binaries from Amazon S3 for Amazon EC2 /
Amazon EMR
• Install any updated dependencies on the resource
• Make sure that the pipeline would run the latest version of code
Pipeline 1: Amazon RDS → Amazon Redshift
● Quality assurance
• Primary key violations in the warehouse.
• Dropped rows: By comparing the number of rows.
• Corrupted rows: By comparing a sample set of rows.
• Automatically done within UPSERT
Pipeline 1: Amazon RDS → Amazon Redshift
● Teardown
• Amazon SNS alerting for failed tasks
• Logging of task failures
• Monitoring
• Run times
• Retries
• Machine health
Pipeline 1: Amazon RDS → Amazon Redshift
(Config)
● Visualization
• Automatically generated
by Dataduct
• Allows easy debugging
Pipeline 1: Amazon RDS → Amazon Redshift
● Shared Config
• IAM roles
• AMI
• Security group
• Retries
• Custom steps
• Resource paths
Pipeline 1: Amazon RDS → Amazon Redshift
● Custom steps
• Open-sourced steps can easily be shared across multiple
pipelines
• You can also create new steps and add them using the config
Deploying a pipeline
● Command line interface for all operations
usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b]
pipeline_definitions
[pipeline_definitions ...]
Pipeline 2: Cassandra → Amazon Redshift
Amazon
Redshift
Amazon EMR
(Scalding)
Amazon S3
Amazon S3 Amazon EMR
(Aegisthus)
Cassandra
AWS Data Pipeline
● Shell command activity to rescue
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to Amazon S3
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to S3
● Aegisthus to parse SSTables into Avro dumps
Pipeline 2: Cassandra → Amazon Redshift
● Shell command activity to rescue
● Priam backups of Cassandra to Amazon S3
● Aegisthus to parse SSTables into Avro dumps
● Scalding to process Aegisthus output
• Extend the base steps to create more patterns
Pipeline 2: Cassandra → Amazon Redshift
Pipeline 2: Cassandra → Amazon Redshift
● Custom steps
• Aegisthus
• Scalding
Pipeline 2: Cassandra → Amazon Redshift
● EMR-Config overrides the defaults
Pipeline 2: Cassandra → Amazon Redshift
● Multiple output nodes from transform step
Pipeline 2: Cassandra → Amazon Redshift
● Bootstrap
• Save every pipeline definition
• Fetching new jar for the Amazon EMR jobs
• Specify the same Hadoop / Hive metastore installation
Pipeline 2: Cassandra → Amazon Redshift
Data products
● We’ve talked about data into the warehouse
● Common pattern:
• Wait for dependencies
• Computation inside redshift to create derived tables
• Amazon EMR activities for more complex process
• Load back into MySQL / Cassandra
• Product feature queries MySQL / Cassandra
● Used in recommendations, dashboards, and search
Recommendations
● Objective
• Connecting the learner to right content
● Use cases:
• Recommendations email
• Course discovery
• Reactivation of the users
Recommendations
● Computation inside Amazon Redshift to create derived
tables for co-enrollments
● Amazon EMR job for model training
● Model file pushed to Amazon S3
● Prediction API uses the updated model file
● Contract between the prediction and the training layer is
via model definition.
Internal Dashboard
● Objective
• Serve internal dashboards to create a data driven culture
● Use Cases
• Key performance indicators for the company
• Track results for different A/B experiments
Internal Dashboard
● Do:
• Monitoring (run times, retries, deploys, query times)
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
Learnings
● Do:
• Monitoring (run times, retries, deploys, query times)
• Code should live in library instead of scripts being passed to
every pipeline
• Test environment in staging should mimic prod
• Shared library to democratize writing of ETL
• Using read-replicas and backups
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
Learnings
● Don’t:
• Same code passed to multiple pipelines as a script
• Non version controlled pipelines
• Really huge pipelines instead of modular small pipelines with
dependencies
• Not catching resource timeouts or load delays
Learnings
Dataduct
● Code reuse
● Extensibility
● Command line interface
● Staging environment support
● Dynamic updates
Dataduct
● Repository
• https://github.com/coursera/dataduct
● Documentation
• http://dataduct.readthedocs.org/en/latest/
● Installation
• pip install dataduct
Questions?
Also, we are hiring!
https://www.coursera.org/jobs
Remember to complete
your evaluations!
Thank you!
Also, we are hiring!
https://www.coursera.org/jobs
Sourabh Bajaj
sb2nov
@sb2nov

Weitere ähnliche Inhalte

Was ist angesagt?

Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectKaufman Ng
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changesconfluent
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleNeha Narkhede
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionjglobal
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamHai Lu
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structuresconfluent
 
Mobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkMobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkSpark Summit
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streamingdatamantra
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...confluent
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Evan Chan
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsLightbend
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsLightbend
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaLightbend
 

Was ist angesagt? (20)

Data Pipelines with Kafka Connect
Data Pipelines with Kafka ConnectData Pipelines with Kafka Connect
Data Pipelines with Kafka Connect
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)
 
Kafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internalsKafka blr-meetup-presentation - Kafka internals
Kafka blr-meetup-presentation - Kafka internals
 
Capture the Streams of Database Changes
Capture the Streams of Database ChangesCapture the Streams of Database Changes
Capture the Streams of Database Changes
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
The Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scaleThe Many Faces of Apache Kafka: Leveraging real-time data at scale
The Many Faces of Apache Kafka: Leveraging real-time data at scale
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in production
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and Beam
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Mobius: C# Language Binding For Spark
Mobius: C# Language Binding For SparkMobius: C# Language Binding For Spark
Mobius: C# Language Binding For Spark
 
Kafka and Spark Streaming
Kafka and Spark StreamingKafka and Spark Streaming
Kafka and Spark Streaming
 
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
Show Me Kafka Tools That Will Increase My Productivity! (Stephane Maarek, Dat...
 
Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015Akka in Production - ScalaDays 2015
Akka in Production - ScalaDays 2015
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Revitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive StreamsRevitalizing Enterprise Integration with Reactive Streams
Revitalizing Enterprise Integration with Reactive Streams
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML Models
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For Scala
 

Andere mochten auch

Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Adri9C
 
Apresentação Nativi
Apresentação Nativi Apresentação Nativi
Apresentação Nativi Renan Ranzani
 
HadoopCompression
HadoopCompressionHadoopCompression
HadoopCompressionDemet Aksoy
 
Proyecto gerencia industrial iupsmpzo.
Proyecto gerencia industrial   iupsmpzo.Proyecto gerencia industrial   iupsmpzo.
Proyecto gerencia industrial iupsmpzo.Yumar Rondon
 
Artículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoArtículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoYumar Rondon
 
Phrases (1) (1)
Phrases (1) (1)Phrases (1) (1)
Phrases (1) (1)ishlive
 
Proceso de manufactura unidad iii
Proceso de manufactura unidad iiiProceso de manufactura unidad iii
Proceso de manufactura unidad iiiYumar Rondon
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxHortonworks
 
Impression techiques / implant dentistry course/ implant dentistry coursevvv
Impression techiques  / implant dentistry course/ implant dentistry coursevvvImpression techiques  / implant dentistry course/ implant dentistry coursevvv
Impression techiques / implant dentistry course/ implant dentistry coursevvvIndian dental academy
 
Hadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantHadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantAkshay Rai
 
Choice Based Credit System
Choice Based Credit SystemChoice Based Credit System
Choice Based Credit SystemMadan Mankotia
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 

Andere mochten auch (20)

AWS_Data_Pipeline
AWS_Data_PipelineAWS_Data_Pipeline
AWS_Data_Pipeline
 
Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia Scotland's castles Rafa Garcia
Scotland's castles Rafa Garcia
 
Apresentação Nativi
Apresentação Nativi Apresentação Nativi
Apresentação Nativi
 
HadoopCompression
HadoopCompressionHadoopCompression
HadoopCompression
 
Proyecto gerencia industrial iupsmpzo.
Proyecto gerencia industrial   iupsmpzo.Proyecto gerencia industrial   iupsmpzo.
Proyecto gerencia industrial iupsmpzo.
 
Artículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreoArtículo sobre consideraciones fundamentales del muestreo
Artículo sobre consideraciones fundamentales del muestreo
 
Nature
NatureNature
Nature
 
Phrases (1) (1)
Phrases (1) (1)Phrases (1) (1)
Phrases (1) (1)
 
Proceso de manufactura unidad iii
Proceso de manufactura unidad iiiProceso de manufactura unidad iii
Proceso de manufactura unidad iii
 
Git in real product
Git in real productGit in real product
Git in real product
 
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks SandboxGo Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
Go Zero to Big Data in 15 Minutes with the Hortonworks Sandbox
 
Data Process Systems, connecting everything
Data Process Systems, connecting everythingData Process Systems, connecting everything
Data Process Systems, connecting everything
 
Agile retrospective
Agile retrospectiveAgile retrospective
Agile retrospective
 
Impression techiques / implant dentistry course/ implant dentistry coursevvv
Impression techiques  / implant dentistry course/ implant dentistry coursevvvImpression techiques  / implant dentistry course/ implant dentistry coursevvv
Impression techiques / implant dentistry course/ implant dentistry coursevvv
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Hadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. ElephantHadoop & Spark Performance tuning using Dr. Elephant
Hadoop & Spark Performance tuning using Dr. Elephant
 
Choice Based Credit System
Choice Based Credit SystemChoice Based Credit System
Choice Based Credit System
 
Hadoop in Healthcare Systems
Hadoop in Healthcare SystemsHadoop in Healthcare Systems
Hadoop in Healthcare Systems
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 

Ähnlich wie Large-Scale ETL Data Flows With Data Pipeline and Dataduct

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & DataductAmazon Web Services
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBAmazon Web Services
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWSPaolo latella
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaHelen Rogers
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...Amazon Web Services
 
Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Thanh Nguyen
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017Pratim Das
 
muCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsmuCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsChris Munns
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersAdrian Hornsby
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaAmazon Web Services
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksAmazon Web Services
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...Amazon Web Services
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722Amazon Web Services
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAmazon Web Services
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream ProcessingLuis Gonzalez
 

Ähnlich wie Large-Scale ETL Data Flows With Data Pipeline and Dataduct (20)

(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
(BDT404) Large-Scale ETL Data Flows w/AWS Data Pipeline & Dataduct
 
Serverless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDBServerless Web Apps using API Gateway, Lambda and DynamoDB
Serverless Web Apps using API Gateway, Lambda and DynamoDB
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Aws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon ElishaAws-What You Need to Know_Simon Elisha
Aws-What You Need to Know_Simon Elisha
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
AWS re:Invent 2016: JustGiving: Serverless Data Pipelines, Event-Driven ETL, ...
 
Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern Migrating Monolithic Applications with the Strangler Pattern
Migrating Monolithic Applications with the Strangler Pattern
 
London Redshift Meetup - July 2017
London Redshift Meetup - July 2017London Redshift Meetup - July 2017
London Redshift Meetup - July 2017
 
muCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless ApplicationsmuCon 2017 - 12 Factor Serverless Applications
muCon 2017 - 12 Factor Serverless Applications
 
Managing Your Cloud Assets
Managing Your Cloud AssetsManaging Your Cloud Assets
Managing Your Cloud Assets
 
Journey Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million UsersJourney Towards Scaling Your Application to Million Users
Journey Towards Scaling Your Application to Million Users
 
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS LambdaReal-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
Real-time Data Processing with Amazon DynamoDB Streams and AWS Lambda
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech TalksDeep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
AWS re:Invent 2016: Workshop: Using the Database Migration Service (DMS) for ...
 
AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722AWS July Webinar Series: Amazon redshift migration and load data 20150722
AWS July Webinar Series: Amazon redshift migration and load data 20150722
 
AWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS OracleAWS Webcast - Migrating to RDS Oracle
AWS Webcast - Migrating to RDS Oracle
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines,  API, Messaging and Stream ProcessingJustGiving – Serverless Data Pipelines,  API, Messaging and Stream Processing
JustGiving – Serverless Data Pipelines, API, Messaging and Stream Processing
 

Kürzlich hochgeladen

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...Shane Coughlan
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Hararemasabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2
 

Kürzlich hochgeladen (20)

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 

Large-Scale ETL Data Flows With Data Pipeline and Dataduct

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sourabh Bajaj, Software Engineer, Coursera October 2015 BDT404 Large-Scale ETL Data Flows With Data Pipeline and Dataduct
  • 2. What to Expect from the Session ● Learn about: • How we use AWS Data Pipeline to manage ETL at Coursera • Why we built Dataduct, an open source framework from Coursera for running pipelines • How Dataduct enables developers to write their own ETL pipelines for their services • Best practices for managing pipelines • How to start using Dataduct for your own pipelines
  • 6. 120 partners 2.5 million course completions Education at Scale 15 million learners worldwide 1300 courses
  • 7. Data Warehousing at Coursera Amazon Redshift 167 Amazon Redshift users 1200 EDW tables 22 source systems 6 dc1.8xlarge instances 30,000,000 queries run
  • 8. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 9. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 10. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 11. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 12. Data Flow Amazon Redshift Amazon RDS Amazon EMR Amazon S3 Event Feeds Amazon EC2 Amazon RDS Amazon S3 BI Applications Third Party Tools Cassandra Cassandra AWS Data Pipeline
  • 13. ETL at Coursera 150 Active pipelines 44 Dataduct developers
  • 14. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 15. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 16. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 17. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 18. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 19. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 20. Requirements for an ETL system Fault Tolerance Scheduling Dependency Management Resource Management Monitoring Easy Development
  • 22. ● Open source wrapper around AWS Data Pipeline Dataduct
  • 23. Dataduct ● Open source wrapper around AWS Data Pipeline ● It provides: • Code reuse • Extensibility • Command line interface • Staging environment support • Dynamic updates
  • 24. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 25. Let’s build some pipelines
  • 26. Pipeline 1: Amazon RDS → Amazon Redshift ● Let’s start with a simple pipeline of pulling data from a relational store to Amazon Redshift Amazon Redshift Amazon RDS Amazon S3 Amazon EC2 AWS Data Pipeline
  • 27. Pipeline 1: Amazon RDS → Amazon Redshift
  • 28. ● Definition in YAML ● Steps ● Shared Config ● Visualization ● Overrides ● Reusable code Pipeline 1: Amazon RDS → Amazon Redshift
  • 29. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Extract RDS • Fetch data from Amazon RDS and output to Amazon S3
  • 30. Pipeline 1: Amazon RDS → Amazon Redshift (Steps) ● Create-Load-Redshift • Create table if it doesn’t exist and load data using COPY.
  • 31. Pipeline 1: Amazon RDS → Amazon Redshift ● Upsert • Update and insert into the production table from staging.
  • 32. Pipeline 1: Amazon RDS → Amazon Redshift (Tasks) ● Bootstrap • Fully automated • Fetch latest binaries from Amazon S3 for Amazon EC2 / Amazon EMR • Install any updated dependencies on the resource • Make sure that the pipeline would run the latest version of code
  • 33. Pipeline 1: Amazon RDS → Amazon Redshift ● Quality assurance • Primary key violations in the warehouse. • Dropped rows: By comparing the number of rows. • Corrupted rows: By comparing a sample set of rows. • Automatically done within UPSERT
  • 34. Pipeline 1: Amazon RDS → Amazon Redshift ● Teardown • Amazon SNS alerting for failed tasks • Logging of task failures • Monitoring • Run times • Retries • Machine health
  • 35. Pipeline 1: Amazon RDS → Amazon Redshift (Config) ● Visualization • Automatically generated by Dataduct • Allows easy debugging
  • 36. Pipeline 1: Amazon RDS → Amazon Redshift ● Shared Config • IAM roles • AMI • Security group • Retries • Custom steps • Resource paths
  • 37. Pipeline 1: Amazon RDS → Amazon Redshift ● Custom steps • Open-sourced steps can easily be shared across multiple pipelines • You can also create new steps and add them using the config
  • 38. Deploying a pipeline ● Command line interface for all operations usage: dataduct pipeline activate [-h] [-m MODE] [-f] [-t TIME_DELTA] [-b] pipeline_definitions [pipeline_definitions ...]
  • 39. Pipeline 2: Cassandra → Amazon Redshift Amazon Redshift Amazon EMR (Scalding) Amazon S3 Amazon S3 Amazon EMR (Aegisthus) Cassandra AWS Data Pipeline
  • 40. ● Shell command activity to rescue Pipeline 2: Cassandra → Amazon Redshift
  • 41. ● Shell command activity to rescue ● Priam backups of Cassandra to Amazon S3 Pipeline 2: Cassandra → Amazon Redshift
  • 42. ● Shell command activity to rescue ● Priam backups of Cassandra to S3 ● Aegisthus to parse SSTables into Avro dumps Pipeline 2: Cassandra → Amazon Redshift
  • 43. ● Shell command activity to rescue ● Priam backups of Cassandra to Amazon S3 ● Aegisthus to parse SSTables into Avro dumps ● Scalding to process Aegisthus output • Extend the base steps to create more patterns Pipeline 2: Cassandra → Amazon Redshift
  • 44. Pipeline 2: Cassandra → Amazon Redshift
  • 45. ● Custom steps • Aegisthus • Scalding Pipeline 2: Cassandra → Amazon Redshift
  • 46. ● EMR-Config overrides the defaults Pipeline 2: Cassandra → Amazon Redshift
  • 47. ● Multiple output nodes from transform step Pipeline 2: Cassandra → Amazon Redshift
  • 48. ● Bootstrap • Save every pipeline definition • Fetching new jar for the Amazon EMR jobs • Specify the same Hadoop / Hive metastore installation Pipeline 2: Cassandra → Amazon Redshift
  • 49. Data products ● We’ve talked about data into the warehouse ● Common pattern: • Wait for dependencies • Computation inside redshift to create derived tables • Amazon EMR activities for more complex process • Load back into MySQL / Cassandra • Product feature queries MySQL / Cassandra ● Used in recommendations, dashboards, and search
  • 50. Recommendations ● Objective • Connecting the learner to right content ● Use cases: • Recommendations email • Course discovery • Reactivation of the users
  • 51. Recommendations ● Computation inside Amazon Redshift to create derived tables for co-enrollments ● Amazon EMR job for model training ● Model file pushed to Amazon S3 ● Prediction API uses the updated model file ● Contract between the prediction and the training layer is via model definition.
  • 52. Internal Dashboard ● Objective • Serve internal dashboards to create a data driven culture ● Use Cases • Key performance indicators for the company • Track results for different A/B experiments
  • 54. ● Do: • Monitoring (run times, retries, deploys, query times) Learnings
  • 55. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline Learnings
  • 56. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod Learnings
  • 57. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL Learnings
  • 58. ● Do: • Monitoring (run times, retries, deploys, query times) • Code should live in library instead of scripts being passed to every pipeline • Test environment in staging should mimic prod • Shared library to democratize writing of ETL • Using read-replicas and backups Learnings
  • 59. ● Don’t: • Same code passed to multiple pipelines as a script Learnings
  • 60. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines Learnings
  • 61. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies Learnings
  • 62. ● Don’t: • Same code passed to multiple pipelines as a script • Non version controlled pipelines • Really huge pipelines instead of modular small pipelines with dependencies • Not catching resource timeouts or load delays Learnings
  • 63. Dataduct ● Code reuse ● Extensibility ● Command line interface ● Staging environment support ● Dynamic updates
  • 64. Dataduct ● Repository • https://github.com/coursera/dataduct ● Documentation • http://dataduct.readthedocs.org/en/latest/ ● Installation • pip install dataduct
  • 65. Questions? Also, we are hiring! https://www.coursera.org/jobs
  • 67. Thank you! Also, we are hiring! https://www.coursera.org/jobs Sourabh Bajaj sb2nov @sb2nov