Design. Build. Data
Pipelines
Nature of our systems
[Diagram labels: API, Components, Front-End, Aggregation, Pipeline, Database, Caches, Secrets, Proxies, Machine Learning]
Really, don’t
Deciding what to do
Systems Design in Data Pipeline
*Focal Points
• Black boxes

• Data flow patterns

• Particularly important when you are designing for a migration

• Data correctness requirements

• Resist the temptation to build the “ideal” system
* Might be different for you
REALISE THIS
IS THIS
“You cannot change something if you don’t understand how it works. More IMPORTANTLY, you cannot change something if you don’t understand why it works the way it does.”
- Unknown
Team dynamics
• Know where the team is at

• Know where the team should be (roughly)

• Know how to effect change through the team, effectively

• Training / re-training

• Changing mindsets is the hardest part!
Arming the team
• Recognise that learning requires time
• Recognise that applying the learnt knowledge requires time
• Recognise that being effective at applying knowledge requires time
There is NO perfect data architecture
What you need now is going to be different from what you need in the future
Create a Culture of Learning & Appetite for Adventure
This is really important
API : Model : Engine
• A proper abstraction that supports both streaming and batch processing
• Decomposes a pipeline into four questions:
  • What
  • Where
  • When
  • How
• Separates data processing from the underlying physical implementation
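The four questions can be made concrete with a small sketch. In the Dataflow model they stand for: what is computed, where in event time results are grouped, when in processing time results are emitted, and how successive results for the same window relate. The code below is plain Python, not the Beam API; `fixed_windows`, `run`, `fire_every` and `accumulate` are hypothetical names chosen for illustration, and the element-count trigger is a simplification of what a real runner offers.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int, int]    # (key, value, event_time_seconds)
Window = Tuple[int, int]        # half-open [start, end) in event time
Pane = Tuple[str, Window, int]  # (key, window, emitted result)


def fixed_windows(size: int) -> Callable[[int], Window]:
    """WHERE: map an event time to the fixed window that contains it."""
    def assign(event_time: int) -> Window:
        start = (event_time // size) * size
        return (start, start + size)
    return assign


def run(events: List[Event],
        what: Callable[[int, int], int],
        where: Callable[[int], Window],
        fire_every: int,
        accumulate: bool) -> List[Pane]:
    """Process events in arrival order and return the emitted panes.

    what       -- WHAT: how two values are combined (e.g. a sum)
    where      -- WHERE: the window assigner
    fire_every -- WHEN: emit after this many new elements per key and window
    accumulate -- HOW: True keeps earlier values in later panes,
                  False discards state after each firing
    """
    state: Dict[Tuple[str, Window], int] = {}
    pending: Dict[Tuple[str, Window], int] = defaultdict(int)
    panes: List[Pane] = []
    for key, value, event_time in events:
        slot = (key, where(event_time))
        state[slot] = what(state[slot], value) if slot in state else value
        pending[slot] += 1
        if pending[slot] == fire_every:
            panes.append((key, slot[1], state[slot]))
            pending[slot] = 0
            if not accumulate:
                del state[slot]
    # End of input: flush whatever has not been emitted yet.
    for slot, count in pending.items():
        if count > 0:
            panes.append((slot[0], slot[1], state[slot]))
    return panes


# The last event arrives out of order: event time 7 after event time 12.
events = [("a", 1, 0), ("a", 2, 5), ("a", 3, 12), ("a", 4, 7)]
add = lambda x, y: x + y
accumulating = run(events, add, fixed_windows(10), fire_every=2, accumulate=True)
discarding = run(events, add, fixed_windows(10), fire_every=2, accumulate=False)
```

The processing logic (`add`) never changes between the two runs; only the answers to "when" and "how" do. That is the separation the slide refers to: the same logic can be reused while windowing, triggering and the execution engine vary.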
Beam
* Read Google’s VLDB paper (see References)
Why Beam - Pipeline decomposition
* Source: https://data-artisans.com/blog/why-apache-beam
Why Beam - Programming model
* Source: https://data-artisans.com/blog/why-apache-beam
DSL <=> Beam pipeline
DSL <=> Beam pipeline
Data types
Patterns
Monads ∈ DSL
Monads ∈ DSL
Monad Transformers ∈ DSL
Compute. Scaling Compute.
Diverse Workloads
Data Architecture
What is Mesos
Read the technical paper (see References)
Why Mesos - Part 1
• Our DSL’s scheduling logic is greatly simplified because we don’t have to consider:
  • Framework requirements
  • Resource availability
  • Organizational policies
  • Global schedule of tasks
Why Mesos - Part 2
• Beam pipelines are scheduled by the DSL

• Developers focus on building Beam jobs; the DSL strings the jobs together

• Developers need not worry about where resources are; Mesos’s resource-offer mechanism solves that.
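A rough sketch of the resource-offer idea follows. In Mesos, the master offers available resources to a framework, and the framework's scheduler decides which offers to use and which to decline. The code is plain Python, not the Mesos API; `Offer`, `Task`, `schedule` and the agent and job names are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Offer:
    """Resources one agent makes available in this offer round."""
    agent: str
    cpus: float
    mem_mb: int


@dataclass(frozen=True)
class Task:
    """One job in the chain, with the resources it needs."""
    name: str
    cpus: float
    mem_mb: int


def schedule(offers: List[Offer],
             tasks: List[Task]) -> Tuple[Dict[str, str], List[str]]:
    """Place each task, in order, on the first offer with enough room left.

    Returns (placements, declined): a task-name to agent mapping, and the
    agents whose offers went unused. Tasks that fit nowhere are left out
    of the mapping and would wait for a later offer round.
    """
    remaining = {o.agent: (o.cpus, o.mem_mb) for o in offers}
    placements: Dict[str, str] = {}
    for task in tasks:
        for offer in offers:
            cpus, mem = remaining[offer.agent]
            if task.cpus <= cpus and task.mem_mb <= mem:
                remaining[offer.agent] = (cpus - task.cpus, mem - task.mem_mb)
                placements[task.name] = offer.agent
                break
    used = set(placements.values())
    declined = [o.agent for o in offers if o.agent not in used]
    return placements, declined


offers = [
    Offer("agent-1", cpus=2.0, mem_mb=4096),
    Offer("agent-2", cpus=8.0, mem_mb=16384),
    Offer("agent-3", cpus=0.5, mem_mb=512),
]
jobs = [
    Task("ingest", cpus=1.0, mem_mb=2048),
    Task("aggregate", cpus=4.0, mem_mb=8192),
    Task("export", cpus=1.0, mem_mb=2048),
]
placements, declined = schedule(offers, jobs)
```

The scheduler only ever reacts to what it is offered. It never has to discover machines or track cluster-wide availability, which is why the scheduling logic on the DSL side can stay small.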
All the architectural decisions should favour enabling the system to adapt to change
Observations
• There is NO perfect data architecture
• What you need now is going to be different from what you need in the future
• Build a team that adapts to change; learning is key.
References
• Dataflow / Apache Beam - Eugene Kirpichov

• The Dataflow Model - Tyler Akidau, Sam Whittle et al

• MillWheel : Fault-Tolerant Stream Processing at Internet Scale - Tyler Akidau, Sam Whittle et al

• FlumeJava : Easy, Efficient Parallel Data Pipelines - Craig Chambers, Nathan Weizenbaum et al

• Mesos - Matei Zaharia et al

• Why Curiosity Matters - Harvard Business Review September 2018

• Spotify Scio - Spotify’s Scala API around Apache Beam

• Typelevel Cats : Lightweight, modular, and extensible library for functional programming in Scala

• Verizon Quiver : A reasonable library for modelling multi-graphs in Scala

• Scala - The Scala Programming Language
References
• Apache Beam VLDB paper - Tyler Akidau et al @ Google

• Streaming 101

• Streaming 102

• Beam vs Spark

• Hierarchical Scheduling for Diverse Datacenter Workloads : Bhattacharya, Ali Ghodsi, Ion Stoica et al

• Beam Comparison : Data Artisans

• Dataflow/Beam & Spark : A programming model comparison : Tyler Akidau et al @ Google

• Dataflow Pipeline Execution Parameters

Building a modern data platform with scala, akka, apache beam
