ETL PIPELINE AND JOINING LARGE DATASETS
- Harsha Tenneti
Contents
● ETL Pipeline
● Fault Tolerance
● Joins in Dataframe
● Problem statement
● Issues
● Steps to solve issues
ETL Pipeline
[Diagram: Data Manager with four modules: Ingestor, Joiner, Wrangler, Validator]
Fault Tolerance
● All the modules are stateless; the Data Manager assigns jobs to all the modules.
● The Data Manager holds the state of the entire pipeline in MySQL.
● Each job has a timeout, so that if a job fails it is started again.
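The timeout-and-restart idea can be sketched in a few lines of plain Scala. This is only an illustrative model, not the pipeline's actual code: the `Job` record, the status names and the function names are all assumptions, and an in-memory `Map` stands in for the MySQL table that holds the state.

```scala
// Hypothetical model of the Data Manager's job table.
// In the slides the state lives in MySQL; an immutable Map stands in for it here.
object DataManagerSketch {
  sealed trait Status
  case object Pending extends Status
  case object Running extends Status
  case object Done extends Status

  // startedAtMs is only meaningful while the job is Running.
  final case class Job(id: String, module: String, status: Status, startedAtMs: Long)

  // Hand a pending job to a (stateless) module and record when it started.
  def start(jobs: Map[String, Job], id: String, nowMs: Long): Map[String, Job] =
    jobs.get(id) match {
      case Some(j) if j.status == Pending =>
        jobs.updated(id, j.copy(status = Running, startedAtMs = nowMs))
      case _ => jobs
    }

  // Mark a running job as finished.
  def complete(jobs: Map[String, Job], id: String): Map[String, Job] =
    jobs.get(id) match {
      case Some(j) if j.status == Running => jobs.updated(id, j.copy(status = Done))
      case _ => jobs
    }

  // A job that has been Running for longer than timeoutMs is assumed to have
  // failed and is put back to Pending, so that it will be started again.
  def requeueTimedOut(jobs: Map[String, Job], nowMs: Long, timeoutMs: Long): Map[String, Job] =
    jobs.map { case (id, j) =>
      if (j.status == Running && nowMs - j.startedAtMs > timeoutMs) id -> j.copy(status = Pending)
      else id -> j
    }
}
```

Because the modules keep no state of their own, re-running a requeued job needs nothing beyond the row in the job table.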
Joins
● A join needs matching keys from each dataset to be in the same partition.
● If the two datasets do not have the same partitioner, the data has to be shuffled so that equal keys from both datasets end up in the same partition.
● Two join strategies used with DataFrames are sort-merge join and broadcast join.
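Why a shared partitioner matters can be shown with a small plain-Scala model. It uses no Spark API; `partitionOf` and `partition` are hypothetical helpers that follow the usual hash-partitioning rule (a non-negative modulo of the key's hash code, the same idea Spark's HashPartitioner is built on).

```scala
object PartitioningSketch {
  // Non-negative modulo of the key's hashCode: the usual hash-partitioning rule.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Distribute (key, value) records into buckets indexed by partition number.
  def partition[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

When both datasets are partitioned with the same function and the same number of partitions, a key found in partition p of one dataset can only be in partition p of the other, so each partition pair can be joined locally. If the partitioners differ, that guarantee is lost and a shuffle is needed to restore it.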
Problem Statement
● Need to do a left outer join of a dataset B, which is around 50 GB, with 12 datasets (A1...A12), of which 10 are below 10 MB and 2 are between 25 and 30 MB, using approximately 8 cores.
B.join(A1...A12, "left_outer")
● After the join, need to do a groupBy and then select one row from each group.
● All files are in Parquet format.
Issues
● The datasets (A1...A12) have to be joined to B one at a time, so it is actually 12 joins.
● After the groupBy, working on each group to select a row leads to out-of-memory exceptions, because each row is very large.
Steps to solve issues
● Divide the large dataset B into chunks of 500 MB each, say (B1...Bn). This makes sure that the join and the groupBy work on only 500 MB at a time.
● Sort each chunk (B1...Bn) by the join keys, which makes sure that rows of the big dataset with the same key reside in the same partition.
● Join each 500 MB chunk with the 12 datasets (A1...A12).
// Fold over the small datasets, left-outer-joining each one onto the running result.
// y._2 is the DataFrame to join; y._1 is passed on to getJoinColumnExpression.
val joinedDF = allEventsDF.foldLeft(sortedBaseSourceDF)((x, y) =>
  x.join(y._2, getJoinColumnExpression(x, y._2, joinKeys, y._1), "left_outer"))
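The shape of that fold can be tried out without Spark. The sketch below is a plain-Scala model under stated assumptions: a row is a `Map[String, Any]`, the join is on a single key column, and `leftOuterJoin` and `joinAll` are hypothetical helpers, not the deck's `getJoinColumnExpression`.

```scala
object FoldJoinSketch {
  type Row = Map[String, Any]

  // Left outer join on one key column: every left row is kept. Columns of each
  // matching right row are merged in; a left row without a match passes through
  // unchanged (a real DataFrame would fill the missing columns with nulls).
  def leftOuterJoin(left: Seq[Row], right: Seq[Row], key: String): Seq[Row] = {
    val rightByKey: Map[Any, Seq[Row]] = right.groupBy(r => r(key))
    left.flatMap { l =>
      rightByKey.get(l(key)) match {
        case Some(matches) => matches.map(r => l ++ r)
        case None          => Seq(l)
      }
    }
  }

  // Fold the small datasets onto the base dataset, one join at a time.
  def joinAll(base: Seq[Row], smalls: Seq[Seq[Row]], key: String): Seq[Row] =
    smalls.foldLeft(base)((acc, small) => leftOuterJoin(acc, small, key))
}
```

For example, `joinAll(b, Seq(a1, a2), "id")` performs two joins in sequence, just as the fold over A1...A12 performs twelve.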
Contd...
● The next task is to do a groupBy on each joined 500 MB chunk.
● Because working on entire rows gave us out-of-memory exceptions, we added a hash code column to the joined dataset and then selected only the required columns along with the hash code.
● We do a mapPartitions on the joined dataset and take an iterator of 100 rows at a time from each partition.
Contd...
● As we work on only 100 rows at a time, we use aggregateByKey, which has a combining stage that combines the same keys across the 100-row chunks and a merging stage that combines the same keys across partitions.
val allEventsResponseRDD = reqDF
  .mapPartitions(makingATuple)
  .aggregateByKey(List[(Int, Row)]())(
    (x, y) => (y._1, y._2) :: x, // within a partition: prepend the (Int, Row) pair to the key's list
    reduceListFunc)              // across partitions: merge the lists built for the same key
● We join the aggregated result back to the joined dataset on the hash column to get all the other columns.
val allEventsResponseFullDF = rowWithHashDF
  .join(allEventsResponseDF, rowWithHashDF("hashCol") === allEventsResponseDF("hashCol"), "inner")
  .drop(allEventsResponseDF("hashCol"))
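The two stages of aggregateByKey described above can be modelled with plain Scala collections. This is a sketch of the semantics only, not Spark's implementation: the function below is a hypothetical stand-in with the same parameter roles (zero value, a within-partition function, a cross-partition function), and the `batchSize` parameter is an assumption added to mimic reading 100 rows at a time.

```scala
object AggregateByKeySketch {
  // seqOp folds one value into a key's accumulator (within a partition).
  // combOp merges two accumulators of the same key (across partitions).
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U, batchSize: Int)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Within each partition, read the rows in batches of batchSize and
    // fold every value into the accumulator for its key.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.iterator.grouped(batchSize).foldLeft(Map.empty[K, U]) { (acc, batch) =>
        batch.foldLeft(acc) { case (m, (k, v)) =>
          m.updated(k, seqOp(m.getOrElse(k, zero), v))
        }
      }
    }
    // Across partitions, merge the accumulators that share a key.
    perPartition.foldLeft(Map.empty[K, U]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, u)) =>
        a.updated(k, a.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }
}
```

A call such as `aggregateByKey(parts, List.empty[Int], 100)((acc, v) => v :: acc, _ ++ _)` collects all values per key into a list, which mirrors the shape of the snippet above: an empty list as the zero value, a prepend within partitions, and a list merge across them.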
Contd...
● This gives one resultant dataset per chunk: (C1...Cn) for the chunks (B1...Bn) of B.
● We do a union of all the datasets C1...Cn to get the final dataset D.
Questions?
Thank you
