Joining Large Data at Scale

ETL Pipeline and Joining Large Datasets
Harsha Tenneti

Contents
● ETL Pipeline
● Fault Tolerance
● Joins in DataFrames
● Problem statement
● Issues
● Steps to solve issues

ETL Pipeline
[Diagram: the Data Manager and the four pipeline modules: Ingestor, Joiner, Wrangler, Validator]

Fault Tolerance
● All the modules are stateless; the Data Manager assigns jobs to each of them.
● The Data Manager holds the state of the entire pipeline in MySQL.
● Each job has a timeout, so a job that fails is started again.
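
The timeout-and-restart behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the Data Manager's actual code: `JobRetrySketch`, `runWithRetry`, `maxAttempts` and `timeout` are hypothetical names, and the slides do not say how many times a job is restarted. Persisting job state to MySQL is left out.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical helper, not taken from the slides.
object JobRetrySketch {
  // Runs `job`, waiting at most `timeout` per attempt. A failed or timed-out
  // attempt is started again until `maxAttempts` attempts have been made.
  def runWithRetry[A](maxAttempts: Int, timeout: FiniteDuration)(job: () => A): Try[A] = {
    // One attempt: run the job on another thread and wait for it.
    def attemptOnce(): Try[A] = Try(Await.result(Future(job()), timeout))

    var result = attemptOnce()
    var attempts = 1
    while (result.isFailure && attempts < maxAttempts) {
      result = attemptOnce()
      attempts += 1
    }
    result
  }
}
```

Because the modules are stateless, restarting a job is safe as long as the job itself is idempotent. Note that a timed-out `Future` is not cancelled here; it is simply abandoned.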

Joins
● A join needs matching keys from each dataset to be in the same partition.
● If the two datasets do not share the same partitioner, the data must be
shuffled so that identical keys from both datasets land in the same partition.
● Two of the join strategies used for DataFrames are sort-merge joins and
broadcast joins.
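
As a rough illustration of the two strategies, the sketch below joins a large DataFrame to a small one with and without a broadcast hint. The object and method names are hypothetical. `broadcast` is the hint function from `org.apache.spark.sql.functions`; without a hint, Spark broadcasts a side automatically only when its estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default) and otherwise typically falls back to a sort-merge join.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hypothetical helper names, for illustration only.
object JoinStrategySketch {
  // Left outer join with a broadcast hint: the small side is copied to every
  // executor, so the large side does not need to be shuffled.
  def broadcastLeftJoin(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(broadcast(small), Seq(key), "left_outer")

  // The same join without a hint: Spark chooses the strategy itself.
  def plainLeftJoin(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(small, Seq(key), "left_outer")
}
```

Calling `.explain()` on either result shows which physical join Spark actually picked.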

Problem Statement
● We need a left outer join of 12 datasets (A1…A12), 10 of which are below
10 MB and 2 of which are between 25 and 30 MB, with a dataset B of around
50 GB, on approximately 8 cores.
B.join(A1...A12, "left_outer")
● After the join, we need to do a groupBy and then select one row from each group.
● All files are in Parquet format.

Issues
● The datasets (A1…A12) have to be joined to B one at a time, so this is
actually 12 joins.
● After the groupBy, working on each group to select a row leads to
out-of-memory exceptions, because each row is very large.

Steps to solve issues
● Divide the large dataset B into chunks of 500 MB, say (B1...Bn). This ensures
that the join and the groupBy each work on only 500 MB of data at a time.
● Sort each chunk (B1...Bn) by the join keys, so that all rows with the same
key in the big dataset reside in the same partition.
● Join each 500 MB chunk with the 12 datasets (A1...A12).
val joinedDF = allEventsDF.foldLeft(sortedBaseSourceDF)((x, y) =>
  x.join(y._2, getJoinColumnExpression(x, y._2, joinKeys, y._1), "left_outer"))
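
The fold over the small datasets can be sketched in a self-contained form as below. The slides do not show `getJoinColumnExpression`, so this sketch swaps in Spark's join-by-column-names overload instead; `FoldJoinSketch` and `joinAll` are hypothetical names, and the sketch assumes every small dataset carries the same key columns and that non-key column names do not clash.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper; a stand-in for the snippet above, not the author's code.
object FoldJoinSketch {
  // Starts from one sorted chunk of B and left-outer-joins each small dataset
  // onto the running result, one join per dataset.
  def joinAll(baseChunk: DataFrame, smallSets: Seq[DataFrame], joinKeys: Seq[String]): DataFrame =
    smallSets.foldLeft(baseChunk) { (acc, small) =>
      // Joining on column names keeps a single copy of each key column,
      // so the next iteration can refer to the keys without ambiguity.
      acc.join(small, joinKeys, "left_outer")
    }
}
```

With 12 small datasets this still runs 12 joins, as noted under Issues; the fold only keeps the code compact.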

Contd...
● The next task is to do a groupBy on each joined 500 MB chunk.
● Because working on entire rows gave us out-of-memory exceptions, we added a
hash code column to the joined dataset and then selected only the required
columns along with the hash code.
● We do a mapPartitions on the joined dataset and take an iterator of 100 rows
at a time from each partition.
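
A possible shape for these two steps is sketched below. It rests on several assumptions not stated in the slides: the hash is computed with Spark's built-in `hash` function (a 32-bit Murmur3 hash, so collisions are possible on very large inputs), the column is named `hashCol` as in the later join snippet, the grouping key is a single non-null column, and all helper names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, hash}

// Hypothetical helpers, for illustration only.
object NarrowRowsSketch {
  // Adds a hash over all columns so that a wide row can be found again later.
  def withHash(joined: DataFrame): DataFrame =
    joined.withColumn("hashCol", hash(joined.columns.map(col): _*))

  // Keeps only the columns needed to choose a row, plus the hash.
  def narrow(hashed: DataFrame, needed: Seq[String]): DataFrame =
    hashed.select((needed :+ "hashCol").map(col): _*)

  // Walks each partition 100 rows at a time and emits (groupKey, (hash, row)).
  def toPairs(narrowDF: DataFrame, keyCol: String): RDD[(String, (Int, Row))] =
    narrowDF.rdd.mapPartitions { rows =>
      rows.grouped(100).flatMap { batch =>
        batch.map(r => (r.getAs[Any](keyCol).toString, (r.getAs[Int]("hashCol"), r)))
      }
    }
}
```

Going through `.rdd` is needed on Spark 2.x and later, where `DataFrame.mapPartitions` returns a Dataset rather than an RDD.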

Contd...
● As we work on only 100 rows at a time, we do an aggregateByKey. It has a
combining stage, which combines values for the same key across the 100-row
chunks within a partition, and a merging stage, which combines values for the
same key across partitions.
val allEventsResponseRDD = reqDF.mapPartitions(makingATuple)
  .aggregateByKey(List[(Int, Row)]())((x, y) => (y._1, y._2) :: x, reduceListFunc)
● We then join this result back to the full joined dataset on hashCol to
recover all the other columns.
val allEventsResponseFullDF = rowWithHashDF.join(allEventsResponseDF,
  rowWithHashDF("hashCol") === allEventsResponseDF("hashCol"), "inner")
  .drop(allEventsResponseDF("hashCol"))
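
The two stages of aggregateByKey map onto its two function arguments, as the sketch below tries to show. The slides do not state how the row is chosen from a group, so the rule here (keep the candidate with the largest timestamp) is purely an assumption, as are the names `PickPerKeySketch`, `Candidate`, `collectPerKey` and `pickLatest`. Each value is a pair of the row's hash and a timestamp.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helpers; the selection rule is an assumption.
object PickPerKeySketch {
  // (hashCol, timestamp) for one candidate row.
  type Candidate = (Int, Long)

  // First function: folds one value into the partition-local list for its key.
  // Second function: merges the lists built for the same key in different partitions.
  def collectPerKey(pairs: RDD[(String, Candidate)]): RDD[(String, List[Candidate])] =
    pairs.aggregateByKey(List.empty[Candidate])(
      (acc, v) => v :: acc,
      (left, right) => left ::: right
    )

  // Assumed rule: keep the candidate with the largest timestamp in each group.
  def pickLatest(pairs: RDD[(String, Candidate)]): RDD[(String, Candidate)] =
    collectPerKey(pairs).mapValues(_.maxBy(_._2))
}
```

Only the small (hash, timestamp) pairs travel through the shuffle; the wide rows are fetched afterwards by joining on the hash, as in the snippet above.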

Contd...
● We now have one resultant dataset per chunk, (C1…Cn), matching the chunks
(B1…Bn) of B.
● We union all the datasets C1…Cn to get the final dataset D.
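
The final union can be written as a reduce over the per-chunk results. This is a small sketch with hypothetical names; it assumes at least one chunk exists and that all chunk results share the same column order, since `union` matches columns by position and keeps duplicates. On Spark 1.x the equivalent method is `unionAll`.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, for illustration only.
object UnionChunksSketch {
  // Combines the per-chunk results C1...Cn into the final dataset D.
  def unionAll(chunkResults: Seq[DataFrame]): DataFrame =
    chunkResults.reduce(_ union _)
}
```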
