4. Fault Tolerance
● All the modules are stateless; the Data Manager assigns jobs to all the modules.
● The Data Manager holds the state of the entire pipeline in MySQL.
● Each job has a timeout, so that if a job fails it is started again (see the sketch below).
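● A minimal sketch of the per-job timeout-and-restart behaviour (the helper name and retry policy are assumptions, not the actual Data Manager code):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

// Run a job with a timeout; if it fails or times out, start it again,
// up to maxRetries attempts.
def runWithTimeoutAndRetry[T](job: () => T, timeout: FiniteDuration, maxRetries: Int): T = {
  def attempt(remaining: Int): T =
    Try(Await.result(Future(job()), timeout)) match {
      case Success(result)             => result
      case Failure(_) if remaining > 0 => attempt(remaining - 1) // restart the job
      case Failure(e)                  => throw e
    }
  attempt(maxRetries)
}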
5. Joins
● Joins need the keys from each dataset to be in the same partition.
● If both datasets don't have the same partitioner, then we need to shuffle the
data, which makes sure the same keys across datasets land in the same partition.
● A couple of join strategies used with DataFrames are sort-merge and broadcast
joins (see the sketch below).
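● A minimal sketch of the two strategies (assuming a Spark 2.x SparkSession; paths and the join key are placeholders):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("join-strategies").getOrCreate()
val big   = spark.read.parquet("/path/to/B")   // large dataset
val small = spark.read.parquet("/path/to/A1")  // small dataset

// Sort-merge join (the default for two large inputs): both sides are shuffled and
// sorted so that matching keys land in the same partition.
val sortMergeJoined = big.join(small, Seq("joinKey"), "left_outer")

// Broadcast join: the small side is copied to every executor, so the large side
// is not shuffled at all.
val broadcastJoined = big.join(broadcast(small), Seq("joinKey"), "left_outer")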
6. Problem Statement
● Need to do a left outer join of 12 datasets (A1…A12), of which 10 are below
10 MB in size and 2 are between 25-30 MB, with a dataset (B) which is around
50 GB, using approx 8 cores.
B.join(A1 … A12, "left_outer")
● After the join, we need to do a groupBy and then select a row from each group
(a naive version of the whole requirement is sketched below).
● All files are in Parquet format.
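● A naive version of the requirement, just to make it concrete (it reuses spark from the earlier sketch; paths, the join key, the group key and the ordering column are placeholders, and this is not the approach actually taken, as it runs into the issues on the next slide):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val b  = spark.read.parquet("/path/to/B")
val as = (1 to 12).map(i => spark.read.parquet(s"/path/to/A$i"))

// Chain the 12 left outer joins onto B.
val joined = as.foldLeft(b)((acc, a) => acc.join(a, Seq("joinKey"), "left_outer"))

// Pick one row per group, e.g. the first row when ordered by some column.
val w = Window.partitionBy("groupKey").orderBy(col("someOrderCol"))
val selected = joined
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")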
7. Issues
● We have to join the datasets (A1…A12) to B one by one, so it is actually 12
joins.
● After doing a groupBy, working on the whole group to select a row leads to
out-of-memory exceptions, as each joined row is very large.
8. Steps to solve issues
● Divide the large dataset B into chunks of 500 MB, say (B1…Bn). This makes sure
that we are joining, and solving the groupBy issue for, only a 500 MB file at a
time (a chunking sketch follows the join snippet below).
● Sort each dataset in (B1…Bn) by the join keys, which makes sure the unique
keys of the big dataset reside in the same partition.
● Join each 500 MB chunk with the other 12 datasets (A1…A12).
val joinedDF = allEventsDF.foldLeft(sortedBaseSourceDF) { (x, y) =>
  x.join(y._2, getJoinColumnExpression(x, y._2, joinKeys, y._1), "left_outer")
}
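● A hedged sketch of the chunking and sorting steps (the original does not show how the 500 MB chunks are produced, so the randomSplit-based split, the paths and the key names below are assumptions):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.col

// Estimate how many ~500 MB chunks B needs and split it; the real pipeline may
// instead split B's parquet files directly into 500 MB groups.
val fs            = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val bytesPerChunk = 500L * 1024 * 1024
val totalBytes    = fs.getContentSummary(new Path("/path/to/B")).getLength
val numChunks     = math.max(1, (totalBytes / bytesPerChunk).toInt)

val joinKeys = Seq("key1", "key2") // placeholder join key names
val chunks = spark.read.parquet("/path/to/B")
  .randomSplit(Array.fill(numChunks)(1.0))
  .map { chunk =>
    // Cluster each chunk Bi by the join keys so identical keys share a partition.
    chunk.repartition(joinKeys.map(col): _*).sortWithinPartitions(joinKeys.map(col): _*)
  }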
9. Contd...
● Now the task is to do a groupBy on each 500 MB chunk of joined data.
● Since working on the entire row was giving us out-of-memory exceptions, we
added a hashcode column to the joined dataset and then selected only the
required columns along with the hashCode.
● We do a mapPartitions on the joined dataset and take an iterator of 100 rows
at a time from each partition (one possible version of this step is sketched below).
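● A hedged sketch of this step (the column names are placeholders, functions.hash assumes Spark 2.x, and the real makingATuple is not shown in the original, so this is only one possible shape for it):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, hash}

// Tag every joined row with a hash column and keep only the columns needed to
// pick a row per group.
val rowWithHashDF = joinedDF.withColumn("hashCol", hash(joinedDF.columns.map(col): _*))
val reqDF         = rowWithHashDF.select("groupKey", "someOrderCol", "hashCol")

// One possible shape for makingATuple: walk each partition 100 rows at a time
// and emit (groupKey, (orderValue, row)) pairs for the aggregation that follows.
def makingATuple(rows: Iterator[Row]): Iterator[(String, (Int, Row))] =
  rows.grouped(100).flatMap { chunk =>
    chunk.map(r => (r.getAs[String]("groupKey"), (r.getAs[Int]("someOrderCol"), r)))
  }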
10. Contd...
● As we work on only 100 rows at a time, we do an aggregateByKey, which has a
combining stage that combines the same keys across the 100-row chunks and a
merging stage that combines the same keys across partitions (a possible
reduceListFunc is sketched below).
val allEventsResponseRDD = reqDF.mapPartitions(makingATuple)
  .aggregateByKey(List[(Int, Row)]())((x, y) => (y._1, y._2) :: x, reduceListFunc)
● We join the resultant dataset back with the original joined dataset on hashCol
to get all the other columns.
val allEventsResponseFullDF = rowWithHashDF.join(allEventsResponseDF,
    rowWithHashDF("hashCol") === allEventsResponseDF("hashCol"), "inner")
  .drop(allEventsResponseDF("hashCol"))
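● reduceListFunc is not shown in the original; a minimal version (it would be defined before the aggregateByKey call above), plus a placeholder rule for picking one row per key, could look like:
import org.apache.spark.sql.Row

// Merge the per-partition lists that belong to the same key.
val reduceListFunc: (List[(Int, Row)], List[(Int, Row)]) => List[(Int, Row)] =
  (left, right) => left ::: right

// Placeholder selection rule: keep the row with the smallest Int value per key.
val selectedPerKey = allEventsResponseRDD.mapValues(entries => entries.minBy(_._1)._2)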
11. Contd...
● Now we get resultant datasets (c1…cn), one for each of the chunks (B1…Bn) of B.
● We do a union of all datasets c1…cn and get the final dataset D (see the sketch below).
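● A minimal sketch of the final step (chunkResults is assumed to hold the per-chunk result DataFrames c1…cn; union is the Spark 2.x name, use unionAll on 1.x):
// Combine the per-chunk results back into the final dataset D and persist it.
val finalDF = chunkResults.reduce(_ union _)
finalDF.write.parquet("/path/to/D")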