ETL, Pivoting and Handling the Small File Problem in Spark
Extracting data, applying several transformations, and finally loading the summarized data into Hive
is the most important part of Data Warehousing. In Spark we face various kinds of problems
while developing basic Data Quality checks, so it is always
advisable to pass the data through custom Data Quality checking steps such as:
1. Null checking in string fields
2. Null checking in numeric fields
3. Alpha-numeric characters in numeric fields
4. Data type selection on the basis of future requirements
5. Data format conversion (most important)
6. Data filtering
7. Address, SSN, telephone, email ID validation, etc.
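The first three checks above can be sketched as small helper functions. The names checkStrNull and checkNumericNull match the ones used in the RDD code later in this article, but their bodies here, including the fallback values "NA" and "0", are assumptions for illustration:

```scala
// Hypothetical data-quality helpers; the substitution values are assumptions.
// checkStrNull: replaces null/empty strings with a marker value.
def checkStrNull(field: String): String =
  if (field == null || field.trim.isEmpty) "NA" else field

// checkNumericNull: replaces null, empty, or non-numeric values with "0"
// so a later .toInt / .toDouble cannot throw NumberFormatException.
def checkNumericNull(field: String): String =
  if (field == null || field.trim.isEmpty || !field.trim.matches("-?\\d+(\\.\\d+)?")) "0"
  else field
```

Applied per column before conversion, empty or alpha-numeric values in numeric fields fall back to a safe default instead of failing the job.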
In the transformation phase, Spark demands many User Defined Functions as the
requirements grow more complex.
Typical transformations include:
1. Aggregation
2. Routing
3. Normalization
4. De-Normalization
5. Intelligent Counter
6. Lookup
The load phase writes your temporary table into Hive, HBase, or Cassandra, after which any
visualization tool can present the outcome.
This article also looks into another really important aspect: handling small files in Spark.
A good rule of thumb to keep in mind is: “Don’t let your partition size grow too large (greater
than 2 GB), and don’t make it too small either, which causes scheduling overhead.”
My data source consists of many small files, so do look at this step:
Now the execution plan itself shows the beauty of this hack and the efficient use of a broadcast
variable in Spark.
This definitely reduces your I/O overhead and provides a better result
in terms of performance.
So the data source is something like this:
The schema goes like this:
This data has various null problems, so we need to write custom functions at the
RDD level to clean and format it.
Another problem with this data is that the date format was not consistent throughout the file:
sometimes dd/mm/yyyy and sometimes dd-mm-yyyy. So a serious amount of Data
Quality and conversion checking was required.
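One way to reconcile the two date formats mentioned above is a small helper that rewrites the separator before parsing. This is a sketch using java.time; it assumes day-first dates and a yyyy-MM-dd target format, neither of which the article states explicitly:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Accepts both dd/mm/yyyy and dd-mm-yyyy by normalizing the separator,
// then re-emits a single canonical form (yyyy-MM-dd, an assumption).
val inFmt  = DateTimeFormatter.ofPattern("dd-MM-yyyy")
val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def normalizeDate(raw: String): String =
  LocalDate.parse(raw.trim.replace('/', '-'), inFmt).format(outFmt)
```

With this, normalizeDate("15/08/2016") and normalizeDate("15-08-2016") produce the same canonical value.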
val dataRDD = data.map(line => line.split(","))
  .map(line => ScoreRecord(
    checkStrNull(line(0)).trim,
    checkStrNull(line(1)).trim,
    checkStrNull(line(2)).trim,
    checkStrNull(line(3)).trim,
    checkStrNull(line(4)).trim,
    checkStrNull(line(5)).trim,
    checkNumericNull(line(6)).trim.toInt,
    checkNumericNull(line(7)).trim.toDouble,
    checkNumericNull(line(8)).trim.toInt,
    checkNumericNull(line(9)).trim.toDouble,
    checkNumericNull(line(10)).trim.toInt))
This performs the required conversion and checking.
Next I developed Spark SQL UDFs to handle the data conversion problem, so my code goes
like this:
df.registerTempTable("cricket_data")

val result = sqlContext.sql("""
  SELECT name, year,
         CASE WHEN month IN (10,11,12) THEN 'Q4'
              WHEN month IN (7,8,9)    THEN 'Q3'
              WHEN month IN (4,5,6)    THEN 'Q2'
              WHEN month IN (1,2,3)    THEN 'Q1'
         END AS Quarter,
         run_scored
  FROM (SELECT name,
               year(convert(REPLACE(date_of_match,'/','-')))  AS year,
               month(convert(REPLACE(date_of_match,'/','-'))) AS month,
               run_scored
        FROM cricket_data) C""")
convert and REPLACE are custom UDFs written for this job.
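The article does not show how these UDFs are defined. A plausible sketch, assuming convert parses a dd-mm-yyyy string into a SQL DATE and REPLACE substitutes one substring for another, looks like this (the names come from the query above; the bodies are assumptions). The registration calls are left as comments so the functions can be read without a SparkContext:

```scala
import java.sql.Date
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Assumed body of the REPLACE UDF: plain substring substitution.
def replaceUdf(s: String, from: String, to: String): String = s.replace(from, to)

// Assumed body of the convert UDF: dd-MM-yyyy string -> java.sql.Date,
// so Spark SQL's year() and month() can be applied to it.
def convertUdf(s: String): Date =
  Date.valueOf(LocalDate.parse(s.trim, DateTimeFormatter.ofPattern("dd-MM-yyyy")))

// sqlContext.udf.register("REPLACE", replaceUdf _)
// sqlContext.udf.register("convert", convertUdf _)
```

Registering them under these names lets the SQL string above call them like built-in functions.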
Now this query gives me a result like this:
In Data Warehouse terms this is still very inefficient, because business users
demand summarized data with full visibility across the whole time range.
In ETL tools we use a component called a “De-Normalizer” [in Informatica].
It requires transformations like:
The Aggregator has a sorter, which sorts the data first and then applies the aggregation.
These are costly transformations in ETL terms: with a data volume of 1 billion rows,
they suffer badly due to inefficient caching and data mapping.
Spark gives a brilliant solution to pivot the data in a single line:
val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored"))
This pivots and transposes a huge volume of data within a few minutes. (Strictly speaking,
groupBy(...).pivot(...).agg(...) is a transformation; the work runs when an action is triggered.)
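What pivot computes can be illustrated on plain Scala collections, independent of Spark: group by (name, year), then spread the Quarter values into columns, summing run_scored per cell. The sample rows here are made up for illustration:

```scala
// (name, year, quarter, run_scored) rows; the values are illustrative only.
val rows = Seq(
  ("Sachin", 2010, "Q1", 120),
  ("Sachin", 2010, "Q1", 80),
  ("Sachin", 2010, "Q3", 45),
  ("Dhoni",  2010, "Q2", 60)
)

// Collection-level equivalent of
// result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored")):
// one row per (name, year), one column per distinct quarter, cells summed.
val pivoted: Map[(String, Int), Map[String, Int]] =
  rows.groupBy { case (n, y, _, _) => (n, y) }
      .map { case (key, rs) =>
        key -> rs.groupBy(_._3).map { case (q, qs) => q -> qs.map(_._4).sum }
      }
```

Here pivoted(("Sachin", 2010)) yields Map("Q1" -> 200, "Q3" -> 45): the two Q1 scores are summed, and quarters with no data are simply absent (Spark would emit nulls for those columns instead).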
The data goes like this:
Explain Plan for the Query
Explain Plan for the Pivot
We load this summarized data into Hive and show it to the end user. This is how my table got
stored in Hive.
Data in Hive