ETL, Pivoting and Handling Small-File Problems in Spark
Extracting data, applying several transformations, and finally loading the summarized data
into Hive is the most important part of data warehousing. Spark raises various problems
when you develop even basic data quality checks, so it is always advisable to pass the
data through custom data quality checking steps such as:
1. Null Checking in String Field
2. Null checking in Numeric Field
3. Alphanumeric characters in numeric field
4. Data type selection on the basis of future requirements
5. Data format conversion (most important)
6. Filter Data
7. Address, SSN, Telephone, Email id validation etc.
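A few of these checks can be sketched as plain Scala predicates, independent of Spark. This is only an illustrative sketch: the object name `DataQuality`, the function names, and the exact patterns are assumptions, not the functions used in the original job, and the email check is deliberately loose.

```scala
object DataQuality {
  // Checks 1 and 2: treat null, empty, and whitespace-only values as missing.
  def isMissing(value: String): Boolean =
    value == null || value.trim.isEmpty

  // Check 3: a numeric field may hold an optional sign, digits, and an optional fraction.
  private val NumericPattern = """[+-]?\d+(\.\d+)?""".r

  def isNumeric(value: String): Boolean =
    !isMissing(value) && NumericPattern.pattern.matcher(value.trim).matches()

  // Check 7: loose shape check only (something@something.tld), not full validation.
  private val EmailPattern = """[^@\s]+@[^@\s]+\.[^@\s]+""".r

  def isEmailLike(value: String): Boolean =
    !isMissing(value) && EmailPattern.pattern.matcher(value.trim).matches()
}
```

Predicates like these can be applied inside an RDD `filter` or `map`, or wrapped as UDFs, depending on where the check needs to run.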
In the transformation phase, Spark demands many user-defined functions as our
requirements grow more complex.
Typical transformations include:
1. Aggregation
2. Routing
3. Normalization
4. De-Normalization
5. Intelligent Counter
6. Lookup
The load phase puts your temporary table into Hive, HBase, or Cassandra, after which any
visualization tool can be used to show the outcome.
This article also looks into another important aspect: small-file handling in Spark.
The rule of thumb to keep in mind is: “Don’t let your partition volume get too high
(greater than 2 GB), and don’t make it too small either, which will cause overhead problems.”
My data source consists of many small files, so look at this step:
The execution plan itself shows the beauty of this hack and the efficient use of broadcast
variables in Spark.
This reduces the I/O overhead of reading many small files and gives a better result
in terms of performance.
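The original step is not reproduced in this text. As a separate illustration of the partition-size rule quoted above, and not the author's hack, the following sketch computes how many partitions a given input volume needs. The 128 MB target, the object name `PartitionSizing`, and the path in the usage comment are all assumptions.

```scala
object PartitionSizing {
  // Hypothetical target: about 128 MB per partition, well below the 2 GB ceiling.
  val DefaultTargetBytes: Long = 128L * 1024 * 1024

  // Number of partitions needed so that each holds roughly targetBytes of input.
  def targetPartitions(totalBytes: Long, targetBytes: Long = DefaultTargetBytes): Int = {
    require(totalBytes >= 0 && targetBytes > 0, "sizes must be non-negative, target positive")
    val needed = (totalBytes + targetBytes - 1) / targetBytes // ceiling division
    math.max(1L, needed).toInt
  }
}

// Usage inside a Spark job (not executed here; `sc`, the path, and
// `totalInputBytes` are assumptions):
//   val raw = sc.textFile("/data/scores/*.csv")
//   val compacted = raw.coalesce(PartitionSizing.targetPartitions(totalInputBytes))
```

Note that `coalesce` only merges partitions; if the computed number is larger than the current partition count, the RDD is left as it is.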
So the data source is something like this:
The schema goes like this:
This data has various null problems, so we need to create custom functions at the
RDD level to format the data.
Another problem is that the date format was not the same throughout the file:
in some places it is dd/mm/yyyy and in others dd-mm-yyyy. So a serious amount of data
quality and conversion checking was required.
val dataRDD = data
  .map(line => line.split(","))
  .map(line => ScoreRecord(
    checkStrNull(line(0)).trim, checkStrNull(line(1)).trim,
    checkStrNull(line(2)).trim, checkStrNull(line(3)).trim,
    checkStrNull(line(4)).trim, checkStrNull(line(5)).trim,
    checkNumericNull(line(6)).trim.toInt, checkNumericNull(line(7)).trim.toDouble,
    checkNumericNull(line(8)).trim.toInt, checkNumericNull(line(9)).trim.toDouble,
    checkNumericNull(line(10)).trim.toInt))
This has the required conversion and checking.
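The definitions of `checkStrNull`, `checkNumericNull`, and `ScoreRecord` are not shown in the source. A minimal sketch of what such helpers might look like follows, under loud assumptions: the placeholder values `"NA"` and `"0"`, the reduced three-field record `MiniScore`, and the `parseLine` function are all hypothetical.

```scala
object ScoreParsing {
  // Hypothetical stand-in for ScoreRecord, reduced to three fields.
  final case class MiniScore(name: String, dateOfMatch: String, runScored: Int)

  // Assumed behaviour: replace a missing string with a visible placeholder.
  def checkStrNull(value: String): String =
    if (value == null || value.trim.isEmpty) "NA" else value

  // Assumed behaviour: replace a missing or non-numeric value with "0"
  // so that the later numeric conversion cannot throw.
  def checkNumericNull(value: String): String =
    if (value != null && value.trim.matches("""[+-]?\d+(\.\d+)?""")) value else "0"

  def parseLine(line: String): MiniScore = {
    // limit -1 keeps trailing empty fields, which split(",") alone would drop;
    // padTo guards against rows with fewer than three columns.
    val cols = line.split(",", -1).padTo(3, "")
    MiniScore(
      checkStrNull(cols(0)).trim,
      checkStrNull(cols(1)).trim,
      // Going through Double tolerates "45.0" in an integer column (it truncates).
      checkNumericNull(cols(2)).trim.toDouble.toInt
    )
  }
}
```

Padding short rows is a design choice of this sketch; the original `line(10)` access would throw on a row with fewer than eleven columns.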
I then developed Spark SQL UDFs to handle the data conversion problem, so my code goes
like this:
df.registerTempTable("cricket_data")
val result = sqlContext.sql("""
  select name, year,
         case when month in (10,11,12) then 'Q4'
              when month in (7,8,9) then 'Q3'
              when month in (4,5,6) then 'Q2'
              when month in (1,2,3) then 'Q1' end Quarter,
         run_scored
  from (select name,
               year(convert(REPLACE(date_of_match,'/','-'))) as year,
               month(convert(REPLACE(date_of_match,'/','-'))) as month,
               run_scored
        from cricket_data) C""")
Convert and REPLACE are custom UDFs for this job.
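The bodies of these two UDFs are not given in the source. One plausible shape, written as plain Scala functions that could be registered with Spark SQL, is sketched here. The object name `DateUdfs`, the choice of `java.sql.Date` as the return type, the null-on-failure behaviour, and the `quarterOf` helper are assumptions.

```scala
import java.sql.Date
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object DateUdfs {
  private val DayFirst = DateTimeFormatter.ofPattern("dd-MM-yyyy")

  // Plain string replacement, mirroring REPLACE(date_of_match, '/', '-').
  def replace(value: String, target: String, replacement: String): String =
    if (value == null) null else value.replace(target, replacement)

  // Parses a dd-MM-yyyy string into java.sql.Date; returns null when the
  // input is missing or unparseable, so bad rows become NULL dates.
  def convert(value: String): Date = {
    if (value == null) null
    else {
      try Date.valueOf(LocalDate.parse(value.trim, DayFirst))
      catch { case _: DateTimeParseException => null }
    }
  }

  // Same mapping as the CASE expression in the query: months 1-3 -> Q1, and so on.
  def quarterOf(month: Int): String = {
    require(month >= 1 && month <= 12, s"invalid month: $month")
    "Q" + ((month - 1) / 3 + 1)
  }
}

// Registration sketch (needs a live sqlContext, so it is left as a comment):
//   sqlContext.udf.register("REPLACE", DateUdfs.replace _)
//   sqlContext.udf.register("convert", DateUdfs.convert _)
```

Normalizing the separator first means a single formatter covers both dd/mm/yyyy and dd-mm-yyyy inputs.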
Now this query gives me a result like this:
From a data warehouse point of view, this data is very inefficient, because the business
user demands summarized data with full visibility across the whole time range.
In a traditional ETL tool we would use a component called “De-Normalizer” (in Informatica).
It requires transformations like:
The Aggregator has a sorter, which sorts the data first and then performs the aggregation.
These are costly transformations in ETL terms. With a data volume of 1 billion,
the job suffers badly because of less efficient caching and data mapping.
Spark gives a brilliant solution: pivot the data in a single line:
val result_pivot = result.groupBy("name","year").pivot("Quarter").agg(sum("run_scored"))
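This single call pivots the data and transposes a huge volume of it within a few
minutes. Strictly speaking it is a transformation that returns a new DataFrame rather
than an action, although pivot without an explicit value list does run a job up front
to collect the distinct Quarter values.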
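To make the shape of the pivoted result concrete, the same group, pivot, and sum logic can be sketched on an ordinary Scala collection. This is an illustration of what the one-liner computes, not how Spark executes it; the `Innings` case class and the `PivotSketch` object are hypothetical names.

```scala
object PivotSketch {
  // One input row per innings.
  final case class Innings(name: String, year: Int, quarter: String, runScored: Int)

  // Groups by (name, year) and sums runs per quarter. A quarter with no rows is
  // absent from the inner map, which corresponds to NULL in the pivoted DataFrame.
  def pivot(rows: Seq[Innings]): Map[(String, Int), Map[String, Int]] =
    rows
      .groupBy(r => (r.name, r.year))
      .map { case (key, group) =>
        val perQuarter = group
          .groupBy(_.quarter)
          .map { case (q, rs) => q -> rs.map(_.runScored).sum }
        key -> perQuarter
      }
}
```

In Spark itself, passing the expected values explicitly, as in `pivot("Quarter", Seq("Q1", "Q2", "Q3", "Q4"))`, avoids the extra pass that discovers the distinct quarters and fixes the column order.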
The data goes like this:
Explain Plan for the Query
Explain Plan for the Pivot
We load this summarized data into Hive and show it to the end user. This is how my table
got stored in Hive:
Data in Hive
A very simple way to handle ETL in Spark! 
