Pig - Working with XML
Ram Kedem
Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent
ramkedem.com
Working with XML
• This presentation demonstrates how to process XML data using Pig.
• The sample data is taken from the Microsoft SQL Server Northwind database.
• This presentation is based on
  • hadoop-1.1.2
  • pig-0.11.1
Results may vary under different versions of Pig or Hadoop.
Sample Data
<orders>
<OrderID>10248</OrderID>
<CustomerID>VINET</CustomerID>
<EmployeeID>5</EmployeeID>
<OrderDate>1996-07-04T14:15:14.257</OrderDate>
<RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
<ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
<ShipCity>Reims</ShipCity>
<ShipCountry>France</ShipCountry>
</orders>
Load Data – Move into HDFS
Our first step is to move the XML file into HDFS, since we’re not working in local mode:
• hadoop fs -put ORDERS_XML.xml /user/hduser/orders/orders_xml.xml
Load XML Using Pig
Next, from the Pig shell, we’ll register Piggybank and load the XML using XMLLoader. The argument 'orders' names the element that delimits each record.
REGISTER '/home/hduser/Downloads/piggybank.jar' ;
LOAD_ORDERS = LOAD '/user/hduser/orders/orders_xml.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('orders')
    AS (mydoc:chararray);
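As a rough mental model of what the loader hands to Pig, the Python sketch below splits a document into one string per <orders> element. It is an approximation for illustration only, not XMLLoader's implementation; the function name split_records, the <root> wrapper, and the two-record input are hypothetical.

```python
import re

def split_records(xml_text, tag):
    """Return each <tag>...</tag> block as one string, in document order."""
    name = re.escape(tag)
    # DOTALL lets a record span several lines; .*? stops at the first closing tag.
    pattern = re.compile(rf"<{name}>.*?</{name}>", re.DOTALL)
    return pattern.findall(xml_text)

# Hypothetical two-record input; only OrderID is shown for brevity.
doc = (
    "<root>\n"
    "<orders>\n<OrderID>10248</OrderID>\n</orders>\n"
    "<orders>\n<OrderID>10249</OrderID>\n</orders>\n"
    "</root>"
)

records = split_records(doc, "orders")
for record in records:
    print(repr(record))  # each string plays the role of one mydoc value
```

Each resulting string corresponds to one row with a single chararray column, which is why the next step has to pull the individual fields out of the text.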
XML to Pig Structure
• Next we’ll translate the XML structure into a format Pig can understand.
• This phase involves two steps:
  • Use a regular expression to translate the XML structure into a Pig “table” (GENERATE FLATTEN)
  • Name each column in that table and give it a type (AS)
XML to Pig Structure
-- \\s* skips the whitespace between elements; each (.*) captures one value
CLEAN = FOREACH LOAD_ORDERS GENERATE
FLATTEN
(REGEX_EXTRACT_ALL(mydoc,
'<orders>\\s*<OrderID>(.*)</OrderID>\\s*<CustomerID>(.*)</CustomerID>\\s*<EmployeeID>(.*)</EmployeeID>\\s*<OrderDate>(.*)</OrderDate>\\s*<RequiredDate>(.*)</RequiredDate>\\s*<ShippedDate>(.*)</ShippedDate>\\s*<ShipCity>(.*)</ShipCity>\\s*<ShipCountry>(.*)</ShipCountry>\\s*</orders>'))
AS
(OrderID:chararray, CustomerID:chararray, EmployeeID:chararray,
OrderDate:chararray, RequiredDate:chararray, ShippedDate:chararray,
ShipCity:chararray, ShipCountry:chararray) ;
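To see what the capture groups return, the same kind of pattern can be tried outside Pig. The following is a minimal Python sketch, assuming the elements are separated only by whitespace and always appear in this order. Python's fullmatch stands in for the whole-string match, and the names FIELDS, PATTERN and extract are introduced here for illustration.

```python
import re

# The record shown on the "Sample Data" slide.
SAMPLE = """<orders>
<OrderID>10248</OrderID>
<CustomerID>VINET</CustomerID>
<EmployeeID>5</EmployeeID>
<OrderDate>1996-07-04T14:15:14.257</OrderDate>
<RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
<ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
<ShipCity>Reims</ShipCity>
<ShipCountry>France</ShipCountry>
</orders>"""

FIELDS = ["OrderID", "CustomerID", "EmployeeID", "OrderDate",
          "RequiredDate", "ShippedDate", "ShipCity", "ShipCountry"]

# One capture group per child element, with optional whitespace in between.
PATTERN = re.compile(
    r"<orders>\s*"
    + r"\s*".join(rf"<{name}>(.*)</{name}>" for name in FIELDS)
    + r"\s*</orders>"
)

def extract(doc):
    """Return a field-name -> text mapping, or None if the record does not match."""
    match = PATTERN.fullmatch(doc)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))

row = extract(SAMPLE)
print(row["OrderID"], row["ShipCity"], row["ShipCountry"])
```

Every value comes back as text, which mirrors the all-chararray schema: numbers and dates still have to be converted before they can be compared. A record with a missing or reordered element does not match at all in this sketch.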
Filter by Id (int)
• From this point on we can start querying the data.
• The first query filters by an integer. Because every column was mapped as chararray, we cast OrderID to int first.
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN
BY OrderID == 11066 ;
Filter by date, using GetYear
• The same goes for dates, only this time we'll also use the GetYear function.
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry,
GetYear(ToDate(ShippedDate)) AS
YearShippedDate;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN
BY YearShippedDate == 2014 ;
DUMP FILTER_CONV_CLEAN ;
Comparing Dates using DaysBetween
• DaysBetween returns the number of days from the second date to the first. The value can be:
  • Negative (ShippedDate is before 1998-05-04)
  • Positive (ShippedDate is after 1998-05-04)
  • Zero (ShippedDate is equal to 1998-05-04)
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID,
ShipCity,
ShipCountry,
ToDate(ShippedDate) AS GeneralDate,
DaysBetween(
ToDate(ShippedDate) ,
ToDate('1998-05-04', 'yyyy-MM-dd')
) ;
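The sign convention can be checked with a small Python sketch. It assumes the result is the whole-day difference "first date minus second date", truncated toward zero, which matches the three cases listed above. The helper days_between is a stand-in written for this example, not Pig's function.

```python
from datetime import datetime

def days_between(first, second):
    """Whole days from `second` to `first`, truncated toward zero.

    Assumed semantics, chosen to match the sign convention on the slide.
    """
    return int((first - second).total_seconds() / 86400)

reference = datetime(1998, 5, 4)
shipped = datetime.strptime("1996-07-16T14:15:14.257", "%Y-%m-%dT%H:%M:%S.%f")

print(days_between(shipped, reference))               # negative: shipped before the reference
print(days_between(datetime(1998, 5, 6), reference))  # positive: shipped after the reference
print(days_between(reference, reference))             # zero: same date
```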
Comparing Dates using DaysBetween
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry,
ToDate(ShippedDate) AS GeneralDate;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN BY
DaysBetween
(GeneralDate ,
ToDate('2014-11-29', 'yyyy-MM-dd')
) == (long)0;
DUMP FILTER_CONV_CLEAN ;
How many orders shipped per year and month?
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID,
GetYear(ToDate(ShippedDate)) AS YearShippedDate ,
GetMonth(ToDate(ShippedDate)) AS MonthShippedDate;
GROUP_CLEAN = GROUP GENERATE_CLEAN BY
(YearShippedDate,MonthShippedDate) ;
COUNT_ORDERS = FOREACH GROUP_CLEAN
GENERATE group,
COUNT(GENERATE_CLEAN.OrderID) AS Count ;
ORDER_COUNT_ORDERS = ORDER COUNT_ORDERS BY Count ;
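The group-then-count logic can be traced on a few values in Python. This is a sketch of the same idea rather than a translation of the script: apart from the first entry, the ShippedDate values are made up for the example, and orders_per_month is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%S.%f"

# Hypothetical ShippedDate values; only the first comes from the sample record.
shipped_dates = [
    "1996-07-16T14:15:14.257",
    "1996-07-23T00:00:00.000",
    "1996-08-02T00:00:00.000",
]

def orders_per_month(dates):
    """Count orders per (year, month), returned in ascending order of count."""
    counts = Counter()
    for text in dates:
        shipped = datetime.strptime(text, ISO)
        counts[(shipped.year, shipped.month)] += 1
    return sorted(counts.items(), key=lambda item: item[1])

result = orders_per_month(shipped_dates)
for (year, month), count in result:
    print(year, month, count)
```

The (year, month) tuple plays the role of the group key, and the final sort corresponds to ordering by the count column.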
List occurrences of "Germany" / "GER"
(case insensitive search)
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ;
FILTER_BY_SC = FILTER GENERATE_CLEAN
BY
UPPER(ShipCountry) == 'GERMANY'
OR
UPPER(ShipCountry) MATCHES '.*GER.*' ;
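Because the pattern must match the whole value, '.*GER.*' accepts any upper-cased country that contains GER. The Python sketch below shows two consequences, using fullmatch as the whole-string equivalent: the equality test against 'GERMANY' is already covered by the pattern, and names such as Algeria match as well. The list of countries and the helper is_german are made up for the example.

```python
import re

def is_german(ship_country):
    """Case-insensitive test: the upper-cased value must contain GER."""
    return re.fullmatch(r".*GER.*", ship_country.upper()) is not None

# Hypothetical input values chosen to show both intended and unintended hits.
countries = ["Germany", "France", "germany", "Algeria"]
matches = [country for country in countries if is_german(country)]
print(matches)
```

If only Germany is wanted, the equality comparison on its own is the stricter filter.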
Top 10 Orders
(by ShippedDate)
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ,
ToDate(ShippedDate) AS ShippedDate ;
ORDER_LIST = ORDER GENERATE_CLEAN BY
ShippedDate DESC ;
TOP_10 = LIMIT ORDER_LIST 10 ;
Top 3 ShippedDate per year
GENERATE_CLEAN = FOREACH CLEAN GENERATE
GetYear(ToDate(ShippedDate)) AS
YearCreationDate , ShipCity, ShipCountry,
ToDate(ShippedDate) AS ShipDates;
GROUP_CLEAN = GROUP GENERATE_CLEAN BY
YearCreationDate ;
TOP_3 = FOREACH GROUP_CLEAN
{
-- TOP(n, column index, relation): column 3 (zero-based) is ShipDates
RESULT = TOP(3, 3, GENERATE_CLEAN) ;
GENERATE group, RESULT;
} ;
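The per-group top-N step can be pictured in Python as follows. This is a sketch under assumptions: the rows are hypothetical apart from the Reims record, the tuple layout copies the column order of GENERATE_CLEAN, and top_n_per_year is a name invented for the example.

```python
import heapq
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (year, ShipCity, ShipCountry, shipped datetime).
rows = [
    (1996, "Reims", "France", datetime(1996, 7, 16)),
    (1996, "Lyon", "France", datetime(1996, 7, 25)),
    (1996, "Charleroi", "Belgium", datetime(1996, 8, 2)),
    (1996, "Bern", "Switzerland", datetime(1996, 9, 10)),
    (1997, "Graz", "Austria", datetime(1997, 1, 3)),
]

def top_n_per_year(rows, n=3, column=3):
    """Group rows by year, then keep the n rows with the largest value in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return {
        year: heapq.nlargest(n, group, key=lambda row: row[column])
        for year, group in groups.items()
    }

top = top_n_per_year(rows)
for year, best in top.items():
    print(year, [row[1] for row in best])
```

A group with fewer than three rows simply returns all of its rows.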
Using DaysBetween
GENERATE_CLEAN = FOREACH CLEAN GENERATE OrderID,
ShipCity, ShipCountry, RequiredDate, OrderDate,
DaysBetween(ToDate(RequiredDate),
ToDate(OrderDate)) AS DaysBetween ;
GENERATE_CLEAN = LIMIT GENERATE_CLEAN 1000 ;
ORDER_LIST =
ORDER GENERATE_CLEAN BY DaysBetween DESC ;
TOP_10 = LIMIT ORDER_LIST 10 ;

