Pig - Processing XML data
- 2. Copyright 2014 © Ram Kedem. All rights reserved. Not to be reproduced without written consent
ramkedem.com
Working with XML
• This presentation demonstrates how to process XML data using Pig.
• The data used to illustrate this topic is taken from the MSSQL Northwind database.
• This presentation is based on
• hadoop-1.1.2
• pig-0.11.1
Results may vary under different versions of Pig or Hadoop.
- 3.
Sample Data
<orders>
<OrderID>10248</OrderID>
<CustomerID>VINET</CustomerID>
<EmployeeID>5</EmployeeID>
<OrderDate>1996-07-04T14:15:14.257</OrderDate>
<RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
<ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
<ShipCity>Reims</ShipCity>
<ShipCountry>France</ShipCountry>
</orders>
- 4.
Load Data – Move into HDFS
Our first step is to copy the XML file into HDFS, since we are not working in Local Mode
• hadoop fs -put ORDERS_XML.xml /user/hduser/orders/orders_xml.xml
- 5.
Load XML Using Pig
Next, from the Pig shell, we'll register Piggybank and load the XML using XMLLoader. Each <orders> element is loaded as one chararray record.
REGISTER '/home/hduser/Downloads/piggybank.jar' ;
LOAD_ORDERS = LOAD '/user/hduser/orders/orders_xml.xml'
USING org.apache.pig.piggybank.storage.XMLLoader('orders')
AS (mydoc:chararray);
- 6.
XML to Pig Structure
• Next we'll translate the XML structure into a format Pig can understand.
• This phase involves two steps:
• Use a regular expression to translate the XML structure into a Pig "table" (GENERATE FLATTEN)
• Name each column in that table and give it a type (AS)
- 7.
XML to Pig Structure
-- \\s* matches the whitespace and line breaks between elements
CLEAN = FOREACH LOAD_ORDERS GENERATE
FLATTEN(REGEX_EXTRACT_ALL(mydoc,
'<orders>\\s*<OrderID>(.*)</OrderID>\\s*<CustomerID>(.*)</CustomerID>\\s*<EmployeeID>(.*)</EmployeeID>\\s*<OrderDate>(.*)</OrderDate>\\s*<RequiredDate>(.*)</RequiredDate>\\s*<ShippedDate>(.*)</ShippedDate>\\s*<ShipCity>(.*)</ShipCity>\\s*<ShipCountry>(.*)</ShipCountry>\\s*</orders>'))
AS (OrderID:chararray, CustomerID:chararray, EmployeeID:chararray, OrderDate:chararray,
RequiredDate:chararray, ShippedDate:chararray, ShipCity:chararray, ShipCountry:chararray) ;
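The pattern can be sanity-checked outside Pig before running a job. The following is a minimal Python sketch, not part of the original deck: it assumes the whitespace between elements is matched with \s* and that the whole record must match, as REGEX_EXTRACT_ALL is understood to require. The names FIELDS, PATTERN and extract_order are made up for this illustration.

```python
import re

# One <orders> element, in the form XMLLoader('orders') passes it to Pig.
RECORD = """<orders>
<OrderID>10248</OrderID>
<CustomerID>VINET</CustomerID>
<EmployeeID>5</EmployeeID>
<OrderDate>1996-07-04T14:15:14.257</OrderDate>
<RequiredDate>1996-08-01T14:15:14.257</RequiredDate>
<ShippedDate>1996-07-16T14:15:14.257</ShippedDate>
<ShipCity>Reims</ShipCity>
<ShipCountry>France</ShipCountry>
</orders>"""

# Hypothetical helper names; the field order follows the sample data.
FIELDS = ["OrderID", "CustomerID", "EmployeeID", "OrderDate",
          "RequiredDate", "ShippedDate", "ShipCity", "ShipCountry"]

# One capture group per element, \s* between consecutive elements.
PATTERN = (r"<orders>\s*"
           + r"\s*".join(rf"<{name}>(.*)</{name}>" for name in FIELDS)
           + r"\s*</orders>")


def extract_order(xml_text):
    """Return {field: captured text}, or None when the record does not match."""
    match = re.fullmatch(PATTERN, xml_text)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))
```

Calling extract_order(RECORD) yields one value per column. If the backslashes are dropped (s* instead of \s*), the line breaks between elements are no longer matched and the record is rejected.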
- 8.
Filter by Id (int)
• From this step on we can start querying our data.
• The first query filters by an int. As we mapped every column to chararray, we'll cast OrderID to int first.
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN
BY OrderID == 11066 ;
- 9.
Filter by date, using GetYear
• The same goes for dates, only this time we'll convert the string with ToDate and then apply the GetYear function
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry,
GetYear(ToDate(ShippedDate)) AS
YearShippedDate;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN
BY YearShippedDate == 2014 ;
DUMP FILTER_CONV_CLEAN ;
- 10.
Comparing Dates using DaysBetween
• DaysBetween returns the number of whole days from the second date to the first. The value can be:
• Negative (ShippedDate is before 1998-05-04)
• Positive (ShippedDate is after 1998-05-04)
• Zero (ShippedDate is less than one full day away from 1998-05-04)
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID,
ShipCity,
ShipCountry,
ToDate(ShippedDate) AS GeneralDate,
DaysBetween(
ToDate(ShippedDate) ,
ToDate('1998-05-04', 'yyyy-MM-dd')
) ;
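The sign convention can be illustrated outside Pig. This is a minimal Python sketch, not from the original deck: it assumes DaysBetween(first, second) counts whole days and truncates toward zero, which is how the underlying Joda-Time day difference is understood to behave. The function name days_between and the constant REFERENCE are made up for this illustration.

```python
from datetime import datetime, timedelta


def days_between(first, second):
    """Whole days from `second` to `first`, truncated toward zero.

    Assumed to mirror Pig's DaysBetween(first, second).
    """
    # Dividing two timedeltas gives a float; int() truncates toward zero.
    return int((first - second) / timedelta(days=1))


# Hypothetical reference date, matching the one used on the slide.
REFERENCE = datetime(1998, 5, 4)
```

A ShippedDate later on 1998-05-04 gives 0, a date several days earlier gives a negative value, and a date several days later gives a positive value.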
- 11.
Comparing Dates using DaysBetween
CONV_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry,
ToDate(ShippedDate) AS GeneralDate;
FILTER_CONV_CLEAN = FILTER CONV_CLEAN BY
DaysBetween
(GeneralDate ,
ToDate('2014-11-29', 'yyyy-MM-dd')
) == (long)0;
DUMP FILTER_CONV_CLEAN ;
- 12.
How many orders shipped per year and month?
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID,
GetYear(ToDate(ShippedDate)) AS YearShippedDate ,
GetMonth(ToDate(ShippedDate)) AS MonthShippedDate;
GROUP_CLEAN = GROUP GENERATE_CLEAN BY
(YearShippedDate,MonthShippedDate) ;
COUNT_ORDERS = FOREACH GROUP_CLEAN
GENERATE group,
COUNT(GENERATE_CLEAN.OrderID) AS Count ;
ORDER_COUNT_ORDERS = ORDER COUNT_ORDERS BY Count ;
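The GROUP / COUNT / ORDER pipeline above can be cross-checked on a small extract. This is a minimal Python sketch, not part of the original deck: it assumes ShippedDate strings in the format shown in the sample data, and the function name count_by_year_month is made up for this illustration.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the sample data, e.g. 1996-07-16T14:15:14.257
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def count_by_year_month(shipped_dates):
    """Count orders per (year, month) and sort ascending by count."""
    parsed = (datetime.strptime(text, DATE_FORMAT) for text in shipped_dates)
    counts = Counter((d.year, d.month) for d in parsed)
    # Equivalent in spirit to GROUP ... BY (year, month), COUNT, ORDER BY Count.
    return sorted(counts.items(), key=lambda item: item[1])
```

Each result entry pairs a (year, month) key with its order count, the same shape as the (group, Count) tuples produced in Pig.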
- 13.
List occurrences of "Germany" / "GER" (case-insensitive search)
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ;
FILTER_BY_SC = FILTER GENERATE_CLEAN
BY
UPPER(ShipCountry) == 'GERMANY'
OR
UPPER(ShipCountry) MATCHES '.*GER.*' ;
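MATCHES tests the whole string against the pattern, which is why the pattern is wrapped in .* on both sides. The following is a minimal Python sketch of the same filter, not from the original deck; re.fullmatch is used as the assumed counterpart of MATCHES, and the function name ships_to_germany is made up for this illustration.

```python
import re


def ships_to_germany(ship_country):
    """Case-insensitive test equivalent to the FILTER condition above."""
    upper = ship_country.upper()
    # The whole string must match, so '.*GER.*' means "contains GER".
    return upper == "GERMANY" or re.fullmatch(r".*GER.*", upper) is not None
```

Two things follow from this. The equality test is already covered by the pattern, so it is redundant. The pattern also accepts any country whose name contains "GER", such as Algeria or Nigeria, so a tighter pattern may be preferable when those values can occur.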
- 14.
Top 10 Orders (by ShippedDate)
GENERATE_CLEAN = FOREACH CLEAN GENERATE
(int)OrderID, ShipCity, ShipCountry ,
ToDate(ShippedDate) AS ShippedDate ;
ORDER_LIST = ORDER GENERATE_CLEAN BY
ShippedDate DESC ;
TOP_10 = LIMIT ORDER_LIST 10 ;
- 15.
Top 3 ShippedDate per year
GENERATE_CLEAN = FOREACH CLEAN GENERATE
GetYear(ToDate(ShippedDate)) AS
YearCreationDate , ShipCity, ShipCountry,
ToDate(ShippedDate) AS ShipDates;
GROUP_CLEAN = GROUP GENERATE_CLEAN BY
YearCreationDate ;
TOP_3 = FOREACH GROUP_CLEAN
{
-- TOP(n, column, bag): column 3 (counting from 0) is ShipDates
RESULT = TOP(3, 3, GENERATE_CLEAN) ;
GENERATE group, RESULT;
} ;
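The grouping and per-group TOP can be pictured as follows. This is a minimal Python sketch, not part of the original deck: rows are assumed to be tuples in the column order generated above (year, ShipCity, ShipCountry, ShipDates), and the function name top_n_per_year is made up for this illustration.

```python
import heapq
from collections import defaultdict


def top_n_per_year(rows, n=3, column=3):
    """Group rows by their first field and keep the n largest by `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # same role as GROUP ... BY year
    # heapq.nlargest returns the rows sorted from largest to smallest;
    # the bag returned by Pig's TOP carries no such ordering guarantee.
    return {
        year: heapq.nlargest(n, bag, key=lambda r: r[column])
        for year, bag in groups.items()
    }
```

A year with fewer than three orders simply returns all of its rows.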
- 16.
Using DaysBetween
GENERATE_CLEAN = FOREACH CLEAN GENERATE OrderID,
ShipCity, ShipCountry, RequiredDate, OrderDate,
DaysBetween(ToDate(RequiredDate),
ToDate(OrderDate)) AS DaysBetween ;
GENERATE_CLEAN = LIMIT GENERATE_CLEAN 1000 ;
ORDER_LIST =
ORDER GENERATE_CLEAN BY DaysBetween DESC ;
TOP_10 = LIMIT ORDER_LIST 10 ;