Pig is a platform for analyzing large datasets that runs on Hadoop. It uses a language called Pig Latin to describe data flows and transformations. Pig Latin scripts allow users to express data analysis workflows like joins and groups. While similar to SQL, Pig Latin describes how data flows rather than producing an inside-out query. Pig makes Hadoop data analysis easier through a simple language and execution on parallel hardware.
2. Why are we talking about Pig?
Originally developed at Yahoo!, now an Apache project
Engine for executing data flows in parallel on Hadoop
Includes a language called Pig Latin for expressing data flows
Easy to learn and extensible
Open source
3. What is a data flow language?
Allows us to describe how data should be loaded, read, processed and stored
Can be simple linear flows, e.g. word count
Or complex workflows that include joins
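As a rough illustration of a simple linear flow, word count might be sketched in Pig Latin as follows. The input path 'input.txt', the output directory 'wordcount_output' and the relation names are assumptions made for this sketch, not part of the original slides.

```pig
-- Load each line of the (assumed) input file as a single string
lines = LOAD 'input.txt' AS (line:chararray);
-- Split every line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Collect identical words together
grouped = GROUP words BY word;
-- Count the records in each group
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
-- Write the result to the (assumed) output directory
STORE counts INTO 'wordcount_output';
```

Each statement names an intermediate relation, so the script reads top to bottom as load, split, group, count, store.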
4. Is it like SQL?
Pig Latin does look a bit like SQL, e.g. JOIN, GROUP BY
But SQL is declarative: you state the result you want, not the steps to get it
In Pig you describe how the data flows, one step at a time
With SQL you end up producing an inside-out query, whereas with Pig you describe a pipeline
5. SQL example
SELECT c.CustomerName, t.TotalOrders, c.PostCode
FROM Customers c
INNER JOIN
(
    SELECT CustomerId, COUNT(OrderId) AS TotalOrders
    FROM Orders
    GROUP BY CustomerId
) AS t ON t.CustomerId = c.CustomerId
6. Same Query in Pig
orders = load 'Orders' as (CustomerId, OrderId);
grouped = group orders by CustomerId;
-- count the orders collected in each customer's bag
total = foreach grouped generate group as CustomerId, COUNT(orders) as TotalOrders;
customer = load 'Customers' as (CustomerId, CustomerName, PostCode);
result = join total by CustomerId, customer by CustomerId;
dump result;
13. Advantages of Pig
Easy to learn
Can achieve a lot with a small amount of code, e.g. the join example
Well-written scripts can be easy to read and easy to maintain
Has a local mode for testing scripts
Has a unit testing framework
14. Limitations of Pig
Unit testing
High level: you often need to drop down into custom UDFs
If you are proficient in C# or F#, a streaming job can sometimes be easier to test, e.g. a streaming unit can be covered by ordinary unit tests
Still doesn't play nicely in a Windows environment
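To make the UDF and streaming points above concrete, here is a minimal sketch of how a script might hand work to custom code. Everything named here is hypothetical and exists only for illustration: the jar 'myudfs.jar', the class com.example.pig.NormalisePostCode and the executable postcode_cleaner.exe. The executable is assumed to be available on every node.

```pig
-- Option 1: call a custom UDF (hypothetical jar and class)
REGISTER 'myudfs.jar';
DEFINE NormalisePostCode com.example.pig.NormalisePostCode();

customers = LOAD 'Customers' AS (CustomerId, CustomerName, PostCode);
cleaned = FOREACH customers GENERATE CustomerId, CustomerName,
          NormalisePostCode(PostCode) AS PostCode;

-- Option 2: stream records through an external program (hypothetical executable)
DEFINE clean_cmd `postcode_cleaner.exe`;
streamed = STREAM customers THROUGH clean_cmd
           AS (CustomerId, CustomerName, PostCode);
```

With streaming, the external program reads records on standard input and writes them to standard output, so it can be written and unit tested in whatever language the team already knows.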