SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Hands-­‐on	
  Pig	
  with	
  the	
  
NFL	
  Play	
  by	
  Play	
  Dataset	
  
Headline	
  Goes	
  Here	
  
Ryan	
  Bosshart	
  	
  |	
  	
  Systems	
  Engineer	
  	
  
Speaker	
  
Nov	
  2013	
  v1	
  Name	
  or	
  Subhead	
  Goes	
  Here	
  

1

DO	
  NOT	
  USE	
  PUBLICLY	
  
PRIOR	
  TO	
  10/23/12	
  
Outline	
  

•  What	
  is	
  Pig	
  
	
  

•  Pig	
  LaLn	
  by	
  Example	
  
	
  

•  Data	
  Model/Architecture	
  
	
  

•  Hands-­‐on	
  with	
  Pig	
  

2
What	
  is	
  Pig? 	
  	
  
Give	
  me	
  every	
  run	
  in	
  the	
  2010	
  season:	
  
	
  
SELECT	
  *	
  FROM	
  playbyplay	
  WHERE	
  playtype	
  =	
  "RUN”	
  and	
  year	
  =	
  2010;	
  

playbyplay	
  =	
  LOAD	
  'playbyplay’	
  ….;	
  
run_plays	
  =	
  FILTER	
  playbyplay	
  BY	
  (playtype=='RUN')	
  AND	
  (year==2010);	
  
DUMP	
  run_plays;	
  

3	
  
Components	
  
•  Pig	
  resides	
  on	
  user	
  machine	
  
•  Job	
  submiced	
  to	
  cluster	
  &	
  executed	
  on	
  cluster	
  
	
  
•  No	
  need	
  to	
  install	
  anything	
  extra	
  on	
  cluster	
  

Hadoop	
  
Cluster	
  

Pig	
  Input	
  
(Client	
  Machine)	
  

4

©2011 Cloudera, Inc. All Rights
Reserved.
Accessing Pig	
  
•  Grunt,	
  the	
  pig	
  shell	
  
•  Submit	
  a	
  script	
  directly	
  
•  PigServer	
  Java	
  class,	
  a	
  JDBC	
  like	
  interface	
  
•  Hue	
  
•  Allows	
  textual	
  &	
  graphical	
  scripLng	
  

5
Example	
  

6	
  
How	
  Pig	
  Works	
  

Pig	
  La2n:	
  Count	
  Job	
  
	
  
A = LOAD ‘myfile’
AS (x, y, z);
B = FILTER A by x> 0;
C = GROUP B BY x;
D = FOREACH A GENERATE
x, COUNT(B);
STORE D INTO ‘output’;

7

• 
• 
• 
• 
• 
• 

Parses	
  
Checks	
  
OpLmizes	
  
Plans	
  execuLon	
  
Submits	
  jar	
  to	
  Hadoop	
  
Monitors	
  job	
  progress	
  

ExecuLon	
  Plan	
  
Map:	
  Filter	
  
Reduce:	
  Counter	
  
Starting Grunt	
  
•  To	
  start	
  the	
  Pig	
  shell	
  (Grunt),	
  start	
  a	
  terminal	
  and	
  run	
  
$ pig
$ pig –x local

	
  

•  Should	
  see	
  a	
  prompt	
  like:	
  
grunt>

8

©2011 Cloudera, Inc. All Rights
Reserved.
Data Types	
  
•  Scalar	
  types	
  
• 
• 
• 
• 
• 
• 
	
  

Int	
  
Long	
  
Float	
  
Double	
  
Chararray	
  
Bytearray	
  

•  Complex	
  types	
  

•  Map:	
  associaLve	
  array	
  
•  Tuple:	
  ordered	
  list	
  of	
  data,	
  elements	
  may	
  be	
  of	
  any	
  scalar	
  or	
  complex	
  type	
  
•  Bag:	
  unordered	
  collecLon	
  of	
  tuples	
  

9

©2011 Cloudera, Inc. All Rights
Reserved.
Load Returns a Bag	
  

•  LOAD	
  statements	
  return	
  a	
  tuple.	
  Each	
  tuple	
  has	
  mulLple	
  
elements,	
  which	
  can	
  be	
  referenced	
  by	
  posiLon	
  or	
  by	
  name.	
  

	
  
arrests	
  =	
  LOAD	
  'arrests.csv'	
  USING	
  PigStorage(',')	
  AS(year:int,	
  team:chararray,	
  player:chararray);

•  A	
  set	
  of	
  tuples	
  is	
  referred	
  to	
  as	
  a	
  bag	
  (normally	
  unordered)

(2002,KC,Willie	
  Roaf)	
  
(2002,OAK,Darrell	
  Russell)	
  
(2002,NYJ,Aaron	
  Beasley)	
  

10

©2011 Cloudera, Inc. All Rights
Reserved.
Bags & FOREACH	
  

•  The	
  FOREACH…GENERATE	
  statement	
  iterates	
  over	
  the	
  members	
  
of	
  a	
  bag	
  
	
  
Players = FOREACH arrests GENERATE player;

	
  
•  The	
  result	
  of	
  a	
  FOREACH	
  is	
  another	
  bag	
  
•  Elements	
  are	
  named	
  as	
  the	
  input	
  bag	
  

11

©2011 Cloudera, Inc. All Rights
Reserved.
Positional Reference	
  

•  The	
  following	
  creates	
  idenLcal	
  output	
  data	
  
	
  
Players = FOREACH arrests GENERATE $2;

	
  
•  …But	
  the	
  elements	
  of	
  arrests	
  aren’t	
  named	
  “player”	
  –	
  unless	
  
you	
  do	
  this:	
  
	
  
Players = FOREACH arrest GENERATE $2 AS player;

12

©2011 Cloudera, Inc. All Rights
Reserved.
Grouping	
  
•  In	
  Pig	
  grouping	
  is	
  a	
  separate	
  operaLon	
  from	
  applying	
  aggregate	
  
funcLons	
  
	
  

•  The	
  output	
  of	
  the	
  group	
  statement	
  is	
  (key,	
  bag),	
  where	
  key	
  is	
  the	
  
group	
  key	
  and	
  bag	
  contains	
  a	
  tuple	
  for	
  every	
  record	
  with	
  that	
  
key	
  
	
  

arrests	
  =	
  LOAD	
  'arrests.csv'	
  USING	
  PigStorage(',')	
  AS(year:int,	
  team:chararray,	
  player:chararray);	
  
arrests_by_team	
  =	
  GROUP	
  arrests	
  BY	
  team;	
  

13

©2011 Cloudera, Inc. All Rights
Reserved.
Grouping & Types	
  

•  GROUP	
  BY	
  makes	
  an	
  output	
  bag	
  containing	
  tuples,	
  containing	
  
more	
  bags	
  
Gprd = GROUP arrests BY user;

•  In:	
  BagOf(year,	
  team,	
  player)	
  
	
  

•  Out:	
  BagOf(group,	
  BagOf(year,	
  team,	
  player),	
  named	
  arrests)	
  
	
  

•  The	
  grouping	
  item	
  is	
  always	
  named	
  “group”	
  

14
GROUP	
  arrests	
  BY	
  team;	
  
(2010,TEN,Derrick	
  Morgan)	
  
(2010,TEN,Vince	
  Young)	
  
(2010,TEN,Kenny	
  Bric)	
  
(2010,WAS,Fred	
  Davis)	
  
(2010,WAS,Albert	
  Haynesworth)	
  
(2010,WAS,Fred	
  Davis)	
  
(2010,WAS,Fred	
  Davis)	
  
(2010,WAS,Joe	
  Joseph)	
  

arrests	
  

15	
  

(TEN,	
  
{(2010,TEN,Derrick	
  Morgan),	
  
(2010,TEN,Vince	
  Young),	
  
(2010,TEN,Kenny	
  Bric)})	
  
	
  
(WAS,	
  
{(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Albert	
  Haynesworth),
(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Joe	
  Joseph)})	
  

(group,	
  arrests)	
  
CounLng	
  Arrests	
  by	
  Team	
  
num_arrests	
  =	
  FOREACH	
  arrests_by_team	
  GENERATE	
  group	
  AS	
  
team,	
  COUNT(arrests)	
  AS	
  total;	
  

(TEN,	
  
{(2010,TEN,Derrick	
  Morgan),	
  
(2010,TEN,Vince	
  Young),	
  
(2010,TEN,Kenny	
  Bric)})	
  
	
  
(WAS,	
  
{(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Albert	
  Haynesworth),
(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Fred	
  Davis),	
  
(2010,WAS,Joe	
  Joseph)})	
  
16	
  

	
  
	
  
Results:	
  
(SEA,20)	
  
(STL,9)	
  
(TEN,31)	
  
(WAS,16)	
  
	
  
	
  
	
  
	
  
Using Types	
  
• 
• 

By	
  default	
  Pig	
  treats	
  data	
  as	
  un-­‐typed	
  
User	
  can	
  declare	
  types	
  of	
  data	
  at	
  load	
  Lme	
  

	
  

arrests = LOAD 'arrests.csv' USING PigStorage(',')
AS(year:int, team:chararray, player:chararray);

•  If	
  data	
  type	
  is	
  not	
  declared	
  but	
  script	
  treats	
  value	
  as	
  a	
  certain	
  
type,	
  Pig	
  will	
  assume	
  it	
  is	
  of	
  that	
  type	
  and	
  cast	
  it	
  
	
  

arrests = LOAD 'arrests.csv' USING PigStorage(',')
AS(year, team, player);
Two_digit_year = FOREACH arrests
GENERATE year - 2000; -- cast to int

17
Ordering	
  
ordered_arrest_count = ORDER num_arrests BY total;

•  Sort	
  the	
  teams	
  by	
  total	
  number	
  of	
  arrests	
  
	
  
	
  

18

©2011 Cloudera, Inc. All Rights
Reserved.
Filtering	
  
•  Now	
  let’s	
  apply	
  a	
  filter	
  to	
  the	
  teams	
  so	
  that	
  we	
  only	
  get	
  the	
  
baddest	
  of	
  the	
  bad	
  	
  
subset = FILTER ordered_num_arrests BY (total>20)

Results:	
  
(KC,21)
(SD,21)
(CHI,23)
(IND,23)
(MIA,24)
(TB,25)
(JAC,25)
(DEN,30)
(TEN,31)
(CIN,32)
(MIN,38)
19

©2011 Cloudera, Inc. All Rights
Reserved.
Loading Data
•  Welcome	
  to	
  the	
  Pig	
  data	
  loader	
  
	
  

•  PigStorage:	
  loads/stores	
  relaLons	
  using	
  field-­‐delimited	
  text	
  format.	
  
	
  

•  BinStorage:	
  loads/stores	
  relaLons	
  from	
  or	
  to	
  binary	
  files	
  
	
  

•  BinaryStorage:	
  loads/stores	
  relaLons	
  containing	
  only	
  single-­‐field	
  
tuples	
  with	
  a	
  value	
  of	
  type	
  bytearray	
  
	
  

•  TextLoader:	
  Loads	
  relaLons	
  from	
  a	
  plain-­‐text	
  format	
  
	
  

•  PigDump:	
  Stores	
  relaLons	
  by	
  wriLng	
  the	
  toString()	
  represetaLon	
  of	
  
tuples,	
  one	
  per	
  line.	
  

20

©2011 Cloudera, Inc. All Rights
Reserved.
Sharing	
  Metadata	
  
•  Use

HCatalog
$ pig –useHCatalog
grunt> playbyplay= LOAD ’playbyplay' USING
org.apache.hcatalog.pig.HCatLoader();
grunt> STORE newdata INTO ’newtable' USING
org.apache.hcatalog.pig.HCatStorer();

• 

*need to upload some jars to enable this:
• 

http://blog.cloudera.com/blog/2013/08/demo-using-hue-toaccess-hive-data-through-pig/
User Defined Functions	
  
•  Pig	
  provides	
  two	
  statements:	
  Register	
  &	
  Define	
  
•  Register:	
  Registers	
  a	
  JAR	
  file	
  with	
  the	
  Pig	
  runLme	
  
•  Define:	
  Creates	
  an	
  alias	
  for	
  a	
  UDF	
  or	
  streaming	
  script	
  
•  Can	
  be	
  used	
  to	
  do	
  column	
  transformaLon,	
  filtering,	
  ordering,	
  
custom	
  aggregaLon.	
  
•  For	
  example,	
  you	
  want	
  to	
  write	
  custom	
  logic	
  to	
  do	
  user	
  
session	
  analysis:	
  
	
  

log

= LOAD ‘excite-small.log’
AS (user, time, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE
group, SessionAnalysis(log);
STORE cntd INTO ‘output’;

22

©2011 Cloudera, Inc. All Rights
Reserved.
Try it!
$ pig
grunt> arrests = LOAD 'arrests.csv' USING PigStorage(',') AS
(year,team,player);
grunt> grouped_arrests = GROUP arrests BY team;
grunt> num_arrests = FOREACH grouped_arrests GENERATE group AS
team, COUNT(arrests) AS total;
grunt> ordered_arrests = ORDER num_arrests BY total;
grunt> bad_boys = FILTER ordered_arrests BY (total>20);

23
24

Más contenido relacionado

Was ist angesagt?

龍華大學前端技術分享 Part1
龍華大學前端技術分享 Part1龍華大學前端技術分享 Part1
龍華大學前端技術分享 Part1Jia-Long Yeh
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsServer Density
 
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...DevOpsDays Tel Aviv
 
SPL - The Undiscovered Library - PHPBarcelona 2015
SPL - The Undiscovered Library - PHPBarcelona 2015SPL - The Undiscovered Library - PHPBarcelona 2015
SPL - The Undiscovered Library - PHPBarcelona 2015Mark Baker
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Scala - den smarta kusinen
Scala - den smarta kusinenScala - den smarta kusinen
Scala - den smarta kusinenRedpill Linpro
 
The Ring programming language version 1.6 book - Part 42 of 189
The Ring programming language version 1.6 book - Part 42 of 189The Ring programming language version 1.6 book - Part 42 of 189
The Ring programming language version 1.6 book - Part 42 of 189Mahmoud Samir Fayed
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudMongoDB
 
Cassandra deep-dive @ NoSQLNow!
Cassandra deep-dive @ NoSQLNow!Cassandra deep-dive @ NoSQLNow!
Cassandra deep-dive @ NoSQLNow!Acunu
 
Advanced Redis data structures
Advanced Redis data structuresAdvanced Redis data structures
Advanced Redis data structuresamix3k
 
아파트 정보를 이용한 ELK stack 활용 - 오근문
아파트 정보를 이용한 ELK stack 활용 - 오근문아파트 정보를 이용한 ELK stack 활용 - 오근문
아파트 정보를 이용한 ELK stack 활용 - 오근문NAVER D2
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonJoe Stein
 
The Internet Is a Series of Tubes
The Internet Is a Series of TubesThe Internet Is a Series of Tubes
The Internet Is a Series of TubesJohn Hobbs
 
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...Kory Becker
 
Presto Overfview
Presto OverfviewPresto Overfview
Presto OverfviewMiguel Ping
 
Webinar: Replication and Replica Sets
Webinar: Replication and Replica SetsWebinar: Replication and Replica Sets
Webinar: Replication and Replica SetsMongoDB
 
SPL: The Undiscovered Library - DataStructures
SPL: The Undiscovered Library -  DataStructuresSPL: The Undiscovered Library -  DataStructures
SPL: The Undiscovered Library - DataStructuresMark Baker
 

Was ist angesagt? (20)

龍華大學前端技術分享 Part1
龍華大學前端技術分享 Part1龍華大學前端技術分享 Part1
龍華大學前端技術分享 Part1
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & Analytics
 
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...
Massively Distributed Backups at Facebook Scale - Shlomo Priymak, Facebook - ...
 
SPL - The Undiscovered Library - PHPBarcelona 2015
SPL - The Undiscovered Library - PHPBarcelona 2015SPL - The Undiscovered Library - PHPBarcelona 2015
SPL - The Undiscovered Library - PHPBarcelona 2015
 
Presentatie - Introductie in Groovy
Presentatie - Introductie in GroovyPresentatie - Introductie in Groovy
Presentatie - Introductie in Groovy
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Scala - den smarta kusinen
Scala - den smarta kusinenScala - den smarta kusinen
Scala - den smarta kusinen
 
The Ring programming language version 1.6 book - Part 42 of 189
The Ring programming language version 1.6 book - Part 42 of 189The Ring programming language version 1.6 book - Part 42 of 189
The Ring programming language version 1.6 book - Part 42 of 189
 
High Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal CloudHigh Performance, Scalable MongoDB in a Bare Metal Cloud
High Performance, Scalable MongoDB in a Bare Metal Cloud
 
Cassandra deep-dive @ NoSQLNow!
Cassandra deep-dive @ NoSQLNow!Cassandra deep-dive @ NoSQLNow!
Cassandra deep-dive @ NoSQLNow!
 
Advanced Redis data structures
Advanced Redis data structuresAdvanced Redis data structures
Advanced Redis data structures
 
아파트 정보를 이용한 ELK stack 활용 - 오근문
아파트 정보를 이용한 ELK stack 활용 - 오근문아파트 정보를 이용한 ELK stack 활용 - 오근문
아파트 정보를 이용한 ELK stack 활용 - 오근문
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
The Internet Is a Series of Tubes
The Internet Is a Series of TubesThe Internet Is a Series of Tubes
The Internet Is a Series of Tubes
 
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...
Machine Learning in a Flash (Extended Edition 2): An Introduction to Neural N...
 
Presto Overfview
Presto OverfviewPresto Overfview
Presto Overfview
 
Webinar: Replication and Replica Sets
Webinar: Replication and Replica SetsWebinar: Replication and Replica Sets
Webinar: Replication and Replica Sets
 
SPL: The Undiscovered Library - DataStructures
SPL: The Undiscovered Library -  DataStructuresSPL: The Undiscovered Library -  DataStructures
SPL: The Undiscovered Library - DataStructures
 
Intro to The PHP SPL
Intro to The PHP SPLIntro to The PHP SPL
Intro to The PHP SPL
 
MongoDB-SESSION03
MongoDB-SESSION03MongoDB-SESSION03
MongoDB-SESSION03
 

Andere mochten auch

Hc 16- presentatie humanisering en zingeving - pdw
Hc 16- presentatie humanisering en zingeving - pdwHc 16- presentatie humanisering en zingeving - pdw
Hc 16- presentatie humanisering en zingeving - pdwHVMiddenHolland
 
April Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionApril Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionRyan Bosshart
 
Verslag humanistisch lab 20 oktober HV MiddenHolland
Verslag humanistisch lab 20 oktober HV MiddenHollandVerslag humanistisch lab 20 oktober HV MiddenHolland
Verslag humanistisch lab 20 oktober HV MiddenHollandHVMiddenHolland
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
The Skywoods E-brochure
The Skywoods E-brochureThe Skywoods E-brochure
The Skywoods E-brochureAlan Low
 
The introduction of the course
The introduction of the courseThe introduction of the course
The introduction of the courseliyanyi
 
Intercultural communication the first week
Intercultural communication the first weekIntercultural communication the first week
Intercultural communication the first weekliyanyi
 
Hadoop hands on madison
Hadoop hands on madisonHadoop hands on madison
Hadoop hands on madisonRyan Bosshart
 
Guia de investigacion 1 cande de la torre
Guia de investigacion 1 cande de la torreGuia de investigacion 1 cande de la torre
Guia de investigacion 1 cande de la torrecandeladela
 
Diapos de ruptura uterina
Diapos de ruptura uterinaDiapos de ruptura uterina
Diapos de ruptura uterinasolartum
 
2012-13 Attendance Data
2012-13 Attendance Data2012-13 Attendance Data
2012-13 Attendance Datadcpcsb
 
ASTEC AMERICA - LETTER OF RECOMMENDATION
ASTEC AMERICA - LETTER OF RECOMMENDATIONASTEC AMERICA - LETTER OF RECOMMENDATION
ASTEC AMERICA - LETTER OF RECOMMENDATIONNicole Martin
 
RoyAscottSyncreticCybernetics (9thSHBiennial2012)
RoyAscottSyncreticCybernetics (9thSHBiennial2012)RoyAscottSyncreticCybernetics (9thSHBiennial2012)
RoyAscottSyncreticCybernetics (9thSHBiennial2012)Xiaoying Juliette Yuan
 

Andere mochten auch (18)

Hc 16- presentatie humanisering en zingeving - pdw
Hc 16- presentatie humanisering en zingeving - pdwHc 16- presentatie humanisering en zingeving - pdw
Hc 16- presentatie humanisering en zingeving - pdw
 
April Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionApril Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On Session
 
Verslag humanistisch lab 20 oktober HV MiddenHolland
Verslag humanistisch lab 20 oktober HV MiddenHollandVerslag humanistisch lab 20 oktober HV MiddenHolland
Verslag humanistisch lab 20 oktober HV MiddenHolland
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Gender Constructs - Snow white
Gender Constructs - Snow whiteGender Constructs - Snow white
Gender Constructs - Snow white
 
The Skywoods E-brochure
The Skywoods E-brochureThe Skywoods E-brochure
The Skywoods E-brochure
 
Magic
MagicMagic
Magic
 
The introduction of the course
The introduction of the courseThe introduction of the course
The introduction of the course
 
Intercultural communication the first week
Intercultural communication the first weekIntercultural communication the first week
Intercultural communication the first week
 
Language acquisition
Language acquisitionLanguage acquisition
Language acquisition
 
Hadoop hands on madison
Hadoop hands on madisonHadoop hands on madison
Hadoop hands on madison
 
Guia de investigacion 1 cande de la torre
Guia de investigacion 1 cande de la torreGuia de investigacion 1 cande de la torre
Guia de investigacion 1 cande de la torre
 
Diapos de ruptura uterina
Diapos de ruptura uterinaDiapos de ruptura uterina
Diapos de ruptura uterina
 
2012-13 Attendance Data
2012-13 Attendance Data2012-13 Attendance Data
2012-13 Attendance Data
 
ASTEC AMERICA - LETTER OF RECOMMENDATION
ASTEC AMERICA - LETTER OF RECOMMENDATIONASTEC AMERICA - LETTER OF RECOMMENDATION
ASTEC AMERICA - LETTER OF RECOMMENDATION
 
RoyAscottSyncreticCybernetics (9thSHBiennial2012)
RoyAscottSyncreticCybernetics (9thSHBiennial2012)RoyAscottSyncreticCybernetics (9thSHBiennial2012)
RoyAscottSyncreticCybernetics (9thSHBiennial2012)
 

Ähnlich wie Pig Hands On November

Dachis group pigout_101
Dachis group pigout_101Dachis group pigout_101
Dachis group pigout_101ktsafford
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go ProgrammingLin Yo-An
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
03 introduction to graph databases
03   introduction to graph databases03   introduction to graph databases
03 introduction to graph databasesNeo4j
 
Atlassian Groovy Plugins
Atlassian Groovy PluginsAtlassian Groovy Plugins
Atlassian Groovy PluginsPaul King
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesEugene Dvorkin
 
Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql databasebigdatagurus_meetup
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)nikomatsakis
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraDataStax
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Pig Introduction to Pig
Pig Introduction to PigPig Introduction to Pig
Pig Introduction to PigChris Wilkes
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Oscon Java Testing on the Fast Lane
Oscon Java Testing on the Fast LaneOscon Java Testing on the Fast Lane
Oscon Java Testing on the Fast LaneAndres Almiray
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases MongoDB
 

Ähnlich wie Pig Hands On November (20)

Apache pig
Apache pigApache pig
Apache pig
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
Dachis group pigout_101
Dachis group pigout_101Dachis group pigout_101
Dachis group pigout_101
 
Happy Go Programming
Happy Go ProgrammingHappy Go Programming
Happy Go Programming
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
03 introduction to graph databases
03   introduction to graph databases03   introduction to graph databases
03 introduction to graph databases
 
Pig workshop
Pig workshopPig workshop
Pig workshop
 
Atlassian Groovy Plugins
Atlassian Groovy PluginsAtlassian Groovy Plugins
Atlassian Groovy Plugins
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and GrailsPhilip Stehlik at TechTalks.ph - Intro to Groovy and Grails
Philip Stehlik at TechTalks.ph - Intro to Groovy and Grails
 
Hypertable - massively scalable nosql database
Hypertable - massively scalable nosql databaseHypertable - massively scalable nosql database
Hypertable - massively scalable nosql database
 
Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)Rust: Reach Further (from QCon Sao Paolo 2018)
Rust: Reach Further (from QCon Sao Paolo 2018)
 
C*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with CassandraC*ollege Credit: Creating Your First App in Java with Cassandra
C*ollege Credit: Creating Your First App in Java with Cassandra
 
Unit testing pig
Unit testing pigUnit testing pig
Unit testing pig
 
Practical pig
Practical pigPractical pig
Practical pig
 
Pig Introduction to Pig
Pig Introduction to PigPig Introduction to Pig
Pig Introduction to Pig
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Oscon Java Testing on the Fast Lane
Oscon Java Testing on the Fast LaneOscon Java Testing on the Fast Lane
Oscon Java Testing on the Fast Lane
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 

Último

Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Muhammad Tiham Siddiqui
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxSatishbabu Gunukula
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0DanBrown980551
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveIES VE
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingFrancesco Corti
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 

Último (20)

SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)Trailblazer Community - Flows Workshop (Session 2)
Trailblazer Community - Flows Workshop (Session 2)
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptx
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is going
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 

Pig Hands On November

  • 1. Hands-­‐on  Pig  with  the   NFL  Play  by  Play  Dataset   Headline  Goes  Here   Ryan  Bosshart    |    Systems  Engineer     Speaker   Nov  2013  v1  Name  or  Subhead  Goes  Here   1 DO  NOT  USE  PUBLICLY   PRIOR  TO  10/23/12  
  • 2. Outline   •  What  is  Pig     •  Pig  LaLn  by  Example     •  Data  Model/Architecture     •  Hands-­‐on  with  Pig   2
  • 3. What  is  Pig?     Give  me  every  run  in  the  2010  season:     SELECT  *  FROM  playbyplay  WHERE  playtype  =  "RUN”  and  year  =  2010;   playbyplay  =  LOAD  'playbyplay’  ….;   run_plays  =  FILTER  playbyplay  BY  (playtype=='RUN')  AND  (year==2010);   DUMP  run_plays;   3  
  • 4. Components   •  Pig  resides  on  user  machine   •  Job  submiced  to  cluster  &  executed  on  cluster     •  No  need  to  install  anything  extra  on  cluster   Hadoop   Cluster   Pig  Input   (Client  Machine)   4 ©2011 Cloudera, Inc. All Rights Reserved.
  • 5. Accessing Pig   •  Grunt,  the  pig  shell   •  Submit  a  script  directly   •  PigServer  Java  class,  a  JDBC  like  interface   •  Hue   •  Allows  textual  &  graphical  scripLng   5
  • 7. How  Pig  Works   Pig  La2n:  Count  Job     A = LOAD ‘myfile’ AS (x, y, z); B = FILTER A by x> 0; C = GROUP B BY x; D = FOREACH A GENERATE x, COUNT(B); STORE D INTO ‘output’; 7 •  •  •  •  •  •  Parses   Checks   OpLmizes   Plans  execuLon   Submits  jar  to  Hadoop   Monitors  job  progress   ExecuLon  Plan   Map:  Filter   Reduce:  Counter  
  • 8. Starting Grunt   •  To  start  the  Pig  shell  (Grunt),  start  a  terminal  and  run   $ pig $ pig –x local   •  Should  see  a  prompt  like:   grunt> 8 ©2011 Cloudera, Inc. All Rights Reserved.
  • 9. Data Types   •  Scalar  types   •  •  •  •  •  •    Int   Long   Float   Double   Chararray   Bytearray   •  Complex  types   •  Map:  associaLve  array   •  Tuple:  ordered  list  of  data,  elements  may  be  of  any  scalar  or  complex  type   •  Bag:  unordered  collecLon  of  tuples   9 ©2011 Cloudera, Inc. All Rights Reserved.
  • 10. Load Returns a Bag   •  LOAD  statements  return  a  tuple.  Each  tuple  has  mulLple   elements,  which  can  be  referenced  by  posiLon  or  by  name.     arrests  =  LOAD  'arrests.csv'  USING  PigStorage(',')  AS(year:int,  team:chararray,  player:chararray); •  A  set  of  tuples  is  referred  to  as  a  bag  (normally  unordered) (2002,KC,Willie  Roaf)   (2002,OAK,Darrell  Russell)   (2002,NYJ,Aaron  Beasley)   10 ©2011 Cloudera, Inc. All Rights Reserved.
  • 11. Bags & FOREACH   •  The  FOREACH…GENERATE  statement  iterates  over  the  members   of  a  bag     Players = FOREACH arrests GENERATE player;   •  The  result  of  a  FOREACH  is  another  bag   •  Elements  are  named  as  the  input  bag   11 ©2011 Cloudera, Inc. All Rights Reserved.
  • 12. Positional Reference   •  The  following  creates  idenLcal  output  data     Players = FOREACH arrests GENERATE $2;   •  …But  the  elements  of  arrests  aren’t  named  “player”  –  unless   you  do  this:     Players = FOREACH arrest GENERATE $2 AS player; 12 ©2011 Cloudera, Inc. All Rights Reserved.
  • 13. Grouping   •  In  Pig  grouping  is  a  separate  operaLon  from  applying  aggregate   funcLons     •  The  output  of  the  group  statement  is  (key,  bag),  where  key  is  the   group  key  and  bag  contains  a  tuple  for  every  record  with  that   key     arrests  =  LOAD  'arrests.csv'  USING  PigStorage(',')  AS(year:int,  team:chararray,  player:chararray);   arrests_by_team  =  GROUP  arrests  BY  team;   13 ©2011 Cloudera, Inc. All Rights Reserved.
  • 14. Grouping & Types   •  GROUP  BY  makes  an  output  bag  containing  tuples,  containing   more  bags   Gprd = GROUP arrests BY user; •  In:  BagOf(year,  team,  player)     •  Out:  BagOf(group,  BagOf(year,  team,  player),  named  arrests)     •  The  grouping  item  is  always  named  “group”   14
  • 15. GROUP  arrests  BY  team;   (2010,TEN,Derrick  Morgan)   (2010,TEN,Vince  Young)   (2010,TEN,Kenny  Bric)   (2010,WAS,Fred  Davis)   (2010,WAS,Albert  Haynesworth)   (2010,WAS,Fred  Davis)   (2010,WAS,Fred  Davis)   (2010,WAS,Joe  Joseph)   arrests   15   (TEN,   {(2010,TEN,Derrick  Morgan),   (2010,TEN,Vince  Young),   (2010,TEN,Kenny  Bric)})     (WAS,   {(2010,WAS,Fred  Davis),   (2010,WAS,Albert  Haynesworth), (2010,WAS,Fred  Davis),   (2010,WAS,Fred  Davis),   (2010,WAS,Joe  Joseph)})   (group,  arrests)  
  • 16. CounLng  Arrests  by  Team   num_arrests  =  FOREACH  arrests_by_team  GENERATE  group  AS   team,  COUNT(arrests)  AS  total;   (TEN,   {(2010,TEN,Derrick  Morgan),   (2010,TEN,Vince  Young),   (2010,TEN,Kenny  Bric)})     (WAS,   {(2010,WAS,Fred  Davis),   (2010,WAS,Albert  Haynesworth), (2010,WAS,Fred  Davis),   (2010,WAS,Fred  Davis),   (2010,WAS,Joe  Joseph)})   16       Results:   (SEA,20)   (STL,9)   (TEN,31)   (WAS,16)          
  • 17. Using Types   •  •  By  default  Pig  treats  data  as  un-­‐typed   User  can  declare  types  of  data  at  load  Lme     arrests = LOAD 'arrests.csv' USING PigStorage(',') AS(year:int, team:chararray, player:chararray); •  If  data  type  is  not  declared  but  script  treats  value  as  a  certain   type,  Pig  will  assume  it  is  of  that  type  and  cast  it     arrests = LOAD 'arrests.csv' USING PigStorage(',') AS(year, team, player); Two_digit_year = FOREACH arrests GENERATE year - 2000; -- cast to int 17
  • 18. Ordering   ordered_arrest_count = ORDER num_arrests BY total; •  Sort  the  teams  by  total  number  of  arrests       18 ©2011 Cloudera, Inc. All Rights Reserved.
  • 19. Filtering   •  Now  let’s  apply  a  filter  to  the  teams  so  that  we  only  get  the   baddest  of  the  bad     subset = FILTER ordered_num_arrests BY (total>20) Results:   (KC,21) (SD,21) (CHI,23) (IND,23) (MIA,24) (TB,25) (JAC,25) (DEN,30) (TEN,31) (CIN,32) (MIN,38) 19 ©2011 Cloudera, Inc. All Rights Reserved.
  • 20. Loading Data •  Welcome  to  the  Pig  data  loader     •  PigStorage:  loads/stores  relaLons  using  field-­‐delimited  text  format.     •  BinStorage:  loads/stores  relaLons  from  or  to  binary  files     •  BinaryStorage:  loads/stores  relaLons  containing  only  single-­‐field   tuples  with  a  value  of  type  bytearray     •  TextLoader:  Loads  relaLons  from  a  plain-­‐text  format     •  PigDump:  Stores  relaLons  by  wriLng  the  toString()  represetaLon  of   tuples,  one  per  line.   20 ©2011 Cloudera, Inc. All Rights Reserved.
  • 21. Sharing  Metadata   •  Use HCatalog $ pig –useHCatalog grunt> playbyplay= LOAD ’playbyplay' USING org.apache.hcatalog.pig.HCatLoader(); grunt> STORE newdata INTO ’newtable' USING org.apache.hcatalog.pig.HCatStorer(); •  *need to upload some jars to enable this: •  http://blog.cloudera.com/blog/2013/08/demo-using-hue-toaccess-hive-data-through-pig/
  • 22. User Defined Functions   •  Pig  provides  two  statements:  Register  &  Define   •  Register:  Registers  a  JAR  file  with  the  Pig  runLme   •  Define:  Creates  an  alias  for  a  UDF  or  streaming  script   •  Can  be  used  to  do  column  transformaLon,  filtering,  ordering,   custom  aggregaLon.   •  For  example,  you  want  to  write  custom  logic  to  do  user   session  analysis:     log = LOAD ‘excite-small.log’ AS (user, time, query); grpd = GROUP log BY user; cntd = FOREACH grpd GENERATE group, SessionAnalysis(log); STORE cntd INTO ‘output’; 22 ©2011 Cloudera, Inc. All Rights Reserved.
  • 23. Try it! $ pig grunt> arrests = LOAD 'arrests.csv' USING PigStorage(',') AS (year,team,player); grunt> grouped_arrests = GROUP arrests BY team; grunt> num_arrests = FOREACH grouped_arrests GENERATE group AS team, COUNT(arrests) AS total; grunt> ordered_arrests = ORDER num_arrests BY total; grunt> bad_boys = FILTER ordered_arrests BY (total>20); 23
  • 24. 24