Data Manipulation with Pig
Page 1
Wes Floyd - @weswfloyd
Page 2
Pig History
• Born from Yahoo! Research, then incubated at Apache
• Built to avoid low-level Map/Reduce programming without resorting to Hive/SQL queries
• Committers from: Yahoo, Hortonworks, LinkedIn,
SalesForce, IBM, Twitter, Netflix, and others
• Alan Gates on Pig
Page 3
Pig
• An engine for executing programs on top of
Hadoop
• It provides a language, Pig Latin, to specify these
programs
Page 4
HDP: Enterprise Hadoop Platform
Page 5
Hortonworks
Data Platform (HDP)
•  The ONLY 100% open source
and complete platform
•  Integrates full range of
enterprise-ready services
•  Certified and tested at scale
•  Engineered for deep
ecosystem interoperability
[HDP stack diagram: HDP runs on OS/VM, cloud, or appliance. Platform services provide enterprise readiness: high availability, disaster recovery, rolling upgrades, security, and snapshots. Operational services: Oozie, Ambari, Falcon*. Data services: Hive & HCatalog, Pig, HBase, with load & extract via Sqoop, Flume, NFS, WebHDFS, and Knox*. Hadoop core: HDFS, YARN, MapReduce, Tez.]
Why use Pig?
• Suppose you have user data in one file, website data
in another, and you need to find the top 5 most visited
sites by users aged 18 - 25
Page 6
In Map-Reduce
Page 7
170 lines of code, 4 hours to write
In Pig Latin
Users = load 'input/users' using PigStorage(',') as (name:chararray, age:int);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'input/pages' using PigStorage(',') as (user:chararray, url:chararray);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'output/top5sites' using PigStorage(',');
Page 8
9 lines of code, 15 minutes to write
170 lines to 9 lines of code
Essence of Pig
• Map-Reduce is too low-level, SQL too high-level
• Pig Latin, a language intended to sit between the two
– Provides standard relational transforms (join, sort, etc.)
– Schemas are optional, used when available, can be defined at
runtime
– User Defined Functions are first class citizens
Page 9
Pig Architecture
Page 10
• Pig client: parses, validates, optimizes, plans, and coordinates execution
• Data is stored in HDFS
• Processing is done via MapReduce on Hadoop
Pig Elements
Page 11
• Pig Latin: a high-level scripting language that requires no metadata or schema; statements are translated into a series of MapReduce jobs
• Grunt: an interactive shell
• Piggybank: a shared repository for User Defined Functions (UDFs)
Pig Latin Data Flow
Page 12
1. LOAD (HDFS/HCat): read the data to be manipulated from the file system
2. TRANSFORM (Pig): manipulate the data
3. DUMP or STORE (HDFS/HCat): output the data to the screen or store it for processing
In code:
• VARIABLE1 = LOAD [somedata]
• VARIABLE2 = [TRANSFORM operation]
• STORE VARIABLE2 INTO '[some location]'
Pig Relations
1. A bag is an unordered collection of tuples (the tuples can be different sizes).
2. A tuple is an ordered set of fields.
3. A field is a piece of data.
Pig Latin statements work with relations
[Diagram: a bag contains tuples, and each tuple contains fields.]
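To see these pieces in practice, GROUP nests a bag of tuples inside each output tuple. A sketch (the relation and field names are assumptions; the DESCRIBE output is illustrative):

grunt> emps = LOAD 'input/employees' USING PigStorage(',') AS (name:chararray, age:int);
grunt> grpd = GROUP emps BY age;
grunt> DESCRIBE grpd;
grpd: {group: int, emps: {(name: chararray, age: int)}}

Here grpd is a relation whose emps column is a bag, each element of that bag is a (name, age) tuple, and name and age are fields.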
FILTER, GROUP, FOREACH, ORDER
Page 14
logevents = LOAD 'input/my.log' AS (date:chararray, level:chararray, code:int, message:chararray);
severe = FILTER logevents BY (level == 'severe' AND code >= 500);
grouped = GROUP severe BY code;

e1 = LOAD 'pig/input/File1' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
f = FOREACH e1 GENERATE age, salary;
g = ORDER f BY age;
JOIN, GROUP, LIMIT
Page 15
employees = LOAD '[somefile]' AS (name:chararray, age:int, zip:int, salary:double);
agegroup = GROUP employees BY age;
h = LIMIT agegroup 100;

e1 = LOAD '[somefile]' USING PigStorage(',') AS (name:chararray, age:int, zip:int, salary:double);
e2 = LOAD '[somefile]' USING PigStorage(',') AS (name:chararray, phone:chararray);
e3 = JOIN e1 BY name, e2 BY name;
Pig Basics Demo
Page 16
Grunt Command Line Demo
Page 17
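A sketch of what a Grunt session looks like (the paths and schema are placeholder assumptions):

$ pig
grunt> emps = LOAD 'input/employees' USING PigStorage(',') AS (name:chararray, age:int);
grunt> young = FILTER emps BY age < 30;
grunt> DUMP young;

DUMP runs the pipeline and prints the resulting tuples to the screen; STORE would write them to the file system instead.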
Hive vs Pig
Page 18
Pig and Hive work well together
and many businesses use both.
Hive is a good choice:
•  when you want to query the data
•  when you need an answer to specific
questions
•  if you are familiar with SQL
Pig is a good choice:
•  for ETL (Extract -> Transform -> Load)
•  for preparing data for easier analysis
•  when you have a long series of steps to
perform
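As a minimal illustration of the difference (the table, file, and column names are assumptions): the same question, asked declaratively in Hive and expressed as a step-by-step data flow in Pig.

-- Hive
SELECT url, COUNT(*) FROM pageviews GROUP BY url;

-- Pig
views = LOAD 'input/pageviews' USING PigStorage(',') AS (user:chararray, url:chararray);
grpd = GROUP views BY url;
cntd = FOREACH grpd GENERATE group, COUNT(views);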
Tool Comparison
Page 19

Feature        | MapReduce       | Pig                                            | Hive
Record format  | Key-value pairs | Tuple                                          | Record
Data model     | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists, char, varchar, decimal, …
Schema         | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location  | Encoded in app  | Declared in script                             | Read from metadata
Data format    | Encoded in app  | Declared in script                             | Read from metadata
T-SQL vs Hadoop Ecosystem
Page 20
Feature                | T-SQL | Pig           | Hive
Query data             | Yes   | Yes (in bulk) | Yes
Local variables        | Yes   | Yes           | No
Conditional logic      | Yes   | Limited       | Limited
Procedural programming | Yes   | No            | No
UDFs                   | No    | Yes           | Yes
HCatalog: Data Sharing is Hard
Page 21
Photo Credit: totalAldo via Flickr
This is programmer Bob, he
uses Pig to crunch data.
This is analyst Joe, he uses Hive
to build reports and answer ad-hoc
queries.
Joe: “Bob, I need today’s data.”
Bob: “Ok.”
Joe: “Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command.”
Bob: “Dude, we need HCatalog.”
Pig Example
Page 22
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
Pig Example
Page 23
Assume you want to count how many times each of your users went to each of
your URLs
raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(url, user), COUNT(botless);
store cntd into '/data/counted/20120530';
Using HCatalog:
raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
Compared with the first version:
• No need to know the file location
• No need to declare the schema
• The ds == '20120530' condition is a partition filter
Tools With HCatalog
Page 24
Feature        | MapReduce + HCatalog                     | Pig + HCatalog                                 | Hive
Record format  | Record                                   | Tuple                                          | Record
Data model     | int, float, string, maps, structs, lists | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema         | Read from metadata                       | Read from metadata                             | Read from metadata
Data location  | Read from metadata                       | Read from metadata                             | Read from metadata
Data format    | Read from metadata                       | Read from metadata                             | Read from metadata
•  Pig/MR users can read schema from metadata
•  Pig/MR users are insulated from schema, location, and format changes
•  All users have access to other users’ data as soon as it is committed
Pig with HCat Demo
Page 25
Data & Metadata REST Services APIs
Page 26
[Diagram: existing and new applications call WebHCat RESTful web services, which sit in front of MapReduce, Pig, and Hive via HCatalog, over data in HDFS, HBase, or external stores.]
WebHDFS & WebHCat
provide RESTful API as
“front door” for Hadoop
•  Opens the door to
languages other than Java
•  Thin clients via web
services vs. fat-clients in
gateway
•  Insulation from interface
changes release to release
Opens Hadoop to integration with existing and new applications
RESTful API Access for Pig
• Code example
	
curl -s -d user.name=hue \
     -d execute="<pig script>" \
     'http://localhost:50111/templeton/v1/pig'
•  RestSharp (restsharp.org/)
– Simple REST and HTTP API Client
for .NET
Page 27
WebHCat REST API
Page 28
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

GET http://…/v1/ddl/database/default/table

{
  "tables": ["counted", "processed"],
  "database": "default"
}

Create a table:

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:

{
  "table": "rawevents",
  "database": "default"
}

Describing the table (a GET on the same URL) returns:

{
  "columns": [{"name": "url", "type": "string"},
              {"name": "user", "type": "string"}],
  "database": "default",
  "table": "rawevents"
}
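For example, the list-tables call above can be issued with curl; a sketch assuming the same host, port, and user as the earlier Pig-submission example:

curl -s 'http://localhost:50111/templeton/v1/ddl/database/default/table?user.name=hue'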
Pig with WebHCat Demo
Page 29
Hive-on-MR vs. Hive-on-Tez
Page 30

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: on MapReduce the query runs as a chain of separate map/reduce jobs, writing intermediate results to HDFS between each stage; on Tez the same query runs as a single DAG of map and reduce tasks. Tez avoids unneeded writes to HDFS.]
Pig on Tez - Design
Page 31

[Diagram: a Pig script's Logical Plan is translated to a Physical Plan by LogToPhyTranslationVisitor; TezCompiler turns the Physical Plan into a Tez Plan for the Tez execution engine, while MRCompiler turns it into an MR Plan for the MapReduce execution engine.]
Performance numbers
Page 32

[Chart: runtimes in seconds, Pig on MR vs. Pig on Tez. Speedups with Tez: Replicated Join 2.8x; Join + Groupby 1.5x; Join + Groupby + Orderby 1.5x; 3-way Split + Join + Groupby + Orderby 2.6x.]
User Defined Functions
• Ultimate in extensibility and portability
• Custom processing
– Java
– Python
– JavaScript
– Ruby
• Integration with MapReduce phases
– Map
– Combine
– Reduce
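As a sketch of the non-Java path, a Python UDF can be registered via Jython and called from Pig Latin. The file name, namespace, and function below are assumptions for illustration:

udfs.py:

@outputSchema("url:chararray")
def normalize(url):
    return url.lower()

Pig script:

register 'udfs.py' using jython as myfuncs;
cleaned = FOREACH raw GENERATE myfuncs.normalize(url);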
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
    implements Algebraic {
  …
}
• Algebraic functions
• 3-phase execution
– Map – called once for each tuple
– Combiner – called zero or more times for each map result
– Reduce – called once for each group
User Defined Functions
public class MyUDF extends EvalFunc<DataBag>
    implements Accumulator {
  …
}
• Accumulator functions
• Incremental processing of data
• Called in both map and reduce phases
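A hedged sketch of a COUNT-like accumulator UDF (the class name and logic are illustrative, not from the deck). Pig hands the function successive batches of tuples through accumulate() rather than materializing one potentially huge bag:

import java.io.IOException;
import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class CountRows extends EvalFunc<Long> implements Accumulator<Long> {
    private long count = 0;

    // Called repeatedly, once per batch of input tuples.
    @Override
    public void accumulate(Tuple b) throws IOException {
        DataBag bag = (DataBag) b.get(0);
        count += bag.size();
    }

    // Called once all batches for the current group are in.
    @Override
    public Long getValue() {
        return count;
    }

    // Called between groups so the instance can be reused.
    @Override
    public void cleanup() {
        count = 0;
    }

    // exec() is still required for plans where Pig cannot
    // use the accumulator interface.
    @Override
    public Long exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        return bag.size();
    }
}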
User Defined Functions
public class MyUDF extends FilterFunc {
  …
}
• Filter functions
• Return a boolean based on processing of the tuple
• Called in both map and reduce phases
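The NotABot function from the earlier example could be written as a FilterFunc. A minimal sketch, assuming a trivially simple heuristic (the real bot-detection logic is not shown in the deck):

package myudfs;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class NotABot extends FilterFunc {
    // Return true to keep the tuple, false to drop it.
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;
        }
        String user = input.get(0).toString();
        // Placeholder heuristic: assume bot accounts start with "bot".
        return !user.startsWith("bot");
    }
}

After register myudfs.jar; it can be called as filter raw by myudfs.NotABot(user), as in the HCatalog example.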
Questions & Answers
Page 37