MapReduce Application Scripting

8: MapReduce Application Scripting
Zubair Nabi
zubair.nabi@itu.edu.pk
May 25, 2013

Outline
1 Pig Latin
2 Cascading

Introduction
MapReduce is too low-level and rigid, and leads to a lot of custom user code
Pig Latin is a high-level dataflow language atop MapReduce, designed at Yahoo!
It finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce
The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop

SQL query to find the average pagerank for each large category of URLs
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000

Equivalent Pig query
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
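
The step-by-step structure of the Pig query can be mirrored in ordinary Java. The following is only a rough in-memory sketch, not how Pig executes the query: the class `PagerankDataflowSketch`, the `Url` type, and the `minGroupSize` parameter (standing in for the one-million threshold) are all names assumed for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogue of the four Pig Latin steps; no Pig or Hadoop involved.
class PagerankDataflowSketch {

    // Hypothetical type standing in for one tuple of the urls dataset.
    static final class Url {
        final String category;
        final double pagerank;

        Url(String category, double pagerank) {
            this.category = category;
            this.pagerank = pagerank;
        }
    }

    // minGroupSize is an assumed parameter replacing the fixed threshold.
    static Map<String, Double> averagePagerank(List<Url> urls, long minGroupSize) {
        // Step 1: good_urls = FILTER urls BY pagerank > 0.2
        List<Url> goodUrls = urls.stream()
                .filter(u -> u.pagerank > 0.2)
                .collect(Collectors.toList());

        // Step 2: groups = GROUP good_urls BY category
        Map<String, List<Url>> groups = goodUrls.stream()
                .collect(Collectors.groupingBy(u -> u.category));

        // Steps 3 and 4: keep only the big groups, then average each one
        Map<String, Double> output = new TreeMap<>();
        for (Map.Entry<String, List<Url>> group : groups.entrySet()) {
            if (group.getValue().size() > minGroupSize) {
                double avg = group.getValue().stream()
                        .mapToDouble(u -> u.pagerank)
                        .average()
                        .orElse(0.0);
                output.put(group.getKey(), avg);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Url> urls = Arrays.asList(
                new Url("news", 0.75), new Url("news", 0.25), new Url("news", 0.125),
                new Url("blogs", 0.5));
        // With a threshold of 1, only "news" (two good URLs) is large enough.
        System.out.println(averagePagerank(urls, 1));
    }
}
```

Each intermediate variable corresponds to one named relation in the Pig query, which is the property that makes the dataflow style easy to inspect step by step.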

Pig Interface
A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages
In contrast, SQL consists of declarative constraints that collectively define the result
Each step carries out a single data transformation
Writing a Pig Latin program is similar to specifying a query execution plan or a dataflow graph
Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed

Features
Support for a fully nested data model with complex data types
Extensive support for user-defined functions
Ability to operate over plain, schema-less input files
Open-source Apache project

Interoperability
Queries can be performed directly atop raw data dumps
The user needs to provide a function to parse the content of the file into tuples
Similarly, the user also needs to provide a function to convert tuples into a byte sequence
Datasets can be spread across diverse data storage sources and applications

UDFs as first-class citizens
A significant part of large-scale data analysis relies on custom processing
For instance, the user may be interested in figuring out whether a particular website is spam
All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs
UDFs can take non-atomic parameters as input and produce non-atomic values as output
UDFs are defined in Java
groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);
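
To illustrate what "non-atomic in, non-atomic out" means for a function like `top10`, here is a minimal plain-Java sketch. It does not use Pig's UDF interface: the class `TopNSketch`, the `UrlTuple` layout, and the assumption that `top10` ranks by pagerank are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for a UDF such as top10: a bag goes in, a bag comes out.
class TopNSketch {

    // Hypothetical tuple layout: (url, pagerank).
    static final class UrlTuple {
        final String url;
        final double pagerank;

        UrlTuple(String url, double pagerank) {
            this.url = url;
            this.pagerank = pagerank;
        }
    }

    // Returns up to n tuples with the highest pagerank, highest first.
    static List<UrlTuple> topN(List<UrlTuple> bag, int n) {
        List<UrlTuple> sorted = new ArrayList<>(bag); // leave the input bag untouched
        sorted.sort(Comparator.comparingDouble((UrlTuple t) -> t.pagerank).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        List<UrlTuple> bag = Arrays.asList(
                new UrlTuple("a.example", 0.25),
                new UrlTuple("b.example", 0.75),
                new UrlTuple("c.example", 0.5));
        for (UrlTuple t : topN(bag, 2)) {
            System.out.println(t.url + " " + t.pagerank);
        }
    }
}
```

In the query above, such a function would be applied once per group, receiving the bag of URLs for one category.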

Data Model
Pig has four data types:
1 Atom: A single atomic value such as a string or an integer
2 Tuple: A sequence of fields, each of which may have a different data type
3 Bag: A collection of tuples
4 Map: A collection of data items, each with an associated key
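
One way to picture how the four types nest is to model them with standard Java collections. This is only an analogy under assumed names (`PigDataModelSketch`, `buildExample`, and the sample values are made up); Pig has its own internal representations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough analogy between Pig's data types and Java collections.
class PigDataModelSketch {

    static Map<String, Object> buildExample() {
        // Atom: a single value
        Object atom = "alice";

        // Tuple: a sequence of fields that may differ in type
        List<Object> tuple = Arrays.<Object>asList("alice", 20);

        // Bag: a collection of tuples
        List<List<Object>> bag = new ArrayList<>();
        bag.add(Arrays.<Object>asList("lakers", 1));
        bag.add(Arrays.<Object>asList("iPod", 2));

        // Map: data items looked up by key; the items may themselves be nested
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("name", atom);
        map.put("profile", tuple);
        map.put("interests", bag);
        return map;
    }

    public static void main(String[] args) {
        System.out.println(buildExample());
    }
}
```

The point of the sketch is the nesting: a map item can be a bag, whose tuples can hold further values, which is what "fully nested" refers to.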

Commands
LOAD: Load and deserialize an input file
FOREACH: Process each tuple of a dataset
FILTER: Filter a dataset based on some condition or UDF
COGROUP: Group together tuples which are related in some way from one or more datasets
GROUP: Group together tuples which are related in some way from one dataset
STORE: Materialize the output of a Pig Latin expression to a file

Other Commands
UNION: Return the union of two or more bags
CROSS: Return the cross product of two or more bags
ORDER: Order a bag by a specified field
DISTINCT: Eliminate duplicate tuples in a bag

MapReduce in Pig Latin
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
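
The three statements above can be traced with a small in-memory Java sketch. Word count is used here as an assumed example of `map` and `reduce`; the class name `MapReduceDataflowSketch` and its methods are illustrative and unrelated to Pig or Hadoop APIs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory walk-through of map, group-by-key, and reduce.
class MapReduceDataflowSketch {

    // map: one input line produces zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce: all values for one key are folded into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> input) {
        // map_result = FOREACH input GENERATE FLATTEN(map(*));
        List<Map.Entry<String, Integer>> mapResult = new ArrayList<>();
        for (String line : input) {
            mapResult.addAll(map(line));
        }

        // key_groups = GROUP map_result BY $0;
        Map<String, List<Integer>> keyGroups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapResult) {
            keyGroups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                     .add(pair.getValue());
        }

        // output = FOREACH key_groups GENERATE reduce(*);
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : keyGroups.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

The `addAll` call plays the role of FLATTEN: each call to `map` returns a small collection, and the results are merged into one flat list before grouping.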

Outline
1 Pig Latin
2 Cascading

Introduction
Many applications require a chain of MapReduce jobs
Cascading allows the creation of processing pipelines using languages that run atop the JVM
Source-pipe-sink paradigm
Data comes from sources
Pipes perform data analysis
Results are written to sinks

Terminology
Pipe: data stream
Tuple: data record
Branch: chain of pipes
Pipe Assembly: set of pipe branches
Tap: data source or sink
Flow: pipe assembly bound to taps
Cascade: a collection of flows, in which one flow depends on the output of another

Pipes
Base class: Pipe
Each: Analyze, transform, or filter individual tuples
Merge: Combine streams with same fields into one
GroupBy: Group tuples based on common values in a specified field
CoGroup: Join streams (similar to SQL join)
Every: Aggregate tuples
HashJoin: Similar to CoGroup but more efficient if one stream can be held in memory

Pipe Assemblies
Define the processing of tuple streams
Tuples are read from and written to taps
Processing includes filtering, transforming, organizing, and calculating
Can use multiple taps
May also define splits, merges, and joins to manipulate tuple streams

Example: Pipe Assembly
// Left-hand branch: apply a function, then a filter, to each tuple
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
// Right-hand branch
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
// Join the two branches, aggregate, and regroup
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
// Tail of the assembly
join = new Each( join, new SomeFunction() );

Data Processing
Operation: Accepts an input tuple, processes it, and outputs zero or more tuples
Tuple: Array of fields
Field: Defines a data type, such as string, integer, etc.

Taps
Data flows in and out of taps
Represent data sources and sinks, such as local files, distributed FS files, etc.
Each tap is associated with a scheme that describes the data, such as TextLine, TextDelimited, etc.
Sinks have modes such as SinkMode.KEEP, SinkMode.REPLACE, and SinkMode.UPDATE

Flows
Represent entire pipelines
A pipeline reads data from a source, processes it, and then writes it to a sink

Example: Flow
// Assemble the two branches and join them
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
Pipe join = new CoGroup( lhs, rhs );
// Bind each branch head to a source tap and the tail to a sink tap
Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
Tap sink = new Hfs( new TextLine(), "output" );
FlowDef flowDef = new FlowDef()
  .setName( "flow-name" )
  .addSource( rhs, rhsSource )
  .addSource( lhs, lhsSource )
  .addTailSink( join, sink );
// Plan the flow for execution on Hadoop
Flow flow = new HadoopFlowConnector().connect( flowDef );

Operations
Operations manipulate data
Four kinds:
1 Function
2 Filter
3 Aggregator
4 Buffer
Operations take an input tuple and emit zero or more tuples
A Filter returns a Boolean indicating whether the tuple should be discarded
Operations must be wrapped in either an Each or an Every pipe
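
The difference between the operation kinds can be sketched without the Cascading library. In the plain-Java sketch below, `splitWords` behaves like a Function (one tuple in, zero or more out), `isShort` like a Filter (true means discard), and `count` like an Aggregator over one group; Buffer is not modeled. All class and method names are assumptions made for illustration, not Cascading's interfaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for Function, Filter, and Aggregator behavior.
class OperationKindsSketch {

    // Function-like: one input tuple yields zero or more output tuples.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Filter-like: returns true when the tuple should be discarded.
    static boolean isShort(String word) {
        return word.length() < 3;
    }

    // Aggregator-like: folds the tuples of one group into a single value.
    static int count(List<String> group) {
        return group.size();
    }

    // Each-like wrapper: applies the function, then the filter, tuple by tuple.
    static List<String> applyEach(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : splitWords(line)) {
                if (!isShort(word)) {
                    out.add(word);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> kept = applyEach(Arrays.asList("to be or not", "the end"));
        System.out.println(kept + " " + count(kept));
    }
}
```

The wrapper method mirrors why operations need a pipe around them: the operation defines what happens to one tuple or group, while the pipe decides which tuples or groups it is applied to.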

Example: Wordcount
// Source tap reads each input line into a "line" field
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
// Sink tap writes "word" and "count" fields, replacing any existing output
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "wordcount" );
// Emit one "word" tuple per run of non-whitespace characters in each line
String regex = "\\S+";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// Group by word and count the tuples in each group
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// Plan the flow for Hadoop and run it to completion
FlowConnector flowConnector = new HadoopFlowConnector();
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();

