MapReduce Application Scripting
1. 8: MapReduce Application Scripting
Zubair Nabi
zubair.nabi@itu.edu.pk
May 25, 2013
Zubair Nabi 8: MapReduce Application Scripting May 25, 2013 1 / 28
2. Outline
1 Pig Latin
2 Cascading
7. Introduction
MapReduce is too low-level and rigid and leads to lots of custom user
code
Pig Latin is a dataflow language atop MapReduce designed by
Yahoo!
Finds the sweet spot between the declarative style of SQL and the
low-level interface of MapReduce
The Pig system compiles Pig Latin queries into physical plans that are
executed atop Hadoop
8. SQL query to find average pagerank for each large category
of URLs
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
9. Equivalent Pig query
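-- Keep only URLs with a high enough pagerank
good_urls = FILTER urls BY pagerank > 0.2;
-- Collect the surviving URLs into one bag per category
groups = GROUP good_urls BY category;
-- Keep only categories with more than a million URLs
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
-- Emit the average pagerank of each large category
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);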
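For readers more at home in Java, the four steps of the Pig query can be mirrored on a single machine with the standard streams API. This is only a sketch: the Url class, the method name averagePagerank, and the minGroupSize parameter (standing in for the 10^6 threshold) are illustrative assumptions and not part of Pig.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class CategoryPagerankSketch {
    // Hypothetical row type standing in for one tuple of the urls dataset.
    static final class Url {
        final String category;
        final double pagerank;

        Url(String category, double pagerank) {
            this.category = category;
            this.pagerank = pagerank;
        }
    }

    // Mirrors the Pig steps one by one; minGroupSize replaces the 10^6 threshold.
    static Map<String, Double> averagePagerank(List<Url> urls, long minGroupSize) {
        // Step 1: FILTER urls BY pagerank > 0.2
        List<Url> goodUrls = urls.stream()
                .filter(u -> u.pagerank > 0.2)
                .collect(Collectors.toList());
        // Step 2: GROUP good_urls BY category
        Map<String, List<Url>> groups = goodUrls.stream()
                .collect(Collectors.groupingBy(u -> u.category));
        // Steps 3 and 4: keep the big groups, then average each of them
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Url>> e : groups.entrySet()) {
            if (e.getValue().size() > minGroupSize) {
                double avg = e.getValue().stream()
                        .mapToDouble(u -> u.pagerank)
                        .average()
                        .orElse(0.0);
                result.put(e.getKey(), avg);
            }
        }
        return result;
    }
}
```

Each intermediate variable corresponds to one named Pig relation, which is the point of the step-by-step style: every line is a single transformation whose result can be inspected.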
14. Pig Interface
A Pig Latin program is a sequence of steps, reminiscent of traditional
programming languages
In contrast, SQL consists of declarative constraints that collectively
define the result
Each step carries out a single data transformation
Writing a Pig Latin program is similar to specifying a query execution
plan, i.e., a dataflow graph
Due to this dataflow model, it is easier for programmers to understand
and control how their data processing task is executed
18. Features
Support for a fully nested data model with complex data types
Extensive support for user-defined functions
Ability to operate over plain, schema-less input files
Open-source Apache project
22. Interoperability
Queries can be performed atop raw data dumps directly
The user needs to provide a function to parse the content of the file into
tuples
Similarly, the user also needs to provide a function to convert tuples
into a byte sequence
Datasets can be spread across diverse storage systems and shared with
other applications
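The two user-supplied functions described above (bytes to tuples on load, tuples to bytes on store) can be sketched in plain Java. The class and method names are hypothetical, and the tab-separated, one-tuple-per-line layout is an assumption chosen for illustration; a real Pig load or store function plugs into Pig's own interfaces instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TabSeparatedStorageSketch {
    // Load side: turn raw file bytes into tuples, one tuple per line.
    static List<List<String>> parse(byte[] raw) {
        List<List<String>> tuples = new ArrayList<>();
        String text = new String(raw, StandardCharsets.UTF_8);
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                // The -1 limit keeps trailing empty fields instead of dropping them
                tuples.add(Arrays.asList(line.split("\t", -1)));
            }
        }
        return tuples;
    }

    // Store side: turn tuples back into a byte sequence.
    static byte[] serialize(List<List<String>> tuples) {
        StringBuilder out = new StringBuilder();
        for (List<String> tuple : tuples) {
            out.append(String.join("\t", tuple)).append('\n');
        }
        return out.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the schema lives in the parsing function rather than in a catalog, the same raw dump can be read by Pig and by any other tool that understands the file layout.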
27. UDFs as first-class citizens
A significant part of large-scale data analysis relies on custom
processing
For instance, the user may be interested in figuring out whether a
particular website is spam
All aspects of processing in Pig Latin including grouping, filtering,
joining, and per-tuple processing can be customized via UDFs
UDFs can take non-atomic parameters (e.g., a bag) as input and can
produce non-atomic values as output
UDFs are defined in Java
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
    category, top10(urls);
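The logic behind a top10-style UDF can be sketched in plain Java as a function from a bag to a smaller bag. UrlTuple and topN are hypothetical names, and ranking by pagerank is an assumption about what "top" means here; a real Pig UDF would be written against Pig's Java UDF interfaces rather than as a free-standing static method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TopUrlsSketch {
    // Hypothetical tuple type: (url, pagerank).
    static final class UrlTuple {
        final String url;
        final double pagerank;

        UrlTuple(String url, double pagerank) {
            this.url = url;
            this.pagerank = pagerank;
        }
    }

    // Takes a whole bag (non-atomic input) and returns a bag (non-atomic output)
    // holding the n tuples with the highest pagerank, best first.
    static List<UrlTuple> topN(List<UrlTuple> bag, int n) {
        List<UrlTuple> sorted = new ArrayList<>(bag); // leave the input bag untouched
        sorted.sort(Comparator.comparingDouble((UrlTuple t) -> t.pagerank).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```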
31. Data Model
Pig has four data types:
1 Atom: A single atomic value such as a string or an integer
2 Tuple: A sequence of values, each with possibly a different data type
3 Bag: A collection of tuples
4 Map: A collection of data items, each with an associated key through which it can be looked up
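To make the nesting concrete, the four types can be approximated with ordinary Java collections. This mapping (Object for atoms, List for tuples and bags, Map for maps) and the sample values are illustrative assumptions, not how Pig represents data internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class PigDataModelSketch {
    // Builds one nested tuple of the shape (atom, bag of tuples, map).
    static List<Object> exampleTuple() {
        Object atom = "alice";                                 // Atom
        List<Object> t1 = Arrays.<Object>asList("lakers", 1);  // Tuple with mixed field types
        List<Object> t2 = Arrays.<Object>asList("iPod", 2);    // Tuple
        List<List<Object>> bag = Arrays.asList(t1, t2);        // Bag: a collection of tuples
        Map<String, Object> map =
                Collections.<String, Object>singletonMap("age", 20); // Map: key -> data item
        // A tuple's fields may themselves be bags or maps: the model is fully nested
        return Arrays.<Object>asList(atom, bag, map);
    }
}
```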
37. Commands
LOAD: Load and deserialize an input file
FOREACH: Process each tuple of a dataset
FILTER: Filter a dataset based on some condition or UDF
COGROUP: Group together tuples which are related in some way from
one or more datasets
GROUP: Group together tuples which are related in some way from
one dataset
STORE: Materialize the output of a Pig Latin expression to a file
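The difference between GROUP and COGROUP is easiest to see in the output shape: COGROUP yields, for every key, one bag per input dataset. The following single-machine Java sketch assumes tuples are String arrays keyed on their first field; the Group class and cogroup method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CogroupSketch {
    // One output group: a separate bag for each input dataset.
    static final class Group {
        final List<String[]> left = new ArrayList<>();
        final List<String[]> right = new ArrayList<>();
    }

    // Groups both inputs by their first field, keeping the two bags apart.
    // A key present in only one input still gets a group, with one empty bag.
    static Map<String, Group> cogroup(List<String[]> left, List<String[]> right) {
        Map<String, Group> groups = new TreeMap<>();
        for (String[] t : left) {
            groups.computeIfAbsent(t[0], k -> new Group()).left.add(t);
        }
        for (String[] t : right) {
            groups.computeIfAbsent(t[0], k -> new Group()).right.add(t);
        }
        return groups;
    }
}
```

GROUP is the special case with a single input, and a join can be obtained by taking the cross product of the two bags inside each group.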
41. Other Commands
UNION: Return the union of two or more bags
CROSS: Return the cross product of two or more bags
ORDER: Order a bag by a specified field
DISTINCT: Eliminate duplicate tuples in a bag
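The semantics of these four commands can be illustrated on in-memory lists. In this sketch a bag is a List of String atoms and all method names are hypothetical; note that union keeps duplicates, since bags are not sets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BagOperatorsSketch {
    // UNION: concatenates the bags and keeps duplicates.
    static List<String> union(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }

    // CROSS: every element of a paired with every element of b.
    static List<List<String>> cross(List<String> a, List<String> b) {
        List<List<String>> out = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                out.add(Arrays.asList(x, y));
            }
        }
        return out;
    }

    // ORDER: sorts the bag (here by natural string order).
    static List<String> order(List<String> bag) {
        return bag.stream().sorted().collect(Collectors.toList());
    }

    // DISTINCT: drops duplicate elements, keeping first occurrences.
    static List<String> distinct(List<String> bag) {
        return bag.stream().distinct().collect(Collectors.toList());
    }
}
```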
42. MapReduce in Pig Latin
-- map and reduce are user-supplied UDFs; $0 is the first field (the key)
map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
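The same three-step shape can be traced in plain Java with word counting as the concrete map and reduce functions. Everything here is an assumption made for illustration: the class name, the choice of word count, and the in-memory TreeMap that stands in for the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceAsDataflowSketch {
    // map(*): emits zero or more (key, value) pairs per input record.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // reduce(*): folds all values that share a key into one result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> input) {
        // FOREACH input GENERATE FLATTEN(map(*))
        List<Map.Entry<String, Integer>> mapResult = new ArrayList<>();
        for (String line : input) {
            mapResult.addAll(map(line));
        }
        // GROUP map_result BY $0
        Map<String, List<Integer>> keyGroups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapResult) {
            keyGroups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // FOREACH key_groups GENERATE reduce(*)
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : keyGroups.entrySet()) {
            output.put(g.getKey(), reduce(g.getValue()));
        }
        return output;
    }
}
```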
43. Outline
1 Pig Latin
2 Cascading
49. Introduction
Many applications require a chain of MapReduce jobs
Cascading allows the creation of processing pipelines using languages
that run atop the JVM
Source-pipe-sink paradigm
Data comes from sources
Pipes perform data analysis
Results are written to sinks
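Stripped of Hadoop, the source-pipe-sink paradigm is just function composition. The sketch below uses only java.util.function types; runFlow and its parameter names are hypothetical and are not Cascading APIs.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

final class SourcePipeSinkSketch {
    // Pulls records from the source, pushes them through the pipe,
    // and hands the result to the sink.
    static <A, B> void runFlow(Supplier<List<A>> source,
                               Function<List<A>, List<B>> pipe,
                               Consumer<List<B>> sink) {
        List<A> input = source.get();      // data comes from sources
        List<B> result = pipe.apply(input); // pipes perform the analysis
        sink.accept(result);                // results are written to sinks
    }
}
```

Because the pipe knows nothing about where its data comes from or goes to, the same processing logic can be rebound to different sources and sinks.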
56. Terminology
Pipe: data stream
Tuple: data record
Branch: chain of pipes
Pipe Assembly: set of pipe branches
Tap: data source or sink
Flow: pipe assembly bound to source and sink taps
Cascade: a collection of flows, in which one flow depends on the output
of another
63. Pipes
Base class: Pipe
Each: Analyze, transform, or filter individual tuples
Merge: Combine streams with same fields into one
GroupBy: Group tuples based on common values in a specified field
CoGroup: Join streams (similar to SQL join)
Every: Aggregate tuples
HashJoin: Similar to CoGroup but more efficient if one stream can
be held in memory
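Why HashJoin is cheaper when one stream fits in memory can be seen from the classic hash-join technique: index the small side once, then stream the large side past it without grouping or sorting. This plain-Java sketch shows the technique only, under the assumption that tuples are two-field String arrays joined on the first field; it is not Cascading's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashJoinSketch {
    // Inner join on the first field; the small side is held entirely in memory.
    static List<String[]> join(List<String[]> large, List<String[]> small) {
        // Build phase: index the small stream by join key
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] s : small) {
            index.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
        }
        // Probe phase: the large stream is read once and never buffered
        List<String[]> joined = new ArrayList<>();
        for (String[] l : large) {
            List<String[]> matches = index.getOrDefault(l[0], Collections.<String[]>emptyList());
            for (String[] s : matches) {
                joined.add(new String[] { l[0], l[1], s[1] });
            }
        }
        return joined;
    }
}
```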
68. Pipe Assemblies
Define the processing of tuple streams
Tuples are read from and written to taps
Processing includes filtering, transforming, organizing, and calculating
Can use multiple taps
May also define splits, merges, and joins to manipulate tuple streams
70. Example: Pipe Assembly (2)
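// SomeFunction, SomeFilter, and SomeAggregator are placeholders for user operations
// Left-hand branch: transform each tuple, then filter
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );

// Right-hand branch: transform each tuple
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );

// Join the two branches, aggregate, regroup, and aggregate again
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );
join = new GroupBy( join );
join = new Every( join, new SomeAggregator() );

join = new Each( join, new SomeFunction() );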
73. Data Processing
Operation: Accept an input tuple, process it, and output zero or more
tuples
Tuple: Array of fields
Field: Defines a data type, such as string, integer, etc.
77. Taps
Data flows in and out of taps
Represent data sources and sinks, such as local files, distributed FS
files, etc.
Each tap is associated with a scheme that describes the data, such as
TextLine, TextDelimited, etc.
Sinks have modes such as SinkMode.KEEP,
SinkMode.REPLACE, and SinkMode.UPDATE
79. Flows
Represent entire pipelines
A pipeline reads data from a source, processes it, and then writes it to
a sink
80. Example: Flow
// Pipe assembly: two branches joined by a CoGroup (operations are placeholders)
Pipe lhs = new Pipe( "lhs" );
lhs = new Each( lhs, new SomeFunction() );
lhs = new Each( lhs, new SomeFilter() );
Pipe rhs = new Pipe( "rhs" );
rhs = new Each( rhs, new SomeFunction() );
Pipe join = new CoGroup( lhs, rhs );
join = new Every( join, new SomeAggregator() );

// Taps: one source per branch and a single sink for the joined tail
Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
Tap sink = new Hfs( new TextLine(), "output" );
// Bind the assembly to its taps and plan the flow
FlowDef flowDef = new FlowDef()
    .setName( "flow-name" )
    .addSource( rhs, rhsSource )
    .addSource( lhs, lhsSource )
    .addTailSink( join, sink );
Flow flow = new HadoopFlowConnector().connect( flowDef );
88. Operations
Operations manipulate data
Four kinds:
1 Function
2 Filter
3 Aggregator
4 Buffer
Take an input tuple and emit zero or more tuples
A Filter returns a boolean that decides whether the tuple is discarded
Must be wrapped in either an Each or an Every pipe
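The division of labour between operations and the pipes that wrap them can be sketched with plain Java interfaces. These interfaces are deliberately simplified stand-ins operating on String tuples, not the Cascading types of the same name; the isRemove naming follows Cascading's convention that a Filter returning true drops the tuple. Buffer is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class OperationsSketch {
    // Simplified stand-ins for three of the four operation kinds.
    interface Function { List<String> operate(String tuple); }  // zero or more outputs
    interface Filter { boolean isRemove(String tuple); }         // true drops the tuple
    interface Aggregator { void aggregate(String tuple); int complete(); }

    // Plays the role of an Each pipe: applies a Function, then a Filter, per tuple.
    static List<String> each(List<String> in, Function fn, Filter filter) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            for (String r : fn.operate(t)) {
                if (!filter.isRemove(r)) {
                    out.add(r);
                }
            }
        }
        return out;
    }

    // Plays the role of an Every pipe: feeds one group of tuples to an Aggregator.
    static int every(List<String> group, Aggregator agg) {
        for (String t : group) {
            agg.aggregate(t);
        }
        return agg.complete();
    }
}
```

Functions and Filters see one tuple at a time and so belong in Each, while Aggregators need a whole group and so belong in Every, after a GroupBy or CoGroup.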
89. Example: Wordcount
// Source tap: each input line arrives as a single "line" field
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
// Sink tap: (word, count) pairs, replacing any previous output
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
Pipe assembly = new Pipe( "wordcount" );
// RegexGenerator emits one tuple per regex match, so match runs of non-whitespace
String regex = "\\S+";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
assembly = new GroupBy( assembly, new Fields( "word" ) );
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// FlowConnector is abstract in Cascading 2.x, so use the Hadoop implementation
FlowConnector flowConnector = new HadoopFlowConnector();
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
flow.complete();
90. References
1 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar,
and Andrew Tomkins. 2008. Pig latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD
international conference on Management of data (SIGMOD ’08). ACM,
New York, NY, USA, 1099-1110.
2 Cascading 2.1 User Guide: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf