Topic 8: Enhancements and Alternative Architectures
1. 8: Enhancements and Alternative Architectures
Zubair Nabi
zubair.nabi@itu.edu.pk
April 19, 2013
Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 1 / 45
2. Outline
1 Major shortcomings
2 Pig Latin
3 Dryad
4 CIEL
5 Naiad
6. Major shortcomings: focusing on some
Low-level programming interface
No direct support for iterative and recursive applications
No direct support for incremental computations
11. Introduction
MapReduce is too low-level and rigid, which leads to a lot of custom user code
Pig Latin is a dataflow language atop MapReduce, designed at Yahoo!
It aims for the sweet spot between the declarative style of SQL and the low-level, procedural interface of MapReduce
The Pig system compiles Pig Latin programs into physical plans that are executed atop Hadoop
12. SQL query to find the average pagerank for each large category of URLs
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
13. Equivalent Pig query
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE
    category, AVG(good_urls.pagerank);
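To make the step-by-step dataflow concrete, here is a minimal in-memory Python sketch of the same four steps. It is only an analogy: the function name, the `(url, category, pagerank)` record layout, and the sample data are assumptions made for illustration, and nothing here reflects how Pig actually executes the query on Hadoop.

```python
from collections import defaultdict


def average_pagerank_of_large_categories(urls, min_pagerank=0.2, min_count=10**6):
    """urls is an iterable of (url, category, pagerank) tuples (assumed layout)."""
    # FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u[2] > min_pagerank]

    # GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # FILTER groups BY COUNT(good_urls) > threshold
    big_groups = {c: ranks for c, ranks in groups.items() if len(ranks) > min_count}

    # FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}


# Tiny hypothetical dataset; the threshold is lowered so that a group survives.
sample = [("a.com", "news", 0.75), ("b.com", "news", 0.25),
          ("c.com", "news", 0.1), ("d.com", "blogs", 0.8)]
result = average_pagerank_of_large_categories(sample, min_count=1)
```

Each intermediate variable corresponds to one named relation in the Pig Latin program, which is what makes the program read like an explicit dataflow rather than a single declarative constraint.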
18. Pig Interface
A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages
In contrast, SQL consists of declarative constraints that collectively define the result
Each step carries out a single data transformation
Writing a Pig Latin program is similar to specifying a query execution plan or a dataflow graph
Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed
22. Features
Support for a fully nested data model with complex data types
Extensive support for user-defined functions
Ability to operate over plain, schema-less input files
Open-source Apache project
26. Interoperability
Queries can be run directly atop raw data dumps
The user needs to provide a function to parse the content of a file into tuples
Similarly, the user needs to provide a function to convert tuples into a byte sequence
Datasets can be spread across diverse data storage sources and applications
31. UDFs as first-class citizens
A significant part of large-scale data analysis relies on custom processing
For instance, the user may be interested in figuring out whether a particular website is spam
All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs
UDFs can take non-atomic parameters as input and produce non-atomic values as output
UDFs are written in Java
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
    category, top10(urls);
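The idea of a UDF that consumes a whole bag and returns a bag can be sketched in Python as follows. This is an illustration only: real Pig UDFs are written in Java as stated above, and `top_n`, `group_by`, `foreach_generate`, and the record layout are hypothetical names chosen for this sketch.

```python
import heapq
from collections import defaultdict


def top_n(bag, n=10):
    """Hypothetical UDF: takes a bag (list of tuples), returns a bag of the
    n tuples with the highest pagerank (third field)."""
    return heapq.nlargest(n, bag, key=lambda t: t[2])


def group_by(records, key_index):
    """Rough stand-in for GROUP ... BY: maps each key to a bag of tuples."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    return groups


def foreach_generate(groups, udf):
    """Rough stand-in for FOREACH ... GENERATE key, udf(bag)."""
    return [(key, udf(bag)) for key, bag in groups.items()]


urls = [("a.com", "news", 0.9), ("b.com", "news", 0.4), ("c.com", "blogs", 0.7)]
output = foreach_generate(group_by(urls, 1), lambda bag: top_n(bag, n=1))
```

The point of the sketch is that the UDF's input and output are both non-atomic values (bags), which plain SQL aggregate functions do not allow.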
35. Data Model
Pig has four data types:
1 Atom: A single atomic value such as a string or an integer
2 Tuple: A sequence of values, each with possibly a different data type
3 Bag: A collection of tuples, possibly with duplicates
4 Map: A collection of data items, each with an associated key through which it can be looked up
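As a rough analogy, the four types can be mirrored with built-in Python values. The mapping below (string or number for an atom, `tuple` for a tuple, `list` for a bag, `dict` for a map) and the `describe` helper are assumptions made for illustration, not part of Pig.

```python
# One nested record: a tuple holding an atom, a bag of tuples, and a map.
atom = "alice"
bag = [("lakers", 1), ("iPod", 2)]   # a bag may contain duplicate tuples
mapping = {"age": 20}                # each data item is looked up by its key
record = (atom, bag, mapping)


def describe(value):
    """Name the Pig-style type that a Python value stands in for (hypothetical helper)."""
    if isinstance(value, tuple):
        return "tuple"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        return "map"
    return "atom"


field_types = [describe(field) for field in record]
```

The nesting (a bag and a map inside a tuple) is what "fully nested data model" refers to on the Features slide.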
40. Commands
LOAD: Load and deserialize an input file
FOREACH: Process each tuple of a dataset
FILTER: Filter a dataset based on some condition or UDF
COGROUP: Group together related tuples from one or more datasets
STORE: Materialize the output of a Pig Latin expression to a file
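COGROUP is the least familiar of these commands, so a small Python sketch of its semantics for two datasets may help. The function signature and the sample `results` and `revenue` datasets are assumptions for illustration; the sketch only shows the shape of the output (one bag per input dataset for every key), not Pig's implementation.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """For every key, return a pair of bags: matching tuples from left and from right."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return dict(groups)


# Hypothetical datasets, both keyed on a query string in the first field.
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", "top", 50), ("kings", "side", 20), ("lakers", "side", 20)]
grouped = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
```

Unlike a join, the output keeps the two bags separate for each key, so a later FOREACH step can decide how to combine them.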
43. Introduction
MapReduce is strictly two-stage, with a single input set and a single output set
This makes it an awkward architecture for multi-stage computations
48. Dryad
Dryad allows computations that form a Directed Acyclic Graph (DAG)
Each vertex in the graph is a computation, while each edge is a communication channel
Each computation can take multiple inputs and produce multiple outputs
Developed by Microsoft Research
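The DAG model can be sketched in a few lines of Python: vertices are functions, edges carry one vertex's output to another, and a vertex runs once all of its upstream vertices have finished. This is a single-process illustration under assumed names (`run_dag`, the vertex names, the sample inputs); it is not Dryad's API, and it ignores channels, placement, and parallelism. It relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, inputs):
    """vertices: name -> function taking a list of upstream outputs.
    edges: list of (source, destination) pairs.
    inputs: name -> job input for vertices that have no predecessors."""
    predecessors = {name: [] for name in vertices}
    for src, dst in edges:
        predecessors[dst].append(src)

    outputs = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        if predecessors[name]:
            args = [outputs[p] for p in predecessors[name]]
        else:
            args = [inputs[name]]
        outputs[name] = vertices[name](args)
    return outputs


vertices = {
    "read_a": lambda ins: sorted(ins[0]),
    "read_b": lambda ins: sorted(ins[0]),
    "merge": lambda ins: sorted(ins[0] + ins[1]),   # two inputs, one output
}
edges = [("read_a", "merge"), ("read_b", "merge")]
outputs = run_dag(vertices, edges, {"read_a": [3, 1], "read_b": [2]})
```

The `merge` vertex shows what a two-stage MapReduce job cannot express directly: a computation fed by several distinct upstream stages.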
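50. Dryad: Architecture (2)
[Figure: Dryad architecture. The job manager sends the job schedule over the control plane to the cluster, which consists of a name server (NS) and process daemons (PD) that run vertices (V); vertices exchange data over the data plane via files, TCP, FIFOs, and the network.]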
51. Dryad: Job
Job = Directed Acyclic Graph
[Figure: a job graph of processing vertices connected by channels (file, pipe, shared memory), with the job's inputs and outputs marked.]
54. Channel types and job inputs and outputs
Channel types: File, TCP pipe, Shared-memory FIFO
Encapsulation: Convert a graph into a single vertex and run it within a single process
Job inputs and outputs: Can be logically concatenated
59. Vertices
Programming in C++/C#
Runtime library sets up and executes vertices
Map and Reduce classes
Process wrapper: To support legacy executables
Supports event-based programming
60. Dryad: Example
[Figure: Dryad communication graph for the query below; stages are labelled D, M, S, Y, H, X, U, and N, with stage widths n and 4n.]
select distinct p.objID
from photoObjAll p
join neighbors n -- call this join "X"
on p.objID = n.objID
and n.objID < n.neighborObjID
and p.mode = 1
join photoObjAll l -- call this join "Y"
on l.objid = n.neighborObjID
and l.mode = 1
and abs((p.u-p.g)-(l.u-l.g))<0.05
and abs((p.g-p.r)-(l.g-l.r))<0.05
and abs((p.r-p.i)-(l.r-l.i))<0.05
and abs((p.i-p.z)-(l.i-l.z))<0.05
63. Operations
Create vertices using a C++ base class
Add edges
Merge two graphs
70. Job Execution
A vertex can specify a “hard constraint” or a “preference” for where it runs
The job manager runs a greedy scheduling algorithm, assuming it is the only job running on the cluster
Simple graph visualizer: Shows the state of each vertex and channel for small jobs
Web-based interface: Regularly updated statistics
Fault tolerance: Vertices are deterministic, so a failed vertex is simply re-scheduled
Speculative execution within stages
72. Run-time Graph Refinement
Aggregation tree: Distributed combiner
Applies to associative and commutative computations
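The reason the computation must be associative and commutative is that partial results get combined in whatever grouping the tree imposes. A minimal Python sketch, assuming a hypothetical `aggregation_tree` helper and a fixed fan-in (Dryad itself builds the tree at run time from the placement of the inputs, which this sketch does not model):

```python
from functools import reduce
import operator


def aggregation_tree(partials, combine, fan_in=2):
    """Combine partial results level by level, fan_in inputs per internal node."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    if not level:
        raise ValueError("need at least one partial result")
    while len(level) > 1:
        # Each slice plays the role of one intermediate combiner vertex.
        level = [reduce(combine, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


total = aggregation_tree([3, 1, 4, 1, 5, 9, 2, 6], operator.add)
```

With an associative and commutative `combine`, any tree shape gives the same answer as a single flat reduction, so the runtime is free to pick the shape that moves the least data.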
76. Introduction
MapReduce and Dryad are not amenable to iterative and recursive applications
Most machine learning and data mining applications are iterative in nature
These applications require data-dependent control flow
That is, the ability to spawn new tasks on the fly based on the results of previous computations
80. CIEL
1 Data-centric execution engine from Cambridge: the goal of a CIEL job is to produce one or more output objects
2 A reference to an object can be obtained without materializing its full contents, reminiscent of C pointers
A reference to an object whose contents are not yet available is a future reference; otherwise it is a concrete reference
3 A job makes progress by executing tasks
85. Tasks
1 Each task depends on one or more objects via references, and it starts executing once all of its references become concrete
2 The purpose of each task is to produce objects
  1 A task can publish one or more objects by creating a concrete reference for them
  2 A task can also spawn new tasks and delegate the creation of output to them
3 The dynamic task graph stores the relation between tasks and objects
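The rules above can be sketched as a toy single-process scheduler in Python. All names (`Ref`, `Task`, `run_job`, `check`) are hypothetical and chosen for this illustration; the sketch models only the future/concrete distinction, the "run when all dependencies are concrete" rule, and delegation by spawning, not CIEL's distributed master and workers.

```python
class Ref:
    """Reference to an object: a future until a value is published, then concrete."""
    def __init__(self, name):
        self.name = name
        self.concrete = False
        self.value = None

    def publish(self, value):
        self.value = value
        self.concrete = True


class Task:
    """A function, the references it depends on, and the reference it must produce."""
    def __init__(self, fn, deps, output):
        self.fn, self.deps, self.output = fn, deps, output


def run_job(tasks):
    pending = list(tasks)
    while pending:
        # A task is runnable once all of its dependencies are concrete.
        runnable = [t for t in pending if all(d.concrete for d in t.deps)]
        if not runnable:
            raise RuntimeError("no runnable task: unresolved dependencies")
        for task in runnable:
            pending.remove(task)
            result = task.fn(*[d.value for d in task.deps])
            if isinstance(result, Task):
                pending.append(result)        # spawn: the child produces the output
            else:
                task.output.publish(result)   # publish: the output becomes concrete


data, total, result = Ref("data"), Ref("total"), Ref("result")
data.publish([1, 2, 3, 4])


def check(total_value):
    # Data-dependent control flow: only spawn more work if the total is large.
    if total_value > 5:
        return Task(lambda t: t // 2, [total], result)
    return total_value


run_job([Task(sum, [data], total), Task(check, [total], result)])
```

The task list handed to `run_job` does not contain the halving task; it is added to the graph only after `check` has inspected a computed value, which is the dynamic part of the dynamic task graph.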
87. CIEL: Dynamic Task Graph
90. Executors
1 CIEL decouples tasks from the underlying framework through the concept of executors
2 Each programming language has a corresponding executor
3 As a result, a task can be written in any supported language, such as Java, Python, or shell script, as well as CIEL's native Skywriting
93. Skywriting
1 Scripting language for expressing task-level parallelism atop CIEL
2 Contains data-dependent control flow constructs, such as while loops and recursive functions
3 Ability to spawn new tasks in the middle of execution
98. Skywriting constructs
1 ref(url): Returns a reference to the object (local or remote) located at url
2 spawn(f, [arg, ...]): Spawns a task to evaluate f(arg, ...)
3 exec(executor, args, n): Synchronously runs the given executor on args; n is the number of outputs
4 spawn_exec(executor, args, n): Spawns a new task to run the given executor on args
5 * (unary operator): De-references the given reference
99. Example: Skywriting
function process_chunk(chunk, prev_result) {
    return spawn_exec(...);
}
function is_converged(curr_result, prev_result) {
    return spawn_exec(...)[0];
}
input_data = [ref("ciel://host137/chunk0"),
              ref("ciel://host223/chunk1"), ...];
curr = ...; // Initial guess at the result.
do {
    prev = curr;
    curr = [];
    for (chunk in input_data) {
        curr += process_chunk(chunk, prev);
    }
} while (!*is_converged(curr, prev));
return curr;
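The control structure of this script can be mirrored in plain Python. The sketch below fills the elided pieces with an assumed workload (one-dimensional k-means over chunks of numbers), so `process_chunk`, `combine`, `is_converged`, and the data are all illustrative choices rather than what the original script computes. Everything runs sequentially here, whereas in CIEL each `process_chunk` call would be a spawned task.

```python
def process_chunk(chunk, centroids):
    """Assign each point to its nearest centroid; return per-centroid [sum, count]."""
    partial = [[0.0, 0] for _ in centroids]
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def combine(partials, prev):
    """Merge the per-chunk partial sums into new centroid positions."""
    new = []
    for i, old in enumerate(prev):
        total = sum(p[i][0] for p in partials)
        count = sum(p[i][1] for p in partials)
        new.append(total / count if count else old)
    return new


def is_converged(curr, prev, tol=1e-9):
    return all(abs(c - p) <= tol for c, p in zip(curr, prev))


def iterate(chunks, initial):
    curr = initial
    while True:                      # number of iterations depends on the data
        prev = curr
        partials = [process_chunk(chunk, prev) for chunk in chunks]
        curr = combine(partials, prev)
        if is_converged(curr, prev):
            return curr


centroids = iterate([[1.0, 2.0], [3.0, 10.0], [11.0, 12.0]], [0.0, 5.0])
```

The loop's exit condition is itself the result of a computation over the data, which is exactly the data-dependent control flow that a static MapReduce or Dryad job graph cannot express.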
103. Introduction
A class of applications requires support for both iterative and incremental computation
For instance, maintaining in real time the strongly connected component structure of the graph induced by Twitter mentions
MapReduce itself supports neither iterative nor incremental computation
108. Naiad
Data-intensive computing framework from Microsoft Research that supports both incremental and iterative computation by leveraging differential computation
Differential computation adds two novel features to the framework:
1 The state of the computation varies according to a partially ordered set of versions rather than a totally ordered sequence
2 The set of updates required to reconstruct the state at any version is retained in an indexed data structure
The state and the updates to that state are associated with a multi-dimensional logical timestamp (called a version)
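A small Python sketch can illustrate the two ideas together: differences are stored per version, and the state at a version is rebuilt by summing every difference whose version is less than or equal to it in the partial order. The two-dimensional `(epoch, iteration)` version, the `DifferenceTrace` class, and the sample updates are assumptions made for this illustration, not Naiad's data structures.

```python
from collections import Counter


def leq(u, v):
    """Product partial order on (epoch, iteration) versions."""
    return u[0] <= v[0] and u[1] <= v[1]


class DifferenceTrace:
    """Keeps the differences indexed by version instead of the full state."""
    def __init__(self):
        self.diffs = {}   # version -> Counter mapping record -> signed count

    def add(self, version, record, delta):
        self.diffs.setdefault(version, Counter())[record] += delta

    def state_at(self, version):
        """Sum all differences at versions <= the requested version."""
        state = Counter()
        for u, diff in self.diffs.items():
            if leq(u, version):
                for record, delta in diff.items():
                    state[record] += delta
        return {r: n for r, n in state.items() if n != 0}


trace = DifferenceTrace()
trace.add((0, 0), "a", +1)   # first input epoch, before any iteration
trace.add((0, 1), "b", +1)   # produced by iteration 1 of epoch 0
trace.add((1, 0), "a", -1)   # second input epoch retracts "a" ...
trace.add((1, 0), "c", +1)   # ... and adds "c"
```

Versions `(0, 1)` and `(1, 0)` are incomparable, so neither update is folded into the other's state; only a version such as `(1, 1)` that dominates both sees the two of them. This is what a total order on versions could not express.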
112. Programming environment
Declarative query language based on the .NET Language Integrated Query (LINQ)
LINQ extends C# with declarative operators such as Select, Where, Join, and GroupBy
Naiad adds two more operators:
1 FixedPoint takes a source collection and a function from collections to collections of the same type, and applies the function repeatedly until the collection reaches a fixed point
2 PrioritizedFP additionally takes a priority function that is applied to every record in the source collection
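The semantics of a fixed-point operator can be sketched in Python as repeated application of a function until the collection stops changing. The names `fixed_point` and `expand` and the reachability example are assumptions for illustration; Naiad's operator is a C# LINQ extension and updates its result incrementally, whereas this sketch recomputes the whole collection on every round.

```python
def fixed_point(source, step):
    """Apply step to the collection until it no longer changes."""
    current = frozenset(source)
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt


# Hypothetical example: nodes reachable from node 1 in a small directed graph.
edges = {(1, 2), (2, 3), (3, 4), (5, 6)}


def expand(reached):
    """One round: keep what is reached and add every direct successor."""
    return reached | {dst for src, dst in edges if src in reached}


reachable = fixed_point({1}, expand)
```

The function passed in maps a collection to a collection of the same type, matching the operator's description above; termination relies on the function eventually reaching a collection it maps to itself.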
115. Runtime
The Naiad runtime transforms declarative queries into a dataflow graph
The user program can insert differences into the input collections and register callbacks that are invoked when differences arrive at the output collection
The runtime transparently distributes the execution of the dataflow graph (similar to Dryad) across several cores and nodes