8: Enhancements and Alternative Architectures
Zubair Nabi
zubair.nabi@itu.edu.pk
April 19, 2013
Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 1 / 45
Outline
1 Major shortcomings
2 Pig Latin
3 Dryad
4 CIEL
5 Naiad
1 Major shortcomings
Focusing on some shortcomings
Low-level programming interface
Iterative and recursive applications
Incremental computations
2 Pig Latin
Introduction
MapReduce is too low-level and rigid and leads to lots of custom user
code
Pig Latin is a dataflow language atop MapReduce designed at
Yahoo!
Finds the sweet spot between the declarative style of SQL and the
low-level interface of MapReduce
The Pig system compiles Pig Latin queries into physical plans that are
executed atop Hadoop
SQL query to find average pagerank for each large category
of URLs
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000
Equivalent Pig query
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE
    category, AVG(good_urls.pagerank);
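The four steps of the Pig query correspond to a chain of ordinary collection operations. The Python sketch below mirrors that dataflow on an in-memory list. The sample records, the function name `average_pagerank`, and the `min_group_size` parameter (standing in for the one-million threshold) are assumptions made for illustration; this is not how Pig executes the query.

```python
from collections import defaultdict

def average_pagerank(urls, min_pagerank=0.2, min_group_size=2):
    # Step 1 (FILTER): keep URLs whose pagerank exceeds the cutoff.
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]
    # Step 2 (GROUP): collect the surviving records by category.
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)
    # Step 3 (FILTER): keep only categories with more than min_group_size records.
    big_groups = {c: g for c, g in groups.items() if len(g) > min_group_size}
    # Step 4 (FOREACH ... GENERATE): emit the average pagerank per category.
    return {c: sum(u["pagerank"] for u in g) / len(g) for c, g in big_groups.items()}

# Hypothetical sample data.
urls = [
    {"url": "a.example", "category": "news", "pagerank": 0.9},
    {"url": "b.example", "category": "news", "pagerank": 0.5},
    {"url": "c.example", "category": "news", "pagerank": 0.1},
    {"url": "d.example", "category": "blogs", "pagerank": 0.4},
]
result = average_pagerank(urls, min_group_size=1)
```

Each intermediate name matches one Pig Latin statement, which is the point of the step-by-step style: every line is a single transformation that can be inspected on its own.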
Pig Interface
A Pig Latin program is a sequence of steps, reminiscent of traditional
programming languages
In contrast, SQL consists of declarative constraints that collectively
define the result
Each step carries out a single data transformation
Writing a Pig Latin program is similar to specifying a query execution plan or a
dataflow graph
Due to this dataflow model, it is easier for programmers to understand
and control how their data processing task is executed
Features
Support for a fully nested data model with complex data types
Extensive support for user-defined functions
Ability to operate over plain, schema-less input files
Open-source Apache project
Interoperability
Queries can be performed atop raw data dumps directly
The user needs to provide a function to parse the content of the file into
tuples
Similarly, the user also needs to provide a function to convert tuples
into a byte sequence
Datasets can therefore be spread across diverse storage systems and
applications
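The two user-supplied functions can be pictured as a matched pair: one turns raw bytes into tuples, the other turns tuples back into bytes. The sketch below assumes a tab-separated dump with three fields (url, category, pagerank); the function names and the file layout are hypothetical, not part of Pig.

```python
def parse_line(raw: bytes) -> tuple:
    # Load side: turn one raw line of the dump into a (url, category, pagerank) tuple.
    url, category, pagerank = raw.decode("utf-8").rstrip("\n").split("\t")
    return (url, category, float(pagerank))

def serialize_tuple(record: tuple) -> bytes:
    # Store side: turn a tuple back into a byte sequence in the same layout.
    return ("\t".join(str(field) for field in record) + "\n").encode("utf-8")

raw = b"a.example\tnews\t0.9\n"
record = parse_line(raw)
round_trip = serialize_tuple(record)
```

Because the parser is just a function, the same query can run over any file layout for which such a pair can be written.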
UDFs as first-class citizens
A significant part of large-scale data analysis relies on custom
processing
For instance, the user may be interested in figuring out whether a
particular website is spam
All aspects of processing in Pig Latin including grouping, filtering,
joining, and per-tuple processing can be customized via UDFs
UDFs can take non-atomic parameters as input and produce non-atomic
values as output
UDFs are defined in Java
groups = GROUP urls BY category;
output = FOREACH groups GENERATE
    category, top10(urls);
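To make the non-atomic input and output concrete, here is a sketch of what a function like top10 might compute. It is written in Python purely for illustration (the slides state that Pig UDFs are defined in Java), and ranking by pagerank is an assumption, as is the tuple layout.

```python
import heapq

def top10(urls):
    # Input: a bag (list) of (url, category, pagerank) tuples belonging to one group.
    # Output: a bag holding the ten tuples with the highest pagerank.
    return heapq.nlargest(10, urls, key=lambda t: t[2])

# Hypothetical group of 25 URLs in one category.
bag = [(f"site{i}.example", "news", i / 100) for i in range(25)]
best = top10(bag)
```

The function receives a whole bag and returns a bag, so the result of the FOREACH is itself a nested value.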
Data Model
Pig has four data types:
1 Atom: A single atomic value such as a string or an integer
2 Tuple: A sequence of values, each with possibly a different data type
3 Bag: A collection of tuples
4 Map: A collection of data items, each with an associated key
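As a rough analogy, the four types line up with familiar Python values. The sample data is made up, and the mapping to Python types is only an illustration of the nesting, not Pig's internal representation.

```python
# Atom: a single value.
atom = "alice"
# Tuple: a sequence of fields whose types may differ.
tup = ("alice", 42)
# Bag: a collection of tuples (here a list, so duplicates can appear).
bag = [("lakers", 1), ("ipod", 2), ("lakers", 1)]
# Map: data items looked up by key; values may be any of the types above.
mapping = {"name": atom, "queries": bag}
# Full nesting: a tuple whose fields are an atom, a bag, and a map.
nested = ("alice", bag, mapping)
```

The last line is what "fully nested" means in practice: any field of a tuple can hold another complex value.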
Commands
LOAD: Load and deserialize an input file
FOREACH: Process each tuple of a dataset
FILTER: Filter a dataset based on some condition or UDF
COGROUP: Group together tuples which are related in some way from
one or more datasets
STORE: Materialize the output of a Pig Latin expression to a file
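COGROUP is the least familiar of these commands, so a small sketch may help. For every key it yields one bag of matching tuples per input dataset and keeps those bags separate. The `cogroup` function, its key-extractor arguments, and the sample datasets are all hypothetical.

```python
from collections import defaultdict

def cogroup(left, right, left_key, right_key):
    # For every key: (key, bag of matching left tuples, bag of matching right tuples).
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    # Sorted only to make the output order predictable in this sketch.
    return [(k, l, r) for k, (l, r) in sorted(groups.items())]

results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("kings", "side", 20), ("lakers", "side", 20)]
grouped = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
```

Grouping a single dataset is the special case with one input, and a join can be obtained by taking the cross product of the two bags for each key.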
3 Dryad
Introduction
MapReduce is strictly two-stage, with a single input set and a single output set
This makes it an awkward architecture for multi-stage computations
[Figure: MapReduce architecture]
Dryad
Dryad allows computations that can form a Directed Acyclic Graph
(DAG)
Each vertex in the graph is a computation, while each edge is a
communication channel
Each computation can take in multiple files as input and produce
multiple outputs
Developed by Microsoft Research
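The DAG model can be illustrated with a single-process sketch that runs each vertex once all of its upstream vertices have finished. The `run_dag` function and the vertex names are hypothetical, and the sketch ignores everything that makes Dryad a distributed system (placement, channels, failures); it only shows vertices with several inputs and several consumers.

```python
from graphlib import TopologicalSorter

def run_dag(vertices, edges, inputs):
    # vertices: name -> function taking a list of upstream outputs.
    # edges: (src, dst) pairs, each standing for a channel from src to dst.
    preds = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    outputs = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        upstream = [outputs[p] for p in preds[name]] or inputs.get(name, [])
        outputs[name] = vertices[name](upstream)
    return outputs

vertices = {
    "a": lambda ups: [x * 2 for x in ups[0]],
    "b": lambda ups: [x + 1 for x in ups[0]],
    "merge": lambda ups: sorted(ups[0] + ups[1]),  # two inputs
    "sum": lambda ups: sum(ups[0]),                # both consume merge's output
    "count": lambda ups: len(ups[0]),
}
edges = [("a", "merge"), ("b", "merge"), ("merge", "sum"), ("merge", "count")]
outputs = run_dag(vertices, edges, {"a": [[1, 2, 3]], "b": [[4, 5]]})
```

A two-stage MapReduce job is one particular shape of such a graph; here the merge vertex reads two inputs and feeds two consumers, which that shape cannot express directly.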
[Figure: Dryad architecture. The job manager sends the job schedule over the control plane to the cluster, whose labeled components are a name server (NS), daemons (PD), and vertices (V); vertices exchange data over the data plane through files, TCP, FIFOs, and the network]
Dryad: Job
Job = Directed Acyclic Graph
[Figure: a job graph in which inputs flow through processing vertices, connected by channels (file, pipe, shared memory), to outputs]
Channel types and job inputs and outputs
Channel types: File, TCP pipe, Shared-memory FIFO
Encapsulation: Convert a graph into a single vertex and run it within the
same process
Job inputs and outputs: Can be logically concatenated
Vertices
Programming in C++/C#
Runtime library sets up and executes vertices
Map and Reduce classes
Process wrapper: To support legacy executables
Supports event-based programming
Dryad: Example
[Figure: Dryad job graph for the query below, with stages labeled U, N, X, D, M, S, Y, and H and multiplicities n and 4n]
select distinct p.objID
from photoObjAll p
join neighbors n -- call this join "X"
on p.objID = n.objID
and n.objID < n.neighborObjID
and p.mode = 1
join photoObjAll l -- call this join "Y"
on l.objid = n.neighborObjID
and l.mode = 1
and abs((p.u-p.g)-(l.u-l.g))<0.05
and abs((p.g-p.r)-(l.g-l.r))<0.05
and abs((p.r-p.i)-(l.r-l.i))<0.05
and abs((p.i-p.z)-(l.i-l.z))<0.05
Operations
Create vertices using a C++ base class
Add edges
Merge two graphs
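These three operations are enough to assemble a job graph piece by piece. The sketch below uses a hypothetical Python `Graph` class rather than Dryad's C++ interface, and it treats vertices with the same name as the same vertex when merging, which is an assumption of this sketch.

```python
class Graph:
    def __init__(self, vertices=(), edges=()):
        self.vertices = set(vertices)
        self.edges = set(edges)

    def add_edge(self, src, dst):
        # Adding an edge implicitly registers both endpoints as vertices.
        self.vertices.update((src, dst))
        self.edges.add((src, dst))
        return self

    def merge(self, other):
        # Union of vertices and edges; same-named vertices are identified.
        return Graph(self.vertices | other.vertices, self.edges | other.edges)

g1 = Graph().add_edge("read", "parse")
g2 = Graph().add_edge("parse", "sort").add_edge("parse", "count")
job = g1.merge(g2)
```

Building small graphs and merging them keeps each stage of a job describable on its own.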
Job Execution
A vertex can specify a “hard constraint” or a “preference” for where it runs
The job manager runs a greedy scheduling algorithm that assumes its job is the only one running
Simple graph visualizer: Shows the state of each vertex and channel for small
jobs
Web-based interface: Regularly-updated statistics
Fault tolerance: Vertices are deterministic, so a failed vertex is simply re-scheduled
Speculative execution within stages
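Determinism is what makes re-scheduling a sufficient recovery strategy: running the same vertex again on the same inputs gives the same output. A minimal sketch, with a hypothetical `run_with_retries` helper and a simulated failure standing in for a lost machine:

```python
def run_with_retries(vertex, inputs, max_attempts=3):
    # Because the vertex is deterministic, any successful attempt yields the same output.
    for attempt in range(1, max_attempts + 1):
        try:
            return vertex(inputs), attempt
        except RuntimeError:
            continue  # re-schedule: simply try the vertex again
    raise RuntimeError("vertex failed on every attempt")

failures = {"left": 2}  # simulate two failures before a success

def flaky_sum(inputs):
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("simulated machine failure")
    return sum(inputs)

result, attempts = run_with_retries(flaky_sum, [1, 2, 3])
```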
Run-time Graph Refinement
Aggregation tree: Acts as a distributed combiner
Applies to associative and commutative computations
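An aggregation tree combines partial results in groups, level by level, instead of sending every value to one place. The sketch below shows the idea in a single process; the `tree_aggregate` name and the `fan_in` parameter are assumptions, and in a real cluster each group would typically correspond to values that are close together (for example on one machine).

```python
def tree_aggregate(values, combine, fan_in=2):
    # Repeatedly combine groups of fan_in partial results until one value remains.
    # Grouping inputs arbitrarily is only safe when combine is associative
    # and commutative.
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        raise ValueError("nothing to aggregate")
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            acc = group[0]
            for v in group[1:]:
                acc = combine(acc, v)
            next_level.append(acc)
        level = next_level
    return level[0]

total = tree_aggregate(range(1, 11), lambda a, b: a + b, fan_in=3)
largest = tree_aggregate([3, 9, 2, 7], max)
```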
4 CIEL
Introduction
MapReduce and Dryad are not amenable to iterative and recursive
applications
Most machine learning and data mining applications are iterative in
nature
These applications require data-dependent control flow: the ability to
spawn new tasks on the fly based on the results of previous
computations
CIEL
1 Data-centric execution engine from the University of Cambridge: the goal of a CIEL job
is to produce one or more output objects
2 A reference to an object can be obtained without materializing its full
contents, reminiscent of C pointers
A reference to an object whose contents have not yet been produced is a future
reference; otherwise it is a concrete reference
3 A job makes progress by executing tasks
Tasks
1 Each task depends on one or more objects via references
and starts executing once all of its references become concrete
2 The purpose of each task is to produce objects
   1 A task can publish one or more objects by creating a concrete
   reference for them
   2 A task can also spawn new tasks and delegate the creation of output to
   them
3 The dynamic task graph stores the relation between tasks and
objects
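The rules above can be sketched as a tiny single-process scheduler. An object name that is present in the store plays the role of a concrete reference, and a name that is absent plays the role of a future reference. The `run_job` function, the task dictionaries, and the `halve` example are hypothetical and far simpler than CIEL itself.

```python
def run_job(initial_tasks, store):
    # store maps object names to values: present = concrete, absent = future.
    pending = list(initial_tasks)
    while pending:
        runnable = [t for t in pending if all(d in store for d in t["deps"])]
        if not runnable:
            raise RuntimeError("no task has all of its dependencies concrete")
        task = runnable[0]
        pending.remove(task)
        spawned = []
        published = task["run"](*[store[d] for d in task["deps"]], spawn=spawned.append)
        store.update(published)   # publishing makes those references concrete
        pending.extend(spawned)   # spawned tasks extend the dynamic task graph
    return store

def halve(x, spawn):
    if x <= 1:
        return {"result": x}      # publish the job's output directly
    key = f"half_of_{x}"
    # Delegate production of "result" to a new task that waits for the object below.
    spawn({"deps": [key], "run": halve})
    return {key: x / 2}

store = run_job([{"deps": ["input"], "run": halve}], {"input": 8})
```

How many tasks run depends on the input value, which is exactly the data-dependent control flow that a fixed graph cannot express.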
[Figure: CIEL architecture]
[Figure: CIEL dynamic task graph]
Executors
1 CIEL maintains a decoupling between tasks and the underlying
framework through the concept of executors
2 Each programming language has a corresponding executor
3 As a result, a task can be written in any language that has an executor, such
as Java, Python, or shell script, as well as CIEL's native Skywriting
Skywriting
1 Scripting language for expressing task-level parallelism atop CIEL
2 Contains data-dependent control flow constructs such as while loops
and recursive functions
3 Ability to spawn new tasks in the middle of execution
Skywriting constructs
1 ref(url): Returns a reference to the object located at url, whether
local or remote
2 spawn(f, [arg, ...]): Spawns a task to evaluate f with the given arguments
3 exec(executor, args, n): Synchronously runs the given executor with
args, producing n outputs
4 spawn_exec(executor, args, n): Spawns a new task to run
the given executor
5 * (unary operator): Dereferences the given reference
Example: Skywriting
function process_chunk(chunk, prev_result) {
    return spawn_exec(...);
}
function is_converged(curr_result, prev_result) {
    return spawn_exec(...)[0];
}
input_data = [ref("ciel://host137/chunk0"),
              ref("ciel://host223/chunk1"), ...];
curr = ...; // Initial guess at the result.
do {
    prev = curr;
    curr = [];
    for (chunk in input_data) {
        curr += process_chunk(chunk, prev);
    }
} while (!*is_converged(curr, prev));
return curr;
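For readers more familiar with Python, the same loop shape can be sketched with a thread pool: submitting work stands in for spawning a task, the returned Future stands in for a reference, and calling `result()` stands in for dereferencing. The update rule and the convergence test below are made-up placeholders, since the Skywriting example elides them, and the convergence check runs locally here rather than as a spawned task.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk, prev_result):
    # Toy update rule (an assumption): pull the previous overall estimate
    # halfway toward this chunk's mean.
    estimate = sum(prev_result) / len(prev_result)
    return (estimate + sum(chunk) / len(chunk)) / 2

def is_converged(curr_result, prev_result, tol=1e-6):
    return max(abs(c - p) for c, p in zip(curr_result, prev_result)) < tol

def iterate(input_data, initial):
    curr = list(initial)
    with ThreadPoolExecutor() as pool:
        while True:
            prev = curr
            # submit() plays the role of spawning a task per chunk.
            futures = [pool.submit(process_chunk, chunk, prev) for chunk in input_data]
            # result() plays the role of dereferencing: it blocks until the value exists.
            curr = [f.result() for f in futures]
            if is_converged(curr, prev):
                return curr

result = iterate([[1.0, 3.0], [5.0, 7.0]], initial=[0.0, 0.0])
```

The number of iterations is not known when the job starts; it is decided by the data, one round at a time.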
5 Naiad
Introduction
A class of applications requires support for both iterative and
incremental computation
For instance, maintaining in real time the strongly connected
component structure of the graph induced by Twitter mentions
Currently, MapReduce itself has no support for either iterative or
incremental computation
Naiad
Data-intensive computing framework from Microsoft Research that
supports both incremental and iterative computation by leveraging
differential computation
Differential computation introduces two novel ideas:
1 The state of the computation varies according to a partially ordered set
of versions rather than a total ordering
2 The set of updates required to reconstruct the state at any version is
retained in an indexed data-structure
The state and updates to that state are associated with a
multi-dimensional logical timestamp (called a version)
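Both ideas can be sketched with a small store of differences keyed by version. Here a version is assumed to be an (epoch, iteration) pair ordered component-wise, so some versions are incomparable, and the state at a version is rebuilt by summing every difference at versions less than or equal to it. The `DifferenceTrace` class and the sample updates are hypothetical and say nothing about how Naiad indexes its data.

```python
from collections import Counter

def leq(s, t):
    # Product partial order on (epoch, iteration) versions.
    return s[0] <= t[0] and s[1] <= t[1]

class DifferenceTrace:
    def __init__(self):
        self.diffs = {}  # version -> Counter of record -> signed count change

    def update(self, version, record, delta):
        self.diffs.setdefault(version, Counter())[record] += delta

    def state_at(self, version):
        # Accumulate every difference whose version is <= the requested one.
        total = Counter()
        for v, diff in self.diffs.items():
            if leq(v, version):
                for record, delta in diff.items():
                    total[record] += delta
        return {r: c for r, c in total.items() if c != 0}

trace = DifferenceTrace()
trace.update((0, 0), "a", +1)
trace.update((0, 0), "b", +1)
trace.update((0, 1), "b", -1)   # iteration 1 of epoch 0 removes b
trace.update((1, 0), "c", +1)   # epoch 1 of the input adds c
```

Versions (0, 1) and (1, 0) are incomparable, so neither contributes to the other's state, yet both contribute to the state at (1, 1).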
Programming environment
Declarative query language based on the .NET Language Integrated
Query (LINQ)
LINQ extends C# with declarative operators, such as Select,
Where, Join, and GroupBy, among others
Naiad adds two more operators:
1 FixedPoint takes a source collection and a function that
maps a collection to another collection of the same type, and applies the
function repeatedly until the result reaches a fixed point
2 PrioritizedFP additionally takes a priority function to apply to
every record in the source collection
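The behaviour of a fixed-point operator can be sketched in a few lines: keep applying the function until the collection stops changing. This is a plain Python illustration with a hypothetical `fixed_point` helper, not Naiad's LINQ operator, and it recomputes from scratch on every round rather than incrementally.

```python
def fixed_point(source, step):
    # Apply step until the collection stops changing.
    current = frozenset(source)
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt

# Hypothetical example: nodes reachable from node 1 in a small directed graph.
edges = {(1, 2), (2, 3), (3, 4), (5, 6)}

def expand(reached):
    return reached | {dst for (src, dst) in edges if src in reached}

reachable = fixed_point({1}, expand)
```

The input and output of `expand` have the same type, which is what allows the result of one round to be fed into the next.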
Runtime
The Naiad runtime transforms declarative queries to a dataflow graph
The user program can insert differences into the input collections and
register callbacks to be invoked when differences are received at the
output collection
The runtime transparently distributes the execution of the dataflow
graph (similar to Dryad) across several cores and nodes
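The interaction pattern of inserting differences and receiving output differences through callbacks can be sketched with a single incremental operator. The `IncrementalCount` class below is a hypothetical stand-in that maintains per-key counts; it runs in one process and does not reflect Naiad's actual interfaces.

```python
from collections import Counter

class IncrementalCount:
    def __init__(self):
        self.counts = Counter()
        self.callbacks = []

    def on_output(self, callback):
        # Register a callback to receive each batch of output differences.
        self.callbacks.append(callback)

    def insert(self, diffs):
        # diffs: iterable of (key, delta) input differences.
        changed = Counter()
        for key, delta in diffs:
            changed[key] += delta
        out = []
        for key, delta in changed.items():
            if delta == 0:
                continue
            old = self.counts[key]
            new = old + delta
            # Output difference: retract the old (key, count), assert the new one.
            if old:
                out.append(((key, old), -1))
            if new:
                out.append(((key, new), +1))
            self.counts[key] = new
        for cb in self.callbacks:
            cb(out)

seen = []
op = IncrementalCount()
op.on_output(seen.append)
op.insert([("x", 1), ("y", 1), ("x", 1)])
op.insert([("x", -1)])
```

Only the keys touched by an input batch produce output differences, so the work done is proportional to the change rather than to the size of the whole collection.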
References
1 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar,
and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for
data processing. In Proceedings of the 2008 ACM SIGMOD
International Conference on Management of Data (SIGMOD ’08). ACM,
New York, NY, USA, 1099-1110.
2 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis
Fetterly. 2007. Dryad: distributed data-parallel programs from
sequential building blocks. In Proceedings of the 2nd ACM
SIGOPS/EuroSys European Conference on Computer Systems 2007
(EuroSys ’07). ACM, New York, NY, USA, 59-72.
3 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven
Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal
execution engine for distributed data-flow computing. In Proceedings of
the 8th USENIX Conference on Networked Systems Design and
Implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
4 Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard.
2013. Differential dataflow. In Conference on Innovative Data Systems
Research (CIDR ’13).
Topic 8: Enhancements and Alternative Architectures

  • 1. 8: Enhancements and Alternative Architectures Zubair Nabi zubair.nabi@itu.edu.pk April 19, 2013 Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 1 / 45
  • 2. Outline 1 Major shortcomings 2 Pig Latin 3 Dryad 4 CIEL 5 Naiad Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 2 / 45
  • 3. Outline 1 Major shortcomings 2 Pig Latin 3 Dryad 4 CIEL 5 Naiad Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 3 / 45
  • 4. Focusing on some Low-level programming interface Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 4 / 45
  • 5. Focusing on some Low-level programming interface Iterative and recursive applications Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 4 / 45
  • 6. Focusing on some Low-level programming interface Iterative and recursive applications Incremental computations Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 4 / 45
  • 7. Outline 1 Major shortcomings 2 Pig Latin 3 Dryad 4 CIEL 5 Naiad Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 5 / 45
  • 8. Introduction MapReduce is too low-level and rigid and leads to lots of custom user code Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 6 / 45
  • 9. Introduction MapReduce is too low-level and rigid and leads to lots of custom user code Pig Latin is a declarative language atop MapReduce designed by Yahoo! Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 6 / 45
  • 10. Introduction MapReduce is too low-level and rigid and leads to lots of custom user code Pig Latin is a declarative language atop MapReduce designed by Yahoo! Finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 6 / 45
  • 11. Introduction MapReduce is too low-level and rigid and leads to lots of custom user code Pig Latin is a declarative language atop MapReduce designed by Yahoo! Finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 6 / 45
  • 12. SQL query to find average pagerank for each large category of URLs 1 SELECT category , AVG(pagerank) 2 FROM urls WHERE pagerank > 0.2 3 GROUP BY category HAVING COUNT(∗) > 10^6 Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 7 / 45
  • 13. Equivalent Pig query 1 good_urls = FILTER urls BY pagerank > 0.2; 2 groups = GROUP good_urls BY category; 3 big_groups = FILTER groups BY COUNT(good_urls)>10^6; 4 output = FOREACH big_groups GENERATE 5 category , AVG(good_urls.pagerank); Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 8 / 45
  • 14. Pig Interface A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 9 / 45
  • 15. Pig Interface A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages In contrast, SQL consists of declarative constraints that collectively define the result Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 9 / 45
  • 16. Pig Interface A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages In contrast, SQL consists of declarative constraints that collectively define the result Each step carries out a single data transformation Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 9 / 45
  • 17. Pig Interface A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages In contrast, SQL consists of declarative constraints that collectively define the result Each step carries out a single data transformation A Pig Latin program is similar to specifying a query execution or a dataflow graph Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 9 / 45
  • 18. Pig Interface A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages In contrast, SQL consists of declarative constraints that collectively define the result Each step carries out a single data transformation A Pig Latin program is similar to specifying a query execution or a dataflow graph Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed Zubair Nabi 8: Enhancements and Alternative Architectures April 19, 2013 9 / 45
  • 22. Features
      • Support for a fully nested data model with complex data types
      • Extensive support for user-defined functions
      • Ability to operate over plain, schema-less input files
      • Open-source Apache project
  • 26. Interoperability
      • Queries can be run directly on raw data dumps
      • The user needs to provide a function that parses the content of a file into tuples
      • Similarly, the user needs to provide a function that converts tuples into a byte sequence
      • Datasets can span diverse storage systems and applications
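As a rough sketch of the two user-supplied functions mentioned above, assuming a tab-separated input format: the names `parse_line` and `serialize_tuple` are hypothetical and are not part of Pig's API.

```python
def parse_line(line: str) -> tuple:
    """Deserialize one tab-separated record into a (url, category, pagerank) tuple."""
    url, category, pagerank = line.rstrip("\n").split("\t")
    return (url, category, float(pagerank))


def serialize_tuple(record: tuple) -> bytes:
    """Serialize a tuple back into a tab-separated, newline-terminated byte sequence."""
    return ("\t".join(str(field) for field in record) + "\n").encode("utf-8")


# A made-up raw dump with two records.
raw_dump = "a.com\tnews\t0.75\nd.com\tsports\t0.5\n"
records = [parse_line(line) for line in raw_dump.splitlines()]
round_trip = b"".join(serialize_tuple(r) for r in records)
```

With such a pair of functions, the query layer only ever sees tuples, while the on-disk format stays whatever the producing application wrote.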
  • 31. UDFs as first-class citizens
      • A significant part of large-scale data analysis relies on custom processing
      • For instance, the user may want to determine whether a particular website is spam
      • All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs
      • UDFs can take non-atomic parameters as input and produce non-atomic values as output
      • UDFs are defined in Java
        groups = GROUP urls BY category;
        output = FOREACH groups GENERATE
                     category, top10(urls);
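To illustrate a UDF with non-atomic input and output, here is a minimal Python stand-in for `top10`. Pig UDFs are written in Java, as the slide states; Python is used here only for illustration, and both the ranking key (pagerank) and the sample data are assumptions.

```python
import heapq
from collections import defaultdict


def top10(bag):
    """Take a bag of (url, category, pagerank) tuples, return a bag of the 10 highest by pagerank."""
    return heapq.nlargest(10, bag, key=lambda t: t[2])


# Made-up input: 25 news URLs with increasing pagerank, plus one sports URL.
urls = [(f"site{i}.com", "news", i / 100) for i in range(25)]
urls.append(("only.com", "sports", 0.5))

# GROUP urls BY category
groups = defaultdict(list)
for u in urls:
    groups[u[1]].append(u)

# FOREACH groups GENERATE category, top10(urls)
output = {category: top10(bag) for category, bag in groups.items()}
```

The function receives a whole bag and returns a bag, which is the kind of nested value a flat relational model could not pass to a function directly.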
  • 35. Data Model
      Pig has four data types:
      1. Atom: A single atomic value such as a string or an integer
      2. Tuple: A sequence of values, each possibly of a different data type
      3. Bag: A collection of tuples
      4. Map: A collection of data items, each with an associated key
  • 40. Commands
      • LOAD: Load and deserialize an input file
      • FOREACH: Process each tuple of a dataset
      • FILTER: Filter a dataset based on some condition or UDF
      • COGROUP: Group together tuples from one or more datasets that are related in some way
      • STORE: Materialize the output of a Pig Latin expression to a file
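Of these commands, COGROUP is the least familiar, so a small Python sketch may help. It assumes two in-memory datasets and key-extraction functions; the `cogroup` helper and the sample data are hypothetical. Each output row is a key together with one nested bag per input dataset.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group tuples of two datasets by key; each row is (key, left_bag, right_bag)."""
    grouped = defaultdict(lambda: ([], []))
    for t in left:
        grouped[left_key(t)][0].append(t)
    for t in right:
        grouped[right_key(t)][1].append(t)
    # Sort by key only to make the output order deterministic.
    return [(k, bags[0], bags[1]) for k, bags in sorted(grouped.items())]


# Made-up datasets keyed by query string.
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("kings", "nhl.com")]
revenue = [("lakers", "top", 50), ("kings", "side", 20), ("iphone", "top", 40)]

grouped = cogroup(results, revenue, lambda t: t[0], lambda t: t[0])
```

Unlike a join, nothing is flattened: a key that appears in only one dataset still yields a row, with an empty bag on the other side.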
  • 43. Introduction
      • MapReduce is strictly two-stage, with a single input set and a single output set
      • This makes it an awkward architecture for multi-stage computations
  • 44. MapReduce: Architecture
  • 48. Dryad
      • Dryad allows computations that can be expressed as a Directed Acyclic Graph (DAG)
      • Each vertex in the graph is a computation, while each edge is a communication channel
      • Each computation can take multiple files as input and produce multiple outputs
      • Developed by Microsoft Research
  • 49. Dryad: Architecture
  • 50. Dryad: Architecture (2)
      [Figure labels: job manager; NS; PD; V (vertices); cluster; control plane: job schedule; data plane: files, TCP, FIFO, network]
  • 51. Dryad: Job
      Job = Directed Acyclic Graph
      • Processing vertices
      • Channels (file, pipe, shared memory)
      • Inputs
      • Outputs
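The job-as-DAG idea can be sketched in a few lines of Python. This is a single-process illustration rather than Dryad's C++ API: the `run_dag` helper, the vertex names, and the sample inputs are all hypothetical, and channels are reduced to in-memory values.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(vertices, inputs):
    """Run each vertex after all of its upstream vertices.

    vertices: name -> (function, list of upstream names)
    inputs:   name -> value, for the job inputs
    """
    graph = {name: deps for name, (_, deps) in vertices.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # job input, nothing to compute
            continue
        fn, deps = vertices[name]
        results[name] = fn(*[results[d] for d in deps])
    return results


# A made-up job: two parse vertices feed a merge vertex, which feeds a count vertex.
vertices = {
    "parse_a": (lambda rows: [r.lower() for r in rows], ["in_a"]),
    "parse_b": (lambda rows: [r.lower() for r in rows], ["in_b"]),
    "merge": (lambda a, b: sorted(set(a) | set(b)), ["parse_a", "parse_b"]),
    "count": (len, ["merge"]),
}
results = run_dag(vertices, {"in_a": ["X", "Y"], "in_b": ["y", "Z"]})
```

The `merge` vertex takes two inputs, something a single MapReduce stage cannot express without extra plumbing.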
  • 54. Channel types and job inputs and outputs
      • Channel types: File, TCP pipe, shared-memory FIFO
      • Encapsulation: Convert a graph into a single vertex and run it within the same process
      • Job inputs and outputs: Can be logically concatenated
  • 59. Vertices
      • Programming in C++/C#
      • Runtime library sets up and executes vertices
      • Map and Reduce classes
      • Process wrapper: To support legacy executables
      • Supports event-based programming
  • 60. Dryad: Example
      [Figure: Dryad graph for the query below, with vertex stages labeled D, M, S, Y, H, X, U, and N, annotated with n and 4n]
        select distinct p.objID
        from photoObjAll p
        join neighbors n                -- call this join "X"
          on p.objID = n.objID
          and n.objID < n.neighborObjID
          and p.mode = 1
        join photoObjAll l              -- call this join "Y"
          on l.objid = n.neighborObjID
          and l.mode = 1
          and abs((p.u-p.g)-(l.u-l.g))<0.05
          and abs((p.g-p.r)-(l.g-l.r))<0.05
          and abs((p.r-p.i)-(l.r-l.i))<0.05
          and abs((p.i-p.z)-(l.i-l.z))<0.05
  • 63. Operations
      • Create vertices using a C++ base class
      • Add edges
      • Merge two graphs
  • 64. Operations (2)
  • 70. Job Execution
      • A vertex can specify a “hard constraint” or a “preference” for where it runs
      • The job manager runs a greedy scheduling algorithm, assuming that it is the only job running on the cluster
      • Simple graph visualizer: Shows the state of each vertex and channel for small jobs
      • Web-based interface: Regularly updated statistics
      • Fault tolerance: Vertices are deterministic, so a failed vertex is simply re-scheduled
      • Speculative execution within stages
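Re-scheduling is safe because a deterministic vertex produces the same output on every run. A toy Python sketch of that idea follows; `FlakyMachine` and `run_vertex` are hypothetical names, and the retry loop is an assumption for illustration, not Dryad's actual scheduler.

```python
class FlakyMachine:
    """Hypothetical worker whose first `failures` executions crash."""

    def __init__(self, failures):
        self.failures = failures

    def execute(self, vertex, inputs):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("machine failure")
        return vertex(inputs)


def run_vertex(machine, vertex, inputs, max_attempts=3):
    """Re-run a deterministic vertex until it succeeds; return (result, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return machine.execute(vertex, inputs), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise


# The vertex here is just `sum`; it fails twice and succeeds on the third attempt.
result, attempts = run_vertex(FlakyMachine(failures=2), sum, [1, 2, 3])
```

Because the vertex has no side effects, the number of attempts never changes the result, only how long the job takes.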
  • 72. Run-time Graph Refinement
      • Aggregation tree: Distributed combiner
      • Applies when the computation is associative and commutative
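A minimal sketch of why associativity and commutativity matter: partial results can then be combined in any grouping, so a tree of combiners gives the same answer as a single sequential reduction. The `aggregation_tree` helper and its `fan_in` parameter are hypothetical, and the function assumes a non-empty input.

```python
from functools import reduce
from operator import add


def aggregation_tree(partials, combine, fan_in=2):
    """Combine partial results level by level, `fan_in` at a time, until one remains."""
    level = list(partials)
    while len(level) > 1:
        level = [
            reduce(combine, level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


partials = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up per-vertex partial sums
tree_total = aggregation_tree(partials, add)
wide_total = aggregation_tree(partials, add, fan_in=3)
flat_total = reduce(add, partials)
```

In a cluster, each inner list comprehension entry would be a separate combiner vertex, ideally placed close to the data it aggregates.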
  • 76. Introduction
      • MapReduce and Dryad are not amenable to iterative and recursive applications
      • Most machine learning and data mining applications are iterative in nature
      • These applications require data-dependent control flow: the ability to spawn new tasks on the fly based on the results of previous computations
  • 80. CIEL
      1. Data-centric execution engine from Cambridge: the goal of a CIEL job is to produce one or more output objects
      2. A reference to an object can be obtained without materializing its full contents, reminiscent of C pointers
          • If an object does not yet have its full contents, a reference to it is a future reference; otherwise it is a concrete reference
      3. A job makes progress by executing tasks
  • 85. Tasks
      1. Each task depends on one or more objects via references, and it starts executing once all of its references become concrete
      2. The purpose of each task is to produce objects
          1. A task can publish one or more objects by creating a concrete reference for them
          2. A task can also spawn new tasks and delegate the creation of output to them
      3. The dynamic task graph stores the relation between tasks and objects
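The task model above can be sketched as a tiny single-threaded scheduler. This is a toy model under simplifying assumptions, not CIEL's implementation: `Ref`, `Scheduler`, `const`, and the example tasks are hypothetical names. A task either publishes its output or spawns a child task and hands it the same output reference (delegation).

```python
class Ref:
    """Reference to an object: a future until published, then concrete."""

    def __init__(self):
        self.concrete = False
        self.value = None

    def publish(self, value):
        self.value, self.concrete = value, True


def const(value):
    """Build a concrete reference to an existing value."""
    ref = Ref()
    ref.publish(value)
    return ref


class Scheduler:
    def __init__(self):
        self.pending = []  # the dynamic task graph: (fn, dependencies, output)

    def spawn(self, fn, deps, output=None):
        if output is None:
            output = Ref()  # future reference to the task's result
        self.pending.append((fn, deps, output))
        return output

    def run(self):
        while self.pending:
            ready = [t for t in self.pending if all(d.concrete for d in t[1])]
            if not ready:
                raise RuntimeError("unresolved future reference")
            for task in ready:
                self.pending.remove(task)
                fn, deps, output = task
                fn(self, output, *[d.value for d in deps])


def add(sched, output, a, b):
    output.publish(a + b)


def total(sched, output, chunk):
    if len(chunk) <= 2:
        output.publish(sum(chunk))  # small enough: publish directly
        return
    mid = len(chunk) // 2
    left = sched.spawn(total, [const(chunk[:mid])])
    right = sched.spawn(total, [const(chunk[mid:])])
    sched.spawn(add, [left, right], output=output)  # delegate the output


sched = Scheduler()
result = sched.spawn(total, [const(list(range(10)))])
sched.run()
```

The shape of the task graph is not known up front: how many `total` tasks get spawned depends on the length of the data each one receives.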
  • 86. CIEL: Architecture
  • 87. CIEL: Dynamic Task Graph
  • 90. Executors
      1. CIEL decouples tasks from the underlying framework through the concept of executors
      2. Each programming language has a corresponding executor
      3. As a result, a task can be written in any supported language, such as Java, Python, or shell script, as well as CIEL's native Skywriting
  • 93. Skywriting
      1. Scripting language for expressing task-level parallelism atop CIEL
      2. Contains data-dependent control flow constructs such as while loops and recursive functions
      3. Ability to spawn new tasks in the middle of execution
  • 98. Skywriting constructs
      1. ref(url): Returns a reference to the object located at url (local or remote)
      2. spawn(f, [arg, ...]): Spawns a task to evaluate f
      3. exec(executor, args, n): Runs the given executor to evaluate args
      4. spawn_exec(executor, args, n): Spawns a new task to run the given executor
      5. * (unary operator): Dereferences the given reference
  • 99. Example: Skywriting
        function process_chunk(chunk, prev_result) {
            return spawn_exec(...);
        }
        function is_converged(curr_result, prev_result) {
            return spawn_exec(...)[0];
        }
        input_data = [ref("ciel://host137/chunk0"),
                      ref("ciel://host223/chunk1"), ...];
        curr = ...; // Initial guess at the result.
        do {
            prev = curr;
            curr = [];
            for (chunk in input_data) {
                curr += process_chunk(chunk, prev);
            }
        } while (!*is_converged(curr, prev));
        return curr;
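For readers who want to run something similar, here is a rough Python analogue of the loop above. It is a sketch under loud assumptions: `pool.submit` stands in for `spawn_exec`, `future.result()` for the dereference operator, and the bodies of `process_chunk` and `is_converged` (elided in the slide) are invented purely so the loop converges.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def process_chunk(chunk, prev):
    """Invented step: move halfway from the previous global estimate to this chunk's mean."""
    return (mean(prev) + mean(chunk)) / 2


def is_converged(curr, prev, tol=1e-9):
    """Invented test: no per-chunk result moved by more than `tol`."""
    return max(abs(c - p) for c, p in zip(curr, prev)) < tol


input_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-ins for remote chunks
curr = [0.0] * len(input_data)                   # initial guess at the result

with ThreadPoolExecutor() as pool:
    while True:
        prev = curr
        # One task per chunk; all tasks of an iteration can run in parallel.
        futures = [pool.submit(process_chunk, chunk, prev) for chunk in input_data]
        curr = [f.result() for f in futures]
        if is_converged(curr, prev):
            break
```

The number of iterations, and therefore the number of tasks, is only known once the data has been processed, which is exactly the data-dependent control flow a static DAG cannot express.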
  • 103. Introduction
      • A class of applications requires support for both iterative and incremental computation
      • For instance, maintaining in real time the strongly connected component structure of the graph induced by Twitter mentions
      • MapReduce itself has no support for either iterative or incremental computation
  • 108. Naiad
      • Data-intensive computing framework from Microsoft Research that supports both incremental and iterative computation by leveraging differential computation
      • Differential computation introduces two novel ideas:
          1. The state of the computation varies according to a partially ordered set of versions rather than a totally ordered one
          2. The set of updates required to reconstruct the state at any version is retained in an indexed data structure
      • The state and the updates to that state are associated with a multi-dimensional logical timestamp (called a version)
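A small Python sketch can make the two ideas concrete. It assumes two-dimensional versions of the form (epoch, iteration), ordered component-wise, and reconstructs the state at a version by summing all differences stored at versions less than or equal to it. The `DifferenceTrace` class is a hypothetical simplification, not Naiad's data structure.

```python
from collections import Counter


class DifferenceTrace:
    """Stores signed record-count differences, indexed by version."""

    def __init__(self):
        self.diffs = {}  # version tuple -> Counter(record -> signed count)

    def update(self, version, record, delta):
        self.diffs.setdefault(version, Counter())[record] += delta

    @staticmethod
    def leq(v, w):
        """Component-wise (partial) order on versions."""
        return all(a <= b for a, b in zip(v, w))

    def state_at(self, version):
        """Sum the differences of every version <= `version`."""
        state = Counter()
        for v, diff in self.diffs.items():
            if self.leq(v, version):
                for record, delta in diff.items():
                    state[record] += delta
        return {r: c for r, c in state.items() if c != 0}


trace = DifferenceTrace()
trace.update((0, 0), "a", +1)  # epoch 0, iteration 0: "a" appears
trace.update((0, 1), "b", +1)  # epoch 0, iteration 1: "b" appears
trace.update((1, 0), "a", -1)  # epoch 1, iteration 0: "a" is retracted
trace.update((1, 0), "c", +1)  # epoch 1, iteration 0: "c" appears
```

Versions (0, 1) and (1, 0) are incomparable: neither one's differences contribute to the other's state, yet both contribute to the state at (1, 1).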
  • 112. Programming environment
      • Declarative query language based on the .NET Language Integrated Query (LINQ)
      • LINQ extends C# with declarative operators, such as Select, Where, Join, and GroupBy, among others
      • Naiad adds two more operators:
          1. FixedPoint takes a source collection and a function that maps a collection to another collection of the same type, and applies the function repeatedly until a fixed point is reached
          2. PrioritizedFP additionally takes a priority function to apply to every record in the source collection
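The semantics of a fixed-point operator can be sketched in Python as "apply the function until the collection stops changing". This is a non-incremental illustration only; the `fixed_point` helper and the reachability example are hypothetical and do not reflect Naiad's LINQ signatures.

```python
def fixed_point(source, step):
    """Apply `step` to the collection until it no longer changes."""
    current = frozenset(source)
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt


# Made-up graph: which nodes are reachable from node 1?
edges = [(1, 2), (2, 3), (4, 5)]


def expand(nodes):
    """One round: keep the known nodes and add every direct successor."""
    return nodes | {dst for src, dst in edges if src in nodes}


reachable = fixed_point({1}, expand)
```

The step function has the same input and output type, which is what lets the operator feed its own result back in.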
  • 115. Runtime
      • The Naiad runtime transforms declarative queries into a dataflow graph
      • The user program can insert differences into the input collections and register callbacks to be invoked when differences are received at the output collection
      • The runtime transparently distributes the execution of the dataflow graph (similar to Dryad) across several cores and nodes
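The interaction pattern described above (insert input differences, receive output differences through callbacks) might look roughly like the following single-operator Python sketch. `IncrementalCount`, `on_output`, and `insert` are hypothetical names and not Naiad's API; the operator maintains a per-record count and reports only what changed.

```python
from collections import Counter


class IncrementalCount:
    """Maintains record counts; emits output differences ((record, count), +1 or -1)."""

    def __init__(self):
        self.counts = Counter()
        self.callbacks = []

    def on_output(self, callback):
        self.callbacks.append(callback)

    def insert(self, diffs):
        """diffs: list of (record, signed delta) applied to the input collection."""
        changed = Counter()
        for record, delta in diffs:
            changed[record] += delta
        out = []
        for record, delta in changed.items():
            if delta == 0:
                continue
            old = self.counts[record]
            new = old + delta
            self.counts[record] = new
            if old:
                out.append(((record, old), -1))  # retract the stale output
            if new:
                out.append(((record, new), +1))  # assert the new output
        for callback in self.callbacks:
            callback(out)


log = []
counter = IncrementalCount()
counter.on_output(log.extend)
counter.insert([("a", +1), ("b", +1), ("a", +1)])
counter.insert([("a", -1)])
```

The second call touches only record "a"; nothing is recomputed or reported for "b", which is the point of incremental computation.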
  • 116. References
      1. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1099-1110.
      2. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys ’07). ACM, New York, NY, USA, 59-72.
      3. Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11). USENIX Association, Berkeley, CA, USA.
  • 117. References (2)
      4. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential dataflow. In Conference on Innovative Data Systems Research (CIDR), 2013.