3. MapReduce
● Specify a map and a reduce function
● The system takes care of
○ parallelization
○ partitioning
○ scheduling
○ communication
○ fault-tolerance
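The division of labor above can be sketched with a minimal in-process word count: the user supplies only `map_fn` and `reduce_fn`, and a stand-in "system" handles the grouping and shuffling (all names here are illustrative, not a real framework's API):

```python
from collections import defaultdict

# User code: only map and reduce are specified.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    yield key, sum(values)

# "System" code: shuffling and grouping (single-process stand-in for
# partitioning, scheduling, communication, fault-tolerance).
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)          # shuffle: group values by key
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():        # reduce each key group
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
```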
5. MapReduce Limitations
● Static Pipeline
● No support for common operations
● Data materialization after every job
● Slow - not fit for interactive analysis
● Complex configuration
7. What is Hadoop/MR NOT good for?
● All the things it wasn't built for
○ Iterative computations
○ Stream processing
○ Incremental computations
○ Interactive Analysis
○ [insert research paper here]
11. Iterative Processing
● Characteristics
○ Datasets already stored
○ Need to reuse a dataset more than once, possibly
multiple times
○ Iterative jobs, e.g. estimates, convergence
● Problems with iterative MR applications
○ manual orchestration of several MR jobs
○ re-loading & re-processing of invariant data
○ no explicit way to define a termination condition
14. Programming Model
● Iterative Programming Model
Rᵢ₊₁ = R₀ ∪ (Rᵢ ⋈ L)
● Extensions to MR
○ loop body
○ termination condition
○ loop-invariant data
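The recurrence above (the next result is the initial result unioned with the previous result joined against loop-invariant data L) can be sketched as a fixpoint loop. This is a hedged toy, not the system's implementation; transitive closure over a graph is the usual example, with L as the edge relation:

```python
# R0: initial facts; edges: the loop-invariant relation L.
def transitive_closure(r0, edges):
    r = set(r0)
    while True:
        # loop body: R_{i+1} = R_0 ∪ (R_i ⋈ L)
        joined = {(a, c) for (a, b) in r for (b2, c) in edges if b == b2}
        r_next = set(r0) | joined
        # termination condition: stop when the fixpoint is reached
        if r_next == r:
            return r
        r = r_next

edges = {(1, 2), (2, 3), (3, 4)}
paths = transitive_closure(edges, edges)
```

In a plain MR setting each iteration would be a separate job re-reading `edges`; caching the invariant relation is exactly what the extensions above target.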
16. Caching and Indexing
● Reducer Input Cache
○ caches and indexes reducer inputs
○ reduces M->R I/O
● Reducer Output Cache
○ stores and indexes most recent local reducer
outputs
○ reduces termination condition computation cost
● Mapper Input Cache
○ avoids non-local data reads in mappers
17. Stream Processing
● Characteristics
○ Data continuously comes into the system
○ Usually needs to be processed as it arrives
○ Frequent updates
● Problems with stream MR applications
○ runs on a static snapshot of a dataset
○ computations need to finish
19. Programming Model
● MapUpdate
○ operates on streams, i.e. sequence of events with
the same id in increasing timestamp order
● Slates
○ in-memory data structures which "summarize"
all events with key k that an Update function has
seen so far
20. Example Applications
● An application that monitors the
FourSquare-checkin stream to count the
number of checkins per retailer and
displays the count on a Web page
● Detect "hot" topics in Twitter
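A toy MapUpdate sketch of the first example (counting checkins per retailer); the function names and slate layout are assumptions for illustration, not the real API:

```python
# Slates: in-memory per-key summaries of all events seen so far.
slates = {}   # key -> slate

def map_fn(event):
    # key each checkin event by its retailer
    yield event["retailer"], event

def update_fn(key, event, slate):
    # fold the event into the slate for this key
    slate["count"] = slate.get("count", 0) + 1
    return slate

def process(stream):
    # events are consumed as they arrive, never materialized as a snapshot
    for event in stream:
        for key, ev in map_fn(event):
            slate = slates.setdefault(key, {})
            update_fn(key, ev, slate)

process([{"retailer": "A"}, {"retailer": "B"}, {"retailer": "A"}])
```

A Web page would then simply read the current slate for each retailer.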
27. Incremental MapReduce
● Incremental Map
○ persistently store intermediate results
○ insert reference to memoization server
○ query memoization server and fetch result if
already computed
● Incremental Reduce
○ persistently store entire task computations
○ store and map sub-computations used in the
Contraction phase
28. Contraction Phase
● Break up large Reduce tasks into many
applications of the Combine function
● Only a subset of Combiners needs to be
re-executed
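A minimal sketch of the contraction idea, assuming a dict as a stand-in for the memoization server and content-based keys: the reduce is broken into a tree of Combine applications, and after a small input change only the combines on the affected path re-execute:

```python
memo = {}        # stand-in for the memoization server
calls = []       # instrumentation: which combines actually ran

def combine(chunk_a, chunk_b):
    key = (chunk_a, chunk_b)           # content-based memoization key
    if key not in memo:
        calls.append(key)
        memo[key] = chunk_a + chunk_b  # example combiner: sum
    return memo[key]

def contract(chunks):
    # pairwise contraction tree over the partial results
    chunks = list(chunks)
    while len(chunks) > 1:
        chunks = [combine(chunks[i], chunks[i + 1]) if i + 1 < len(chunks)
                  else chunks[i] for i in range(0, len(chunks), 2)]
    return chunks[0]

total1 = contract((1, 2, 3, 4))
first_run = len(calls)                 # 3 combines on the first run
total2 = contract((1, 2, 3, 5))        # one leaf changed
second_run = len(calls) - first_run    # only 2 combines re-executed
```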
31. Upload Pipeline
● HDFS upload pipeline is changed so that:
○ the Client creates PAX blocks
○ Datanodes do not flush data or checksums to
disk
○ After all chunks of a block have been received,
the block is sorted in memory and flushed
○ Each DataNode computes its own checksums
32. Query Pipeline
Transparency is achieved using UDFs:
● HailInputFormat
○ elaborate splitting policy
○ scheduling takes relevant indexes into account
● HailRecordReader
○ Uses user annotation / configuration info to
select records for map phase
○ transforms records from PAX to row format
34. How to limit Disk I/O?
● Process records in memory and spill to
disk as rarely as possible
● Relax fault-tolerance guarantees
○ job-level recovery
● Dynamic memory management
○ pluggable policies
● Per-node I/O management
○ organize data in large batches
35. Memory policies
● Pool-based
○ fixed-sized pre-allocated buffers
● Quota-based
○ controls dataflow between computational stages
using queues
● Constraint-based
○ dynamically adjusts memory allocation based on
requests and available memory
36. System Overview
Data-flow graph consisting of stages:
● Phase Zero extracts information about
distribution of records and keys
● Phase One implements mapping and
shuffling
● Phase Two implements the sorting and
reduce, always keeping results in
memory
38. System Overview
● Built as an extension to Pig
● When a workflow is submitted, ReStore:
○ re-writes the query to reuse stored results
○ stores outputs of the workflow
○ stores results of sub-jobs
○ decides which outputs to store in HDFS and
which to delete
42. Idea
● Apply well-known query optimization
techniques to Map-Reduce jobs
● Static analysis of compiled code
● Apply optimizations only when "safe"
44. Example Optimizations
● Selection
○ if the map function is a filter, use a B+Tree to
only scan the relevant portion of the input
● Projection
○ eliminate unnecessary fields from input records
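The selection case can be sketched with Python's `bisect` as a stand-in for a B+Tree: once static analysis proves the map function filters on a sorted key, only the relevant key range is scanned instead of the whole input (all shapes here are assumptions for illustration):

```python
import bisect

# Input sorted by key; the key list plays the role of the index.
records = [(k, "val%d" % k) for k in range(0, 100, 2)]
keys = [k for k, _ in records]

def scan_range(lo, hi):
    # B+Tree stand-in: binary search to the first and last relevant record
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return records[start:end]

# A map function known to filter on 10 <= key <= 20 touches 6 records,
# not all 50.
selected = scan_range(10, 20)
```

Projection is analogous: drop unneeded fields from each record before they reach the map function.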
46. Common Types of Skew
● Uneven distribution of input data
○ partitioning which does not guarantee even
distribution
○ popular key groups
● Expensive records
○ some portions of the input take longer to process
than others
47. System Overview
● Per-task progress estimation
● Per-task statistics
● Late skew detection
○ skew mitigation is delayed until a slot is
available
● Only re-partition one task at a time
○ only when the re-partitioning overhead is less
than half of the task's remaining time
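The late-detection rule reduces to a simple cost check per straggler; a hedged sketch (names and units assumed):

```python
# Mitigate a straggler only when the expected benefit outweighs the cost:
# re-partitioning pays off roughly when its overhead is less than half of
# the task's estimated remaining time.
def should_repartition(remaining_secs, repartition_overhead_secs):
    return repartition_overhead_secs < remaining_secs / 2

decisions = [
    should_repartition(120, 30),   # long straggler: worth re-partitioning
    should_repartition(20, 30),    # nearly done: let it finish
]
```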
48. Implementation
Re-partition a map task
● mitigators execute as mappers within a
new MapReduce job
● output is written to HDFS
Re-partition a reduce task
● mitigator job with an identity map reads
input from the task tracker
51. Job-Level Tuning
● Just-in-Time Optimizer
○ choose efficient execution techniques, e.g. joins
● Profiler
○ learns performance models, job profiles
● Sampler
○ collects statistics about input, intermediate and
output data
○ helps the profiler build approximate models
52. Workflow-Level Tuning
● Workflow-aware Scheduler
○ exploring data locality on workflow-level instead
of making locally optimal decisions
● What-If Engine
○ answers questions based on simulations of job
executions
53. Workload-Level Tuning
● Workload Optimizer
○ Data-flow sharing
○ Materialization of intermediate results for reuse
○ Reorganization
● Elastisizer
○ node and network configuration automation
59. Graph Optimizations
● Schedule vertices close to the input data
● If a computation is associative and
commutative, use an aggregation tree
● Dynamically refine the graph based on
output data sizes
○ vary number of vertices in each stage,
connectivity
62. SCOPE scripting language
● resembles SQL with C# expressions
● commands are data transformation
operators
● extensible MapReduce-like commands
63. SCOPE Execution
● The Compiler creates internal parse tree
● The Optimizer creates a parallel
execution plan, i.e. a Cosmos job
● The Job Manager constructs the graph
and schedules execution
65. RDDs
Read-only collection of objects
● partitioned across machines
● store their "lineage"
● can be re-constructed
● users can control persistence and
partitioning
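A toy lineage sketch (not Spark's implementation): each RDD records only its parent and transformation, so a lost partition can be recomputed by replaying lineage instead of being replicated, while the user opts into caching:

```python
class RDD:
    def __init__(self, source=None, parent=None, transform=None):
        self.source, self.parent, self.transform = source, parent, transform
        self.cache = None                 # user-controlled persistence

    def map(self, f):
        return RDD(parent=self, transform=lambda d: [f(x) for x in d])

    def filter(self, p):
        return RDD(parent=self, transform=lambda d: [x for x in d if p(x)])

    def compute(self):
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            return list(self.source)      # base data, e.g. read from HDFS
        # replay lineage: recompute the parent, then apply this transform
        return self.transform(self.parent.compute())

base = RDD(source=[1, 2, 3, 4])
evens_doubled = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
result = evens_doubled.compute()          # re-constructible at any time
```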
66. Programming Model
● Scala API
● driver program
○ defines RDDs and
actions on them
● workers
○ long-lived processes
○ store and process RDD
partitions in-memory
70. System Overview
● Execution plan in the form of a DAG
● Abstracts parallelization and
communication
● Optimizer to choose best execution
strategy
71. Programming Model
● Input Contracts:
○ give guarantees on how data is organized into
independent subsets
○ Map, Reduce, Match, Cross, CoGroup
● Output Contracts:
○ define properties on the output data
○ Same-Key, Super-Key, Unique-Key
73. Evolving World Model
● As-of queries
○ What is the best route to get to the Olympic
Stadium right now?
○ What is the traffic situation like on Saturday
nights close to the city center?
○ How many visitors that visited the City Hall
during the past year also went for dinner in that
nearby restaurant?
74. Data Model - Query Language
● Semi-structured data model, ADM
○ dataset ~ table: indexed, partitioned, replicated
○ dataverse ~ database
○ DDL: primary key, partitioning key
○ "open" data schemes
● AQL query language
○ declarative, inspired by Jaql and XQuery
○ logical plan -> DAG -> Hyracks Job
77. Columnar Storage
● lossless representation
○ save field types, repetition/definition levels
● fast encoding
○ recursively traverses record and computes levels
● efficient record assembly
○ use a FSM to reconstruct records
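A heavily reduced sketch of the lossless representation, assuming a single optional path `a.b` and ignoring repetition levels: the definition level records how many optional fields along the path are actually present, so missing values stay distinguishable in the column:

```python
def stripe(records, path=("a", "b")):
    column = []   # one (value, definition_level) pair per record
    for rec in records:
        node, level = rec, 0
        for field in path:
            if node is None or field not in node:
                node = None
                break
            node = node[field]
            level += 1
        # value is present only if every field on the path was defined
        column.append((node if level == len(path) else None, level))
    return column

# {"a": {"b": 7}} -> fully defined; {"a": {}} -> b missing; {} -> a missing
col = stripe([{"a": {"b": 7}}, {"a": {}}, {}])
```

Record assembly runs this in reverse, using the levels (in the full design, driven by an FSM) to decide where each column value belongs in the reconstructed record.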
78. Query Execution
● Language based on SQL
● Tree architecture
○ Root server
■ receives incoming queries
■ reads table metadata
■ routes queries to the next level of the tree
○ Leaf servers
■ communicate with storage layer
79. Query Dispatcher
● Schedules queries to available slots
● Balances the load
● Assures fault-tolerance
● Specifies the percentage of tablets that must
be scanned before a result is returned
83. Skywriting Language
● Turing-complete
● Arbitrary data-dependent control flow
○ while loops
○ recursive functions
● Supports invocation of code written in
other languages