3. MapReduce
● Specify a map and a reduce function
● The system takes care of
○ parallelization
○ partitioning
○ scheduling
○ communication
○ fault-tolerance
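The division of labor above can be sketched with a minimal in-process word count: the user supplies only `map_fn` and `reduce_fn`, and a stand-in "system" handles the grouping and shuffling (all names here are illustrative, not a real framework's API):

```python
from collections import defaultdict

# User code: only map and reduce are specified.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    yield key, sum(values)

# "System" code: shuffling and grouping (single-process stand-in for
# partitioning, scheduling, communication, fault-tolerance).
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)          # shuffle: group values by key
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():        # reduce each key group
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
```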
5. MapReduce Limitations
● Static Pipeline
● No support for common operations
● Data materialization after every job
● Slow - not fit for interactive analysis
● Complex configuration
7. What is Hadoop/MR NOT good for?
● All the things it wasn't built for
○ Iterative computations
○ Stream processing
○ Incremental computations
○ Interactive Analysis
○ [insert research paper here]
11. Iterative Processing
● Characteristics
○ Datasets already stored
○ Need to reuse a dataset more than once, possibly
multiple times
○ Iterative jobs, e.g. estimates, convergence
● Problems with iterative MR applications
○ manual orchestration of several MR jobs
○ re-loading & re-processing of invariant data
○ no explicit way to define a termination condition
14. Programming Model
● Iterative Programming Model
Rᵢ₊₁ = R₀ ∪ (Rᵢ ⋈ L)
● Extensions to MR
○ loop body
○ termination condition
○ loop-invariant data
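The recurrence above (the next result is the initial result unioned with the previous result joined against loop-invariant data L) can be sketched as a fixpoint loop. This is a hedged toy, not the system's implementation; transitive closure over a graph is the usual example, with L as the edge relation:

```python
# R0: initial facts; edges: the loop-invariant relation L.
def transitive_closure(r0, edges):
    r = set(r0)
    while True:
        # loop body: R_{i+1} = R_0 ∪ (R_i ⋈ L)
        joined = {(a, c) for (a, b) in r for (b2, c) in edges if b == b2}
        r_next = set(r0) | joined
        # termination condition: stop when the fixpoint is reached
        if r_next == r:
            return r
        r = r_next

edges = {(1, 2), (2, 3), (3, 4)}
paths = transitive_closure(edges, edges)
```

In a plain MR setting each iteration would be a separate job re-reading `edges`; caching the invariant relation is exactly what the extensions above target.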
16. Caching and Indexing
● Reducer Input Cache
○ caches and indexes reducer inputs
○ reduces M->R I/O
● Reducer Output Cache
○ stores and indexes most recent local reducer
outputs
○ reduces termination condition computation cost
● Mapper Input Cache
○ avoids non-local data reads in mappers
17. Stream Processing
● Characteristics
○ Data continuously comes into the system
○ Usually needs to be processed as it arrives
○ Frequent updates
● Problems with stream MR applications
○ runs on a static snapshot of a dataset
○ computations need to finish
19. Programming Model
● MapUpdate
○ operates on streams, i.e. sequence of events with
the same id in increasing timestamp order
● Slates
○ in-memory data structures which "summarize"
all events with key k that an Update function has
seen so far
20. Example Applications
● An application that monitors the
FourSquare-checkin stream to count the
number of checkins per retailer and
displays the count on a Web page
● Detect "hot" topics in Twitter
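A toy MapUpdate sketch of the first example (counting checkins per retailer); the function names and slate layout are assumptions for illustration, not the real API:

```python
# Slates: in-memory per-key summaries of all events seen so far.
slates = {}   # key -> slate

def map_fn(event):
    # key each checkin event by its retailer
    yield event["retailer"], event

def update_fn(key, event, slate):
    # fold the event into the slate for this key
    slate["count"] = slate.get("count", 0) + 1
    return slate

def process(stream):
    # events are consumed as they arrive, never materialized as a snapshot
    for event in stream:
        for key, ev in map_fn(event):
            slate = slates.setdefault(key, {})
            update_fn(key, ev, slate)

process([{"retailer": "A"}, {"retailer": "B"}, {"retailer": "A"}])
```

A Web page would then simply read the current slate for each retailer.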
27. Incremental MapReduce
● Incremental Map
○ persistently store intermediate results
○ insert reference to memoization server
○ query memoization server and fetch result if
already computed
● Incremental Reduce
○ persistently store entire task computations
○ store and map sub-computations used in the
Contraction phase
28. Contraction Phase
● Break up large Reduce tasks into many
applications of the Combine function
● Only a subset of Combiners needs to be
re-executed
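A minimal sketch of the contraction idea, assuming a dict as a stand-in for the memoization server and content-based keys: the reduce is broken into a tree of Combine applications, and after a small input change only the combines on the affected path re-execute:

```python
memo = {}        # stand-in for the memoization server
calls = []       # instrumentation: which combines actually ran

def combine(chunk_a, chunk_b):
    key = (chunk_a, chunk_b)           # content-based memoization key
    if key not in memo:
        calls.append(key)
        memo[key] = chunk_a + chunk_b  # example combiner: sum
    return memo[key]

def contract(chunks):
    # pairwise contraction tree over the partial results
    chunks = list(chunks)
    while len(chunks) > 1:
        chunks = [combine(chunks[i], chunks[i + 1]) if i + 1 < len(chunks)
                  else chunks[i] for i in range(0, len(chunks), 2)]
    return chunks[0]

total1 = contract((1, 2, 3, 4))
first_run = len(calls)                 # 3 combines on the first run
total2 = contract((1, 2, 3, 5))        # one leaf changed
second_run = len(calls) - first_run    # only 2 combines re-executed
```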
31. Upload Pipeline
● HDFS upload pipeline is changed so that:
○ the Client creates PAX blocks
○ Datanodes do not flush data or checksums to
disk
○ After all chunks of a block have been received,
the block is sorted in memory and flushed
○ Each DataNode computes its own checksums
32. Query Pipeline
Transparency is achieved using UDFs:
● HailInputFormat
○ elaborate splitting policy
○ scheduling takes relevant indexes into account
● HailRecordReader
○ Uses user annotation / configuration info to
select records for map phase
○ transforms records from PAX to row format
34. How to limit Disk I/O?
● Process records in memory and spill to
disk as rarely as possible
● Relax fault-tolerance guarantees
○ job-level recovery
● Dynamic memory management
○ pluggable policies
● Per-node I/O management
○ organize data in large batches
35. Memory policies
● Pool-based
○ fixed-sized pre-allocated buffers
● Quota-based
○ controls dataflow between computational stages
using queues
● Constraint-based
○ dynamically adjusts memory allocation based on
requests and available memory
36. System Overview
Data-flow graph consisting of stages:
● Phase Zero extracts information about
distribution of records and keys
● Phase One implements mapping and
shuffling
● Phase Two implements the sorting and
reduce, always keeping results in
memory
38. System Overview
● Built as an extension to Pig
● When a workflow is submitted, ReStore:
○ re-writes the query to reuse stored results
○ stores outputs of the workflow
○ stores results of sub-jobs
○ decides which outputs to store in HDFS and
which to delete
42. Idea
● Apply well-known query optimization
techniques to Map-Reduce jobs
● Static analysis of compiled code
● Apply optimizations only when "safe"
44. Example Optimizations
● Selection
○ if the map function is a filter, use a B+Tree to
only scan the relevant portion of the input
● Projection
○ eliminate unnecessary fields from input records
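The selection case can be sketched with Python's `bisect` as a stand-in for a B+Tree: once static analysis proves the map function filters on a sorted key, only the relevant key range is scanned instead of the whole input (all shapes here are assumptions for illustration):

```python
import bisect

# Input sorted by key; the key list plays the role of the index.
records = [(k, "val%d" % k) for k in range(0, 100, 2)]
keys = [k for k, _ in records]

def scan_range(lo, hi):
    # B+Tree stand-in: binary search to the first and last relevant record
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return records[start:end]

# A map function known to filter on 10 <= key <= 20 touches 6 records,
# not all 50.
selected = scan_range(10, 20)
```

Projection is analogous: drop unneeded fields from each record before they reach the map function.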
46. Common Types of Skew
● Uneven distribution of input data
○ partitioning which does not guarantee even
distribution
○ popular key groups
● Expensive records
○ some portions of the input take longer to process
than others
47. System Overview
● Per-task progress estimation
● Per-task statistics
● Late skew detection
○ skew mitigation is delayed until a slot is
available
● Only re-partition one task at a time
○ only when the re-partitioning overhead is less
than half of the task's remaining time
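The late-detection rule reduces to a simple cost check per straggler; a hedged sketch (names and units assumed):

```python
# Mitigate a straggler only when the expected benefit outweighs the cost:
# re-partitioning pays off roughly when its overhead is less than half of
# the task's estimated remaining time.
def should_repartition(remaining_secs, repartition_overhead_secs):
    return repartition_overhead_secs < remaining_secs / 2

decisions = [
    should_repartition(120, 30),   # long straggler: worth re-partitioning
    should_repartition(20, 30),    # nearly done: let it finish
]
```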
48. Implementation
Re-partition a map task
● mitigators execute as mappers within a
new MapReduce job
● output is written to HDFS
Re-partition a reduce task
● mitigator job with an identity map reads
input from the task tracker
51. Job-Level Tuning
● Just-in-Time Optimizer
○ choose efficient execution techniques, e.g. joins
● Profiler
○ learns performance models, job profiles
● Sampler
○ collects statistics about input, intermediate and
output data
○ helps the profiler build approximate models
52. Workflow-Level Tuning
● Workflow-aware Scheduler
○ exploring data locality on workflow-level instead
of making locally optimal decisions
● What-If Engine
○ answers questions based on simulations of job
executions
53. Workload-Level Tuning
● Workload Optimizer
○ Data-flow sharing
○ Materialization of intermediate results for reuse
○ Reorganization
● Elastisizer
○ node and network configuration automation
59. Graph Optimizations
● Schedule vertices close to the input data
● If a computation is associative and
commutative, use an aggregation tree
● Dynamically refine the graph based on
output data sizes
○ vary number of vertices in each stage,
connectivity
62. SCOPE scripting language
● resembles SQL with C# expressions
● commands are data transformation
operators
● extensible MapReduce-like commands
63. SCOPE Execution
● The Compiler creates internal parse tree
● The Optimizer creates a parallel
execution plan, i.e. a Cosmos job
● The Job Manager constructs the graph
and schedules execution
65. RDDs
Read-only collection of objects
● partitioned across machines
● store their "lineage"
● can be re-constructed
● users can control persistence and
partitioning
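A toy lineage sketch (not Spark's implementation): each RDD records only its parent and transformation, so a lost partition can be recomputed by replaying lineage instead of being replicated, while the user opts into caching:

```python
class RDD:
    def __init__(self, source=None, parent=None, transform=None):
        self.source, self.parent, self.transform = source, parent, transform
        self.cache = None                 # user-controlled persistence

    def map(self, f):
        return RDD(parent=self, transform=lambda d: [f(x) for x in d])

    def filter(self, p):
        return RDD(parent=self, transform=lambda d: [x for x in d if p(x)])

    def compute(self):
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            return list(self.source)      # base data, e.g. read from HDFS
        # replay lineage: recompute the parent, then apply this transform
        return self.transform(self.parent.compute())

base = RDD(source=[1, 2, 3, 4])
evens_doubled = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
result = evens_doubled.compute()          # re-constructible at any time
```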
66. Programming Model
● Scala API
● driver program
○ defines RDDs and
actions on them
● workers
○ long-lived processes
○ store and process RDD
partitions in-memory
70. System Overview
● Execution plan in the form of a DAG
● Abstracts parallelization and
communication
● Optimizer to choose best execution
strategy
71. Programming Model
● Input Contracts:
○ give guarantees on how data is organized into
independent subsets
○ Map, Reduce, Match, Cross, CoGroup
● Output Contracts:
○ define properties on the output data
○ Same-Key, Super-Key, Unique-Key
73. Evolving World Model
● As-of queries
○ What is the best route to get to the Olympic
Stadium right now?
○ What is the traffic situation like on Saturday
nights close to the city center?
○ How many visitors that visited the City Hall
during the past year also went for dinner in that
nearby restaurant?
74. Data Model - Query Language
● Semi-structured data model, ADM
○ dataset ~ table: indexed, partitioned, replicated
○ dataverse ~ database
○ DDL: primary key, partitioning key
○ "open" data schemes
● AQL query language
○ declarative, inspired by Jaql and XQuery
○ logical plan -> DAG -> Hyracks Job
77. Columnar Storage
● lossless representation
○ save field types, repetition/definition levels
● fast encoding
○ recursively traverses record and computes levels
● efficient record assembly
○ use a FSM to reconstruct records
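A heavily reduced sketch of the lossless representation, assuming a single optional path `a.b` and ignoring repetition levels: the definition level records how many optional fields along the path are actually present, so missing values stay distinguishable in the column:

```python
def stripe(records, path=("a", "b")):
    column = []   # one (value, definition_level) pair per record
    for rec in records:
        node, level = rec, 0
        for field in path:
            if node is None or field not in node:
                node = None
                break
            node = node[field]
            level += 1
        # value is present only if every field on the path was defined
        column.append((node if level == len(path) else None, level))
    return column

# {"a": {"b": 7}} -> fully defined; {"a": {}} -> b missing; {} -> a missing
col = stripe([{"a": {"b": 7}}, {"a": {}}, {}])
```

Record assembly runs this in reverse, using the levels (in the full design, driven by an FSM) to decide where each column value belongs in the reconstructed record.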
78. Query Execution
● Language based on SQL
● Tree architecture
○ Root server
■ receives incoming queries
■ reads table metadata
■ routes queries to the next level of the tree
○ Leaf servers
■ communicate with storage layer
79. Query Dispatcher
● Schedules queries to available slots
● Balances the load
● Assures fault-tolerance
● Specifies the percentage of tablets that must
be scanned before a result is returned
83. Skywriting Language
● Turing-complete
● Arbitrary data-dependent control flow
○ while loops
○ recursive functions
● Supports invocation of code written in
other languages