Introduction
Approaches
Social Graph
Consists of nodes and edges
Describes entities and their relations
Used by Facebook, Google, Amazon, etc.
About 100+ million nodes and 10+ billion edges
Effziente Verarbeitung von grossen Datenmengen Teil II (Efficient Processing of Large Amounts of Data, Part II)
Conclusion
Problems and Motivation
the amount of data exceeds the capability of a single machine
necessary to distribute data and computation
data access is managed by the framework
different requirements (latency, throughput)
TAO: Data Model
data is identified by 64-bit integers
Objects (id) → (otype, (key → value)*)
Associations (id1, atype, id2) → (time, (key → value)*)
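The two schemas above can be sketched as plain records; a minimal illustration, where the class and field names follow the slide's notation but are otherwise assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict

# Sketch of TAO's data model: objects and associations are typed
# records identified by 64-bit integers, each carrying a set of
# key-value attributes, per the schemas above.

@dataclass
class TaoObject:
    id: int                     # 64-bit identifier
    otype: str                  # object type, e.g. "user" or "post"
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*

@dataclass
class TaoAssociation:
    id1: int                    # source object id
    atype: str                  # association type, e.g. "likes"
    id2: int                    # target object id
    time: int                   # timestamp used for ordering
    data: Dict[str, str] = field(default_factory=dict)   # (key -> value)*

alice = TaoObject(id=1, otype="user", data={"name": "Alice"})
post = TaoObject(id=2, otype="post", data={"text": "hello"})
likes = TaoAssociation(id1=1, atype="likes", id2=2, time=1700000000)
```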
TAO: API
fixed set of queries:
assoc_add, assoc_delete, assoc_change_type
assoc_get, assoc_count, assoc_range, assoc_time_range
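To make the fixed query set concrete, here is a minimal in-memory sketch of some of these calls; the storage layout and simplified signatures are assumptions, not TAO's actual implementation:

```python
from collections import defaultdict

# In-memory sketch of part of the association API: association lists
# are kept per (id1, atype) and served newest-first by timestamp.

class AssocStore:
    def __init__(self):
        self._assocs = defaultdict(dict)   # (id1, atype) -> {id2: time}

    def assoc_add(self, id1, atype, id2, time):
        self._assocs[(id1, atype)][id2] = time

    def assoc_delete(self, id1, atype, id2):
        self._assocs[(id1, atype)].pop(id2, None)

    def assoc_get(self, id1, atype, id2set):
        lst = self._assocs[(id1, atype)]
        return {i: lst[i] for i in id2set if i in lst}

    def assoc_count(self, id1, atype):
        return len(self._assocs[(id1, atype)])

    def assoc_range(self, id1, atype, pos, limit):
        # newest-first, since association lists are time-ordered
        ordered = sorted(self._assocs[(id1, atype)].items(),
                         key=lambda kv: -kv[1])
        return ordered[pos:pos + limit]

store = AssocStore()
store.assoc_add(1, "likes", 10, 100)
store.assoc_add(1, "likes", 11, 200)
store.assoc_add(1, "likes", 12, 150)
newest = store.assoc_range(1, "likes", 0, 2)   # two most recent likes
```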
TAO: Architecture
data is divided into shards (via hashing)
each server handles one or more shards
an object and its associations reside in the same shard
an object never changes its shard
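The sharding rule above can be sketched in a few lines; the shard count and the modulo hash are illustrative assumptions:

```python
# Sketch of hash-based sharding: every object id maps to a fixed
# shard, and associations live with their source object, so this
# mapping is all that is needed to locate data.

NUM_SHARDS = 8

def shard_of(object_id: int) -> int:
    # a stable hash keeps an object on the same shard for its lifetime
    return object_id % NUM_SHARDS

def server_for(object_id: int, servers: list) -> str:
    # each server handles one or more shards
    return servers[shard_of(object_id) % len(servers)]

servers = ["srv-a", "srv-b", "srv-c"]
# object 42 and all associations starting at 42 resolve to one server
home = server_for(42, servers)
```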
TAO: Architecture
servers are divided into leaders and followers
clients always communicate with followers
cache misses and writes are redirected to the leader
slave servers support master servers if necessary
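The follower/leader read and write paths can be sketched as a two-tier cache; the class and method names below are assumptions for illustration:

```python
# Sketch of the two-tier cache: clients talk to a follower; on a
# cache miss or a write the follower forwards to the shard's leader,
# which talks to the authoritative store.

class Leader:
    def __init__(self, database):
        self.database = database          # authoritative store
    def read(self, key):
        return self.database.get(key)
    def write(self, key, value):
        self.database[key] = value

class Follower:
    def __init__(self, leader):
        self.leader = leader
        self.cache = {}
    def read(self, key):
        if key in self.cache:             # cache hit: served locally
            return self.cache[key]
        value = self.leader.read(key)     # cache miss: redirect to leader
        self.cache[key] = value
        return value
    def write(self, key, value):
        self.leader.write(key, value)     # writes go through the leader
        self.cache[key] = value

db = {"user:1": "Alice"}
follower = Follower(Leader(db))
name = follower.read("user:1")            # miss -> leader -> database
```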
TAO: Fault Tolerance and Performance
efficiency and availability are prioritized over consistency
down servers are marked globally
followers are interchangeable
slave databases are promoted to master if the master crashes
TAO: Fault Tolerance and Performance
Figure: Write Access Latencies
https://www.facebook.com/download/273893712748848/atc13-bronson.pdf
Horton: Data Model
similar to TAO
divided into partitions
additional data can be attached (e.g. key-value pairs)
directed edges are stored at both source and target
Horton: Architecture
Graph Client Library: translates a query into a regular expression
Graph Coordinator: translates the regular expression into a finite state machine and finds the most efficient execution plan
Graph Partition: executes the finite state machine and traverses the graph
Graph Manager: provides an interface to administer the graph
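How a finite state machine drives the traversal can be sketched on a toy graph; the query form (a sequence of node types) and the graph layout are assumptions, not Horton's actual implementation:

```python
# Sketch of FSM-driven traversal: each state matches one node type,
# and the machine advances a frontier of nodes along the edges.

graph = {                      # node -> (type, adjacency list)
    "alice": ("person", ["p1", "bob"]),
    "bob":   ("person", ["p2"]),
    "p1":    ("photo",  []),
    "p2":    ("photo",  []),
}

def run_fsm(start_nodes, states):
    """Advance a frontier of nodes through the states of the machine."""
    frontier = [n for n in start_nodes if graph[n][0] == states[0]]
    for state in states[1:]:
        frontier = [nxt for n in frontier
                    for nxt in graph[n][1] if graph[nxt][0] == state]
    return frontier

# "person-photo": persons, then the photos they point to
photos = run_fsm(graph, ["person", "photo"])
```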
Pregel: Basic Workflow
1. the task is copied to the worker machines; one machine is promoted to master
2. the master assigns one or more partitions to each worker
3. the master invokes supersteps
4. the graph is saved after the computation
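The superstep loop of step 3 can be sketched with the classic example of propagating the maximum vertex value; a minimal single-machine sketch, not Pregel's distributed implementation:

```python
# Sketch of Pregel's vertex-centric model: in each superstep every
# active vertex sends its value to its neighbours; a vertex that
# learns a larger value updates itself and stays active. The run
# halts when no vertex is active.

def pregel_max(vertices, edges):
    values = dict(vertices)
    active = set(values)                 # vertices that vote to continue
    while active:                        # one iteration = one superstep
        messages = {v: [] for v in values}
        for v in active:                 # active vertices send their value
            for nbr in edges.get(v, []):
                messages[nbr].append(values[v])
        active = set()
        for v, inbox in messages.items():
            best = max(inbox, default=values[v])
            if best > values[v]:         # value improved: stay active
                values[v] = best
                active.add(v)
    return values

vertices = {"a": 3, "b": 6, "c": 2}
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = pregel_max(vertices, edges)     # every vertex learns the max
```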
Pregel: Fault Tolerance and Performance
workers save their progress at checkpoint supersteps
worker failures are detected using pings
partitions of failed workers are reassigned to available workers
state is reloaded from the most recent available checkpoint superstep
the process terminates if the master fails
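The recovery steps above can be sketched end to end; the state layout and worker names are illustrative assumptions:

```python
import copy

# Sketch of checkpoint-based recovery: partitions checkpoint their
# state at a checkpoint superstep; when a worker dies, its partitions
# are reassigned and all partitions roll back to the last checkpoint.

partitions = {"p0": "worker-1", "p1": "worker-2"}   # partition -> worker
state = {"p0": 10, "p1": 12}                        # per-partition progress
checkpoint = copy.deepcopy(state)                   # checkpoint superstep

state["p0"] = 15                       # progress past the checkpoint
failed_worker = "worker-1"             # detected via missed pings

# reassign the failed worker's partitions to a live worker
for part, worker in partitions.items():
    if worker == failed_worker:
        partitions[part] = "worker-2"

state = copy.deepcopy(checkpoint)      # reload most recent checkpoint
```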
Pregel: Fault Tolerance and Performance
Figure: varying number of workers on a 1-billion-vertex binary tree
http://kowshik.github.io/JPregel/pregel paper.pdf
Trinity
developed by Microsoft
flexible in data and computation
supports online query processing and offline computation
built on top of a well-connected cluster (memory cloud)
based on TFS (similar to HDFS)
Strength: low latency and high throughput (not at the same time)
Trinity: Data Model
key-value store
one table for nodes
one table for each type of relation
relations are represented by id pairs in the corresponding table
customisation is possible with the Trinity Specification Language (TSL)
data is backed up in a persistent file system
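The table layout above can be sketched with ordinary dictionaries and lists; the concrete node attributes and relation names are assumptions for illustration:

```python
# Sketch of the data model: one key-value table for the nodes and
# one table per relation type, each relation stored as an id pair.

nodes = {                       # node table: id -> attributes
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}

follows = [(1, 2), (2, 3)]      # one table per relation type: id pairs
likes = [(1, 3)]

def neighbours(table, node_id):
    """All targets related to node_id in the given relation table."""
    return [dst for src, dst in table if src == node_id]

alice_follows = neighbours(follows, 1)
```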
Trinity: API
Trinity Desktop Environment (TDE)
supports query requests (similar to Horton/SQL)
supports offline computation (similar to Pregel)
Trinity: Architecture
Slaves: store a part of the data, process tasks and messages
Proxies: optional middle tier between slaves and clients; handle messages, do not store data
Clients: responsible for user interaction with the cluster
Trinity: Fault Tolerance and Performance
no ACID support, but operations are atomic
dead machines are replaced by alive ones, which reload their memory from TFS
the requesting machine waits until the dead machine is replaced
state is recovered from the most recent checkpoint superstep (similar to Pregel)
Trinity: Fault Tolerance and Performance
Figure: Response time of subgraph match queries
https://research.microsoft.com/pubs/161291/trinity.pdf
Unicorn
in-memory, social-graph-aware indexing system
backend of Facebook's search offering
based on Hadoop
Strength: typeahead
good performance on complex queries
Unicorn: Data Model
sharded data (similar to Facebook's TAO)
indices are built and converted using a custom Hadoop pipeline
Unicorn: API
queries are written in the Unicorn query language
e.g. (term likers:104076956295773)
≈ 6M likers of "Computer Science"
apply: allows querying a (truncated) set of ids and then using those ids to construct a new query
extract: attaches matches as metadata within the forward index of the query set
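The apply operator's rewrite step can be sketched over a toy inverted index; the index contents and ids below are made up for illustration:

```python
# Sketch of apply: run the inner query, take its (possibly truncated)
# result ids, substitute each into the outer term, and return the
# union of the rewritten terms' posting lists.

index = {                        # term -> posting list of ids
    "likers:42": [1, 2],
    "friend:1": [2, 3],
    "friend:2": [4],
}

def term(t):
    return index.get(t, [])

def apply(prefix, inner_ids, limit=None):
    ids = inner_ids[:limit] if limit else inner_ids   # may be truncated
    result = []
    for i in ids:                                     # rewrite: prefix + id
        for hit in term(f"{prefix}{i}"):
            if hit not in result:
                result.append(hit)
    return result

# (apply friend: likers:42) -> friends of the likers of object 42
friends_of_likers = apply("friend:", term("likers:42"))
```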
Unicorn: Architecture
top-aggregator: dispatches the query to one rack-aggregator per rack, then combines and returns the results
rack-aggregator: forwards the query to all index servers of its rack (high bandwidth) and combines their results
index servers: about 40-80 machines per rack; store adjacency lists and perform operations
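The fan-out/merge flow of this aggregation tree can be sketched in a few functions; the server counts, shard contents, and merge rule (sorted union) are assumptions:

```python
# Sketch of the aggregation tree: the top-aggregator fans the query
# out to one rack-aggregator per rack, each rack-aggregator queries
# its index servers, and results are merged on the way back up.

def index_server(shard, query):
    # each index server scans only its own portion of the data
    return [hit for hit in shard if query in hit]

def rack_aggregator(rack_shards, query):
    # queries all index servers in the rack over the fast local network
    results = []
    for shard in rack_shards:
        results.extend(index_server(shard, query))
    return results

def top_aggregator(racks, query):
    # one rack-aggregator per rack; combine and return the final result
    combined = []
    for rack in racks:
        combined.extend(rack_aggregator(rack, query))
    return sorted(combined)

racks = [
    [["user:1", "user:2"], ["user:3"]],   # rack 0, two index servers
    [["page:7", "user:4"]],               # rack 1, one index server
]
hits = top_aggregator(racks, "user")
```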
Unicorn: Fault Tolerance and Performance
sharding and replication
machines are replaced automatically
serving incomplete results is strongly preferred to serving empty results
Unicorn: Fault Tolerance and Performance
(apply friend: likers:104076956295773) ≈ friends of likers of "Computer Science"
https://www.facebook.com/download/138915572976390/UnicornVLDBfinal.pdf