1. NEARING THE EVENT HORIZON.
HADOOP WAS PREDICTABLE, WHAT’S NEXT?
May 23, 2012 Mike Miller
mike@cloudant.com
@mlmilleratmit
2. What I Am
Cloudant Founder, Chief Scientist
(we’re hiring at all positions)
Affiliate Assistant Professor, Particle Physics(UW)
Background: machine learning, analysis, big data,
globally distributed systems
Mike Miller, GlueCon May 2012 2
3. What I Am
A CDN for your Application Data
4. What I Am Not
didn’t see these coming
Superluminal neutrinos
Red Sox epic collapse in September
Red Wings losing in the first round
...
But here I go anyway
5. My First Postulate of Big-Data
Google Matters
What matters for Google...
... matters for the internet...
...and therefore matters for the enterprise...
... will therefore be re-architected by Apache...
... and therefore matters to you.
6. Evidence
Business Week, 12/24/2007
9. The Old Canon
• Google File System (the important one)
http://labs.google.com/papers/gfs.html
• MapReduce (the big one)
http://labs.google.com/papers/mapreduce.html
• BigTable (clone me!)
http://labs.google.com/papers/bigtable.html
• Dynamo (ok, AWS. but masterless quorum)
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
copy these. use these. print $$$
10. MapReduce: The Awesome
• Approachable interface
“What do I do with a single piece of data?”
• Data Parallel
Developers can basically forget about scatter-gather
• Fault Tolerant
Failure at scale is the norm!
Protects both user and system operator
• IO Optimized
Built for sequential IO
commodity disks spinning forward at O(20 MB/sec) each
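The "single piece of data" interface above can be sketched as a toy in-memory map/reduce. This is an illustration of the programming model only (the `map_fn`/`reduce_fn`/`mapreduce` names are mine, not Hadoop's API); a real framework adds the distributed shuffle, fault tolerance, and sequential IO the slide describes.

```python
from collections import defaultdict

def map_fn(record):
    # "What do I do with a single piece of data?" -- emit (key, value) pairs
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # combine every value observed for one key
    return key, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    # shuffle: group intermediate pairs by key (the framework's job,
    # so developers can "basically forget about scatter-gather")
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["big data", "big graphs"], map_fn, reduce_fn)
# counts == {'big': 2, 'data': 1, 'graphs': 1}
```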
13. So... is that it?
http://gigaom.com/cloud/democratizing-big-data-is-hadoop-our-only-hope/
http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/
http://mackiemathew.com/2012/02/25/the-problems-in-hadoop-when-does-it-fail-to-deliver/
14. MapReduce: The not so Awesome
• Hadoop doesn’t power big data applications
Not a transactional datastore. Slosh back and forth via ETL
• Processing latency
Non-incremental, must re-slurp entire dataset every pass
• Ad-Hoc queries
Bare metal interface, data import
• Graphs
Only a handful of graph problems amenable to MR
http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.120
15. To the Event Horizon
16. Enter The New Canon
• Percolator
incremental processing
http://research.google.com/pubs/pub36726.html
• Dremel
ad-hoc analysis queries
http://research.google.com/pubs/pub36632.html
• Pregel
Big graphs
http://dl.acm.org/citation.cfm?id=1807184
Scalable, Fault Tolerant, Approachable
18. Percolator: incremental processing
• Replaced MapReduce as the tool to build search index
“However, reprocessing the entire web discards the work done in earlier runs and makes latency
proportional to the size of the repository, rather than the size of the update.”
• Bigtable alone can’t do it
“BigTable scales...but doesn’t provide tools to help programmers maintain data invariants in the
face of concurrent updates.”
• Applicability
Incrementally updating data
Computational output can be broken down into small pieces
Computation large in some dimension (data size, cpu, etc)
• Does it matter?
“...Converting the indexing system to an incremental system ... reduced the average document
processing latency by a factor of 100...”
19. Percolator: incremental processing
• BigTable plus...
Multi-row ACID transactions
snapshot isolation, lazy locks
up to 10s write latencies
Timestamps
start timestamp (read), commit timestamp (write)
Notifications
do not maintain invariants
Observer framework
your code is run upon notification of an update
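The notification/observer idea above can be sketched as a toy single-process loop, assuming a dirty-row set in place of Percolator's notification columns (all names here are illustrative; real Percolator layers ACID transactions and observers on Bigtable).

```python
# Toy sketch of Percolator-style notifications and observers.
table = {}             # row -> value
notifications = set()  # rows with unprocessed updates ("dirty" rows)

def write(row, value):
    table[row] = value
    notifications.add(row)   # leave a notification: this row changed

def run_observers(observer):
    # the framework scans for notified rows and invokes user code on
    # each; only changed rows are reprocessed, not the whole repository
    while notifications:
        row = notifications.pop()
        observer(row, table[row])

index = {}
def index_observer(row, value):
    # user code: incrementally maintain a derived index
    index[value] = row

write("doc1", "hadoop")
write("doc2", "dremel")
run_observers(index_observer)
```

This is what buys the factor-of-100 latency quoted above: each update triggers work proportional to the update, not to the size of the repository.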
23. Dremel: ad-hoc Query
• Scalable, interactive ad-hoc query system for read-only nested data
“...capable of running aggregation queries over trillion-row tables in seconds.”
• ... on nested data structures in situ
Web and scientific data is often non-relational
nested data (protocol buffers) underlies most structured data at Google
• Usage
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal1,100), COUNT(*) FROM t
• Applicability
Analysis of crawled documents
Tracking of install data for apps on Android Market
Crash reports
Spam analysis...
Dream BI Tool
24. Dremel: ad-hoc Query
• Ingredients
In situ data
SQL like interface
Serving trees for query execution
Column striped data (3-10x)
Analysis Catalogs
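The column-striping ingredient can be sketched in a few lines: store one contiguous array per field, so an aggregation query touches only the stripe it reads. This is a simplification I've constructed for illustration; Dremel's actual format also encodes repetition/definition levels to handle nested records.

```python
# Row layout: each record is a unit; a query over one field still
# touches (deserializes, pages in) every field of every record.
rows = [
    {"signal1": 10, "signal2": 7, "url": "a"},
    {"signal1": 30, "signal2": 2, "url": "b"},
    {"signal1": 20, "signal2": 9, "url": "c"},
]

# Column-striped layout: one contiguous array per field.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# SELECT TOP(signal1, 1) ... against each layout:
top_row = max(r["signal1"] for r in rows)   # scans whole records
top_col = max(columns["signal1"])           # scans one stripe only

assert top_row == top_col == 30
```

Reading one stripe instead of whole records (plus better compression of like-typed values) is where the 3-10x gain cited above comes from.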
25. Dremel: ad-hoc Query
Columns ~10x faster than records
26. Dremel: ad-hoc Query
Benchmark: MapReduce (via Sawzall) vs. Dremel (via SQL)
27. Dremel: ad-hoc Query
Significant Optimization Possible
Dremel ~100x Faster than Stock MR
28. Dremel: ad-hoc Query
Most Production Queries Executed in <10 seconds
30. Pregel: Big Graphs
• Massively parallel processing of big graphs
billions of vertices, trillions of edges
• Bulk synchronous parallel model
sequence of vertex oriented iterations
send/receive messages from other vertex computations
read/modify state of vertex, outgoing edges, graph topology
• Expressive, easy to program
distribution details hidden behind abstract API
• Iterative
computation continues until each vertex votes to terminate
• In production
PageRank 15 lines of code
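The bulk synchronous parallel model above can be sketched as a toy single-process PageRank in the same spirit as the "15 lines" the slide mentions. This is my illustration, not Pregel's API: real Pregel hashes vertices across workers, delivers messages between supersteps, and terminates when every vertex votes to halt (here, a fixed superstep count stands in for that).

```python
def pagerank(graph, supersteps=30, d=0.85):
    # graph: vertex -> list of outgoing edges
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # superstep: each vertex sends rank/out_degree along its edges
        inbox = {v: [] for v in graph}
        for v, edges in graph.items():
            for u in edges:
                inbox[u].append(rank[v] / len(edges))
        # then each vertex updates its own state from received messages
        rank = {v: (1 - d) / n + d * sum(inbox[v]) for v in graph}
    return rank

g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
r = pagerank(g)
```

The vertex only ever sees its own state and its inbox; that locality is what lets the real system scale the same program to billions of vertices.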
31. Pregel: Big Graphs
• Master “Name” node
connects processes for messaging
• Message Passing
no remote procedure calls or remote reads
• Graph hashed across nodes
vertex, outgoing edges stored in RAM
• Aggregators
global mechanism for aggregation
all but final reduce computed on node local data
• Checkpointing
configurable, enables automatic recovery
33. Pregel: Big Graphs
Near linear scaling to 1B nodes
34. Learn More
• Incremental Processing
Incremental, in-database map/reduce in Cloudant’s BigCouch
HBase 0.92 supports observers/coprocessors
Stream processing via Storm, HStreaming, etc.
• Ad Hoc Query
Google BigQuery
Column stores (Vertica, etc)
OpenDremel (stalled?)
?
• Big Graphs
Giraph on Hadoop (Apache Incubator)
Golden Orb (stalled?)
35. Lessons Learned
• Hire Jeff Dean and Sanjay Ghemawat
• GFS enables everything
• There is massive opportunity on the horizon