9. Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
• 10,000 CDs
• 5 million floppy disks
• 225 GB while I give this talk
11. Syslog?
• Started with syslog-ng
• As our volume grew, it didn't scale
• Resources overwhelmed
• Lost data
12. Scribe
• Surprise! FB had the same problem, built and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest (sketch below)
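As a concrete illustration, here is a minimal sketch of logging to Scribe from Java over its Thrift interface. The generated class names (scribe.thrift.scribe, scribe.thrift.LogEntry) and the non-strict protocol flags follow common Scribe client examples and are assumptions, not Twitter's actual code; Scribe agents conventionally listen on port 1463.

    import java.util.Arrays;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    // Classes generated by the Thrift compiler from scribe.thrift (names assumed)
    import scribe.thrift.LogEntry;
    import scribe.thrift.scribe;

    public class ScribeLogger {
      public static void main(String[] args) throws Exception {
        // Talk to the local Scribe agent; 1463 is the conventional port
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        transport.open();
        // Scribe clients typically use the non-strict binary protocol (assumption)
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport, false, false));

        // One log line, tagged with a category; Scribe handles routing and delivery
        client.Log(Arrays.asList(new LogEntry("web_requests", "GET /home 200 45ms")));
        transport.close();
      }
    }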
15. Scribe
[Diagram: FE nodes feed Agg nodes, which write to File/HDFS outputs]
• Runs locally; reliable in network outage
• Nodes only know their downstream writer; hierarchical, scalable
• Pluggable outputs: File, HDFS (example config below)
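The hierarchy and the pluggable outputs are wired up in each node's config file. A minimal sketch in the style of Scribe's example configs (hostname and path here are made up, and exact keys may vary by version): a buffer store forwards to the downstream aggregator over the network, and spills to local disk during a network outage.

    port=1463

    <store>
    category=default
    type=buffer

    <primary>
    type=network
    remote_host=agg.example.com
    remote_port=1463
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/var/tmp/scribe
    </secondary>
    </store>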
16. Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged from JavaScript, RoR, Scala, etc.
• We improved logging, monitoring, writing to Hadoop, compression
17. Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
22. How do you store 7 TB/day?
• Single machine?
• What's HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB (7,000,000 MB ÷ 80 MB/s ≈ 87,500 s)
• Uh oh.
24. Where do I put 7 TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
25. Hadoop
• Distributed file system
• Automatic replication, fault tolerance
• MapReduce-based parallel computation
• Key-value based computation interface allows for wide applicability
26. Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4,000-node cluster
• Powerful: sorted 1 TB of random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
33. MapReduce Workflow
[Diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs]
• Challenge: how many tweets per user, given tweets table?
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster (Java sketch below)
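A minimal Hadoop sketch of exactly this job. It assumes hypothetical tab-separated input lines whose first field is the user_id; the talk doesn't show the real tweets table layout.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TweetsPerUser {
      // Map: one input line per tweet; emit (user_id, 1)
      public static class TweetMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          userId.set(fields[0]);  // first field assumed to be user_id
          context.write(userId, ONE);
        }
      }

      // Reduce: the shuffle has grouped by user_id; sum the 1s into a count
      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweets-per-user");
        job.setJarByClass(TweetsPerUser.class);
        job.setMapperClass(TweetMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-sum on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Doubling the cluster roughly doubles the number of concurrent map and reduce tasks, which is why the job runs about 2x faster with 2x machines.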
35. Two Analysis Challenges
1. Compute friendships in Twitter's social graph
• grep, awk? No way.
• Data is in MySQL... self-join on an n-billion-row table?
• n,000,000,000 x n,000,000,000 = ?
• I don't know either.
36. Two Analysis Challenges
2. Large-scale grouping and counting
• select count(*) from users? maybe.
• select count(*) from tweets? uh...
• Imagine joining them.
• And grouping.
• And sorting.
38. Back to Hadoop
• Didn't we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
39. Analysis at Scale
• Now we're rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble social graph aggregates
• Run PageRank across users to calculate reputations
40. But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
49. Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable (Pig sketch below)
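For comparison, the same tweets-per-user job as a Pig Latin sketch; the input path and schema are hypothetical stand-ins, not Twitter's actual tables. Pig compiles this to essentially the same Map/Shuffle/Reduce plan as the Java version above, which is where the "5% of the code" claim comes from.

    -- Load tweets (hypothetical path and schema; default tab-delimited storage)
    tweets  = LOAD 'tweets' AS (user_id: long, tweet_id: long, text: chararray);
    -- Group by user and count: these two lines replace the whole Java job
    grouped = GROUP tweets BY user_id;
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS n;
    STORE counts INTO 'tweets_per_user';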
52. One Thing I've Learned
• It's easy to answer questions.
• It's hard to ask the right questions.
• Value the system that promotes innovation and iteration
• More minds contributing = more value from your data
58. Counting Big Data
• How many requests per day?
• Average latency? 95th-percentile latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
63. Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
69. Research on Big Data
• What can we tell from a user's tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
70. If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra