9. Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
• 10,000 CDs
• 5 million floppy disks
• 225 GB while I give this talk
11. Syslog?
• Started with syslog-ng
• As our volume grew, it didn't scale
• Resources overwhelmed
• Lost data
12. Scribe
• Surprise! FB had the same problem, built and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest (sketch below)
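As a concrete illustration, here is a minimal sketch of logging to Scribe from Java over its Thrift interface. The generated class names (scribe.thrift.scribe, scribe.thrift.LogEntry) and the non-strict protocol flags follow common Scribe client examples and are assumptions, not Twitter's actual code; Scribe agents conventionally listen on port 1463.

    import java.util.Arrays;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    // Classes generated by the Thrift compiler from scribe.thrift (names assumed)
    import scribe.thrift.LogEntry;
    import scribe.thrift.scribe;

    public class ScribeLogger {
      public static void main(String[] args) throws Exception {
        // Talk to the local Scribe agent; 1463 is the conventional port
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        transport.open();
        // Scribe clients typically use the non-strict binary protocol (assumption)
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport, false, false));

        // One log line, tagged with a category; Scribe handles routing and delivery
        client.Log(Arrays.asList(new LogEntry("web_requests", "GET /home 200 45ms")));
        transport.close();
      }
    }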
15. Scribe
[Diagram: FE nodes feed Agg nodes, which write to File/HDFS outputs]
• Runs locally; reliable in network outage
• Nodes only know their downstream writer; hierarchical, scalable
• Pluggable outputs: File, HDFS (example config below)
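The hierarchy and the pluggable outputs are wired up in each node's config file. A minimal sketch in the style of Scribe's example configs (hostname and path here are made up, and exact keys may vary by version): a buffer store forwards to the downstream aggregator over the network, and spills to local disk during a network outage.

    port=1463

    <store>
    category=default
    type=buffer

    <primary>
    type=network
    remote_host=agg.example.com
    remote_port=1463
    </primary>

    <secondary>
    type=file
    fs_type=std
    file_path=/var/tmp/scribe
    </secondary>
    </store>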
16. Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged from JavaScript, RoR, Scala, etc.
• We improved logging, monitoring, writing to Hadoop, compression
17. Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
22. How do you store 7 TB/day?
• Single machine?
• What's HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB (7,000,000 MB ÷ 80 MB/s ≈ 87,500 s)
• Uh oh.
24. Where do I put 7 TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
25. Hadoop
• Distributed file system
• Automatic replication, fault tolerance
• MapReduce-based parallel computation
• Key-value based computation interface allows for wide applicability
26. Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4,000-node cluster
• Powerful: sorted 1 TB of random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
33. MapReduce Workflow
[Diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs]
• Challenge: how many tweets per user, given tweets table?
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster (Java sketch below)
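A minimal Hadoop sketch of exactly this job. It assumes hypothetical tab-separated input lines whose first field is the user_id; the talk doesn't show the real tweets table layout.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TweetsPerUser {
      // Map: one input line per tweet; emit (user_id, 1)
      public static class TweetMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          userId.set(fields[0]);  // first field assumed to be user_id
          context.write(userId, ONE);
        }
      }

      // Reduce: the shuffle has grouped by user_id; sum the 1s into a count
      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweets-per-user");
        job.setJarByClass(TweetsPerUser.class);
        job.setMapperClass(TweetMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-sum on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Doubling the cluster roughly doubles the number of concurrent map and reduce tasks, which is why the job runs about 2x faster with 2x machines.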
35. Two Analysis Challenges
1. Compute friendships in Twitter's social graph
• grep, awk? No way.
• Data is in MySQL... self-join on an n-billion-row table?
• n,000,000,000 x n,000,000,000 = ?
• I don't know either.
36. Two Analysis Challenges
2. Large-scale grouping and counting
• select count(*) from users? maybe.
• select count(*) from tweets? uh...
• Imagine joining them.
• And grouping.
• And sorting.
38. Back to Hadoop
• Didn't we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
39. Analysis at Scale
• Now we're rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble social graph aggregates
• Run PageRank across users to calculate reputations
40. But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
49. Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable (Pig sketch below)
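For comparison, the same tweets-per-user job as a Pig Latin sketch; the input path and schema are hypothetical stand-ins, not Twitter's actual tables. Pig compiles this to essentially the same Map/Shuffle/Reduce plan as the Java version above, which is where the "5% of the code" claim comes from.

    -- Load tweets (hypothetical path and schema; default tab-delimited storage)
    tweets  = LOAD 'tweets' AS (user_id: long, tweet_id: long, text: chararray);
    -- Group by user and count: these two lines replace the whole Java job
    grouped = GROUP tweets BY user_id;
    counts  = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS n;
    STORE counts INTO 'tweets_per_user';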
52. One Thing I've Learned
• It's easy to answer questions.
• It's hard to ask the right questions.
• Value the system that promotes innovation and iteration
• More minds contributing = more value from your data
58. Counting Big Data
• How many requests per day?
• Average latency? 95th-percentile latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
63. Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
69. Research on Big Data
• What can we tell from a user's tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
70. If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra