From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Couchbase Meetup Jan 2016
1.
2. Michael Kehoe
Senior Site Reliability Engineer
LinkedIn
LinkedIn’s Big Data Pipeline with Kafka, Hadoop and Couchbase
3. $ whoami
Michael Kehoe
• Sr Site Reliability Engineer (SRE)
• Member of CBVT
• B.E. (Electrical Engineering) from the University of Queensland, Australia
4. Kafka @ LinkedIn
• Kafka was created by LinkedIn
• Kafka is a publish-subscribe messaging system built as a distributed commit log
• Processes 500+ TB/day (~500 billion messages)
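To make the "publish-subscribe as a commit log" idea concrete, here is a minimal in-memory sketch. It is not Kafka's API or implementation; the `CommitLog` class and its `publish`/`poll` methods are hypothetical names chosen for illustration. It only shows the core idea: messages are appended once, and each consumer reads independently by tracking its own offset.

```python
from collections import defaultdict


class CommitLog:
    """A toy append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._records = []                 # records are only ever appended
        self._offsets = defaultdict(int)   # consumer name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = CommitLog()
log.publish("page_view:/jobs")
log.publish("click:apply_button")

monitoring_batch = log.poll("monitoring")   # reads both messages
analytics_batch = log.poll("analytics")     # independently reads the same two
monitoring_again = log.poll("monitoring")   # nothing new for this consumer
```

Because reading does not remove anything from the log, one stream of events can feed several consumers (monitoring, analytics, and so on) at their own pace.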
5. LinkedIn’s use of Kafka
• Monitoring
• Pub-Sub Messaging
• Analytics
• Building block for (log-based) distributed applications
• Samza
• Espresso
• Pinot
6. Kafka to Hadoop (Analytics)
Use Case
• LinkedIn tracks data to better understand how members use our products
• Information such as which page was viewed and which content was clicked on is sent to a Kafka cluster in each data center
• Some of these events are centrally collected and pushed onto our Hadoop grid for analysis and daily report generation
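The flow above can be sketched in a few lines, under heavy assumptions. The event fields (`type`, `page`, `ts`), the data center names, and the `collect` and `daily_page_views` helpers are all hypothetical; the real pipeline uses Kafka clusters and Hadoop jobs rather than in-memory lists. The sketch only illustrates the shape of the work: merge per-data-center events centrally, then aggregate them into a daily report.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical tracking events as they might sit in each data center's cluster.
events_by_datacenter = {
    "dc-a": [
        {"type": "page_view", "page": "/jobs", "ts": 1452816000},
        {"type": "click", "page": "/jobs", "ts": 1452816060},
    ],
    "dc-b": [
        {"type": "page_view", "page": "/jobs", "ts": 1452819600},
        {"type": "page_view", "page": "/feed", "ts": 1452906000},
    ],
}


def collect(events_by_dc):
    """Merge events from every data center into one time-ordered list."""
    merged = [e for events in events_by_dc.values() for e in events]
    return sorted(merged, key=lambda e: e["ts"])


def daily_page_views(events):
    """Count page views per (UTC date, page)."""
    counts = Counter()
    for e in events:
        if e["type"] == "page_view":
            day = datetime.fromtimestamp(e["ts"], tz=timezone.utc).date().isoformat()
            counts[(day, e["page"])] += 1
    return counts


report = daily_page_views(collect(events_by_datacenter))
```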
7. Couchbase @ LinkedIn
• About 80 separate services with one or more clusters in multiple data centers
• Up to ~70 servers in a cluster
• Single & Multi-tenant clusters
8. Hadoop to Couchbase
• Our primary use case for Hadoop to Couchbase is building (warming) and restoring Couchbase buckets
• LinkedIn built its own in-house solution to work with our ETL processes, etc.
9. Jobs Cluster
Clusters & Numbers
• Used for read scaling: > 150k QPS on 27-node clusters
• We use Hadoop to pre-build data by partition
• Average Couchbase latency is 2-3 ms
• 99th-percentile latency is ~8-12 ms
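As a rough illustration of what "pre-building data by partition" can mean, the sketch below groups documents by a stable hash of their key so that each group could be loaded as a unit. This is an assumption-laden toy: the partition count, the `partition_for` and `prebuild` helpers, and the plain CRC32-modulo mapping are hypothetical, and this is not Couchbase's actual vBucket mapping or LinkedIn's in-house tooling.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; chosen small for illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a document key to a partition with a stable CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def prebuild(documents, num_partitions=NUM_PARTITIONS):
    """Group (key, value) pairs by partition so each group can be loaded together."""
    partitions = defaultdict(dict)
    for key, value in documents:
        partitions[partition_for(key, num_partitions)][key] = value
    return partitions


docs = [(f"job::{i}", {"title": f"Job {i}"}) for i in range(100)]
partitions = prebuild(docs)
```

A stable hash matters here: the same key always lands in the same partition, so a batch job and the serving cluster can agree on where each document belongs.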