14. THE NUMBERS
• Machines
• HBase
• 60 Machines as RegionServers
• 1 HMaster
• 3 ZooKeeper nodes
15. THE NUMBERS
• Machines
• Hadoop
• 135 Machines divided into 2 clusters
• DataNodes/TaskTrackers
• NameNodes with High-Availability failover
• 1 JobTracker each
16. THE NUMBERS
• Machines
• DL380 Gen8
• 2 × Intel Xeon E5646 @ 2.40 GHz (24 cores total)
• 48 GB RAM
• 6 × 2 TB disks in JBOD (small partition on the first disk for the OS; the rest is storage)
• 1 Gigabit network links
17. THE NUMBERS
• Data
• Average load of 7,500 interactions per second
• Peak loads of 15,000 interactions per second sustained over a minute
• Peak of 21,000 interactions per second during the Super Bowl
• Total current capacity ~1.6 PB; total current usage ~800 TB
• Average interaction size of 2 KB – that's ~1 GB a minute, or ~2 TB a day with replication (RF = 3)
• And that’s not it!
18. THE USE CASES
• HBase
• Recordings
• Archive
• Map/Reduce
• Exports
• Historics
• Migration
19. THE USE CASES
• Recordings
• User defined streams
• Stored in HBase for later retrieval
• Export to multiple output formats and stores
• <recording-id><interaction-uuid>
• Recording-id is a SHA-1 hash
• Allows recordings to be distributed by their key without generating hot-spots (row-key sketch below)
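A minimal sketch of how such a composite row key could be assembled, assuming a small helper of our own (the class and method names are hypothetical illustrations, not DataSift's code): 20 bytes of SHA-1 over the recording identifier, followed by the 16-byte interaction UUID.

// Hypothetical sketch of the <recording-id><interaction-uuid> row key described above.
// The SHA-1 prefix spreads recordings evenly across regions; the UUID suffix keeps each
// interaction unique within its recording.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.UUID;

public class RecordingRowKey {

    // Build a 36-byte row key: 20 bytes of SHA-1(recording id) + 16 bytes of interaction UUID.
    public static byte[] buildRowKey(String recordingId, UUID interactionId) throws Exception {
        byte[] recordingHash = MessageDigest.getInstance("SHA-1")
                .digest(recordingId.getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.allocate(recordingHash.length + 16);
        key.put(recordingHash);
        key.putLong(interactionId.getMostSignificantBits());
        key.putLong(interactionId.getLeastSignificantBits());
        return key.array();
    }
}

Because the SHA-1 output is effectively uniform, writes for different recordings land on different regions, while all interactions of one recording stay contiguous and can be read back with a single prefix scan.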
21. THE USE CASES
• Exporter
• Export data from HBase for customer
• Export files of ~5–10 GB, or ~3–6 million records
• MR over HBase using TableInputFormat
• But the data needs to be sorted
• TotalOrderPartitioner (job-setup sketch below)
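A sketch of what such an export job setup could look like, using the standard Hadoop/HBase MapReduce APIs; the table name, mapper, partition-file path, sampler settings and reducer count are illustrative assumptions, not DataSift's configuration.

// Hypothetical Exporter job: MapReduce over HBase via TableInputFormat, with
// TotalOrderPartitioner so the exported files come out globally sorted by row key.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class ExportJob {

    // Emits each row key plus a textual rendering of the row; a real exporter would
    // serialise the stored interaction into the required output format instead.
    public static class ExportMapper extends TableMapper<ImmutableBytesWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(row.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-export");
        job.setJarByClass(ExportJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);         // batch rows per RPC for the long scan
        scan.setCacheBlocks(false);   // don't churn the block cache during a full scan

        TableMapReduceUtil.initTableMapperJob(
                "recordings", scan, ExportMapper.class,
                ImmutableBytesWritable.class, Text.class, job);

        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(32);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        // Globally sort across reducers: sample row keys to build the partition file,
        // then hand it to TotalOrderPartitioner.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/export-partitions"));
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<ImmutableBytesWritable, Result>(0.01, 1000));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The sampling pass reads a small fraction of the row keys up front to pick the partition boundaries, so every reducer receives a contiguous, already-sorted slice of the key space.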
24. THE USE CASES
• Twitter Import
• 2 years of Tweets
• About 95,000,000,000 tweets
• Over 300 TB with added augmentation
• Import was not as simple as you would imagine
25. THE USE CASES
• Archive
• Not just the Firehose but the Ultrahose
• Stored in HBase as well
• The HBase architecture (BigTable) creates hot-spots with time-series data
• Leading randomizing bit (see HBaseWD; salting sketch below)
• Pre-split regions
• Concurrent writes
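A sketch of the leading-randomizing-bit idea, written against plain Java rather than the HBaseWD API; the bucket count and class name are assumptions for illustration only.

// Illustrative key salting in the spirit of HBaseWD: prepend a small bucket prefix derived
// from the key itself, so time-ordered writes fan out across pre-split regions instead of
// all hammering the newest region.
public class SaltedArchiveKey {

    private static final int BUCKETS = 16;   // assumed bucket count; keep in sync with the pre-split

    // Salt a time-series key: one prefix byte chosen by hashing the original key.
    public static byte[] salt(byte[] originalKey) {
        byte bucket = (byte) ((java.util.Arrays.hashCode(originalKey) & 0x7fffffff) % BUCKETS);
        byte[] salted = new byte[originalKey.length + 1];
        salted[0] = bucket;
        System.arraycopy(originalKey, 0, salted, 1, originalKey.length);
        return salted;
    }

    // Split points for pre-creating the table (e.g. via HBaseAdmin.createTable(desc, splits)):
    // one region per bucket prefix, so concurrent writers start out spread across the cluster.
    public static byte[][] splitPoints() {
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        return splits;
    }
}

The trade-off sits on the read side: a time-range scan now has to be issued once per bucket and the results merged, which is the part HBaseWD's distributed scanner automates.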
26. THE USE CASES
• Historics
• Export archive data
• Slightly different from Exporter
• Much larger timelines (1–3 months)
• Controlled access to Hadoop cluster with efficient job scheduling
• Unfiltered input data
• Therefore longer processing times
• Hence more optimizations required (scan sketch below)
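As a rough illustration of the scan-level tuning a month-scale export tends to need, here is a minimal sketch; the time bounds and tuning values are assumptions, not DataSift's actual settings.

// A hypothetical Scan configuration for a long Historics export over the archive table.
import org.apache.hadoop.hbase.client.Scan;

public class HistoricsScan {

    // Bound the scan by HBase cell timestamp and keep the RegionServers healthy while
    // millions of rows stream through MapReduce.
    public static Scan buildScan(long startMillis, long endMillis) throws java.io.IOException {
        Scan scan = new Scan();
        scan.setTimeRange(startMillis, endMillis); // e.g. a 1-3 month window
        scan.setCaching(1000);        // larger batches per RPC for a throughput-bound job
        scan.setCacheBlocks(false);   // a one-off bulk read should not evict the block cache
        scan.setMaxVersions(1);       // only the latest version of each interaction
        return scan;
    }
}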
28. THE LESSONS
• Tune, tune, tune (defaults == BAD)
• Based on the use case, tune (table-definition sketch after this list):
• Heap
• Block size
• Memstore size
• Keep the number of column families low
• Be aware of the hot-spotting issue when writing time-series data
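A hypothetical table definition reflecting this advice, using the HBase admin API; the family name, block size and flush size are illustrative, and cluster-wide heap and global memstore limits would live in hbase-env.sh / hbase-site.xml rather than here.

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;

public class TunedTable {

    // One short-named column family, an explicit HFile block size, Snappy compression
    // (next slide), and a per-table memstore flush threshold.
    public static HTableDescriptor interactionsTable() {
        HColumnDescriptor family = new HColumnDescriptor("d");    // keep column families few and short
        family.setBlocksize(64 * 1024);                           // HFile block size, tuned to the read pattern
        family.setCompressionType(Compression.Algorithm.SNAPPY);  // cheap CPU for a large I/O saving
        family.setMaxVersions(1);                                 // interactions are written once

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("interactions"));
        table.addFamily(family);
        table.setMemStoreFlushSize(256L * 1024 * 1024);           // flush this table's memstores at 256 MB
        return table;
    }
}

The descriptor would then be passed to HBaseAdmin.createTable(...), together with pre-split points for time-series tables as in the archive sketch above.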
29. THE LESSONS
• Use compression (e.g. Snappy)
• Ops need an intimate understanding of the system
• Monitor system metrics (GC, CPU, compactions, I/O) and application metrics (writes/sec, etc.)
• Don't be afraid to fiddle with the HBase code
• Using a distribution is advisable
30. QUESTIONS?
We are hiring
http://datasift.com/about-us/careers