Awesome! Traffic to your site is really picking up and everything is lookin’ good. Well, except for that database back in the corner, but it will hold… right? No one really wants to deal with scaling the database tier, but hopefully your customers will drag you (perhaps kicking and screaming) to some sort of distributed database architecture.
This talk is all about scaling MySQL through hardware optimizations and sharding from a Site Engineering perspective. This includes real-world examples of finding pain points, identifying risks, and evaluating cloud vs. hardware scaling. I’ll also discuss distributed database management, dealing with data purging, making consistent backups, and how to keep the site up when things go bad.
Getting 100B Metrics to Disk
1. 194B
GETTING 100B METRICS TO DISK
Jonathan Thurman - Site Reliability Engineer
@jthurman42
http://www.flickr.com/photos/meteopassione/9157134653/
2. NEW RELIC
• Performance Monitoring
• Web Apps
• Mobile Apps
• Servers
• Databases, Caches & More…
• Software Analytics
3. OKAY, YOU
COLLECT DATA
• 194 Billion Metrics
• 100,000 req/sec
• 2 Gbps Inbound
• 216 Terabytes
• All backed by MySQL
http://www.flickr.com/photos/bobsfever/6658919861/
4. HOW WE GOT HERE
http://www.flickr.com/photos/auvet/853157494/
5. BUILDING BLOCKS
• Hosted Environment
• Xen Virtual Machines
• Data storage
• ATA over Ethernet
• SATA drives
• MySQL 5.0
• Single Ruby on Rails Application
http://www.flickr.com/photos/riekhavoc/4648423297/
6. SHARDING FROM
INCEPTION
• Account Information
• Read heavy
• Single HA Instance
• Agent Data
• Write heavy
• 8 shards based on AccountId
http://www.flickr.com/photos/erikb/48221952/
7. TALE OF
TWO MODELS
• Ruby on Rails
• class ShardData < ActiveRecord::Base
• Look up shard for Account
• Override ConnectionHandler
http://www.flickr.com/photos/jungle_boy/140279885/
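The "look up shard for Account" step can be sketched without ActiveRecord. This is only an illustration: `ShardRouter`, the shard numbers, and the connection names are hypothetical, and the talk's actual `ShardData` class and `ConnectionHandler` override are not shown in the slides.

```ruby
# Hypothetical stand-in for the "look up shard for Account" step.
# Not the talk's ShardData class: no ActiveRecord, no real connections.
class ShardRouter
  class UnknownAccount < StandardError; end

  # assignments: { account_id => shard_number }
  # connections: { shard_number => connection name }
  def initialize(assignments, connections)
    @assignments = assignments
    @connections = connections
  end

  # Returns the connection name that holds this account's agent data.
  def connection_for(account_id)
    shard = @assignments.fetch(account_id) do
      raise UnknownAccount, "no shard recorded for account #{account_id}"
    end
    @connections.fetch(shard)
  end
end

router = ShardRouter.new(
  { 72 => 3, 1001 => 8 },
  { 3 => "agent_shard_3", 8 => "agent_shard_8" }
)
puts router.connection_for(72) # prints agent_shard_3
```

In a Rails app the returned name would presumably select which database connection the model uses for that request.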
8.
9. TRIBBLES TABLES
• Metric table name contains
• AccountID
• Year and Julian Day
• Resolution
• ts_72_13221_1h
• Currently ~200k tables per DB
http://www.flickr.com/photos/15942690@N00/4571141076/
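A minimal sketch of how such a table name could be built. The field layout (account ID, two-digit year plus three-digit day of year, resolution) is inferred from the `ts_72_13221_1h` example and is an assumption; `metric_table_name` is a hypothetical helper.

```ruby
require "date"

# Builds a per-account, per-day metric table name.
# Assumed layout (inferred from the slide's example, not confirmed):
#   ts_<account id>_<2-digit year + 3-digit day of year>_<resolution>
def metric_table_name(account_id, date, resolution)
  "ts_#{account_id}_#{date.strftime('%y%j')}_#{resolution}"
end

puts metric_table_name(72, Date.new(2013, 8, 9), "1h") # prints ts_72_13221_1h
```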
10. BINGE AND PURGE
• Purging data
• DELETE FROM …
• DROP TABLE …
• innodb_file_per_table
• innodb_lazy_drop_table
(pre 5.5.30-30.2)
http://www.flickr.com/photos/exalthim/2261294871/
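With one table per account per day, purging can be done by dropping whole tables instead of deleting rows. A rough sketch that turns a table list into `DROP TABLE` statements, assuming the `ts_<account>_<yyddd>_<resolution>` layout and years in the 2000s; `table_day` and `purge_statements` are hypothetical helpers.

```ruby
require "date"

# Parses the assumed ts_<account>_<yyddd>_<resolution> layout and returns
# the day a table covers, or nil if the name does not match.
def table_day(name)
  m = /\Ats_\d+_(\d{2})(\d{3})_\w+\z/.match(name)
  return nil unless m
  Date.ordinal(2000 + m[1].to_i, m[2].to_i)
end

# Returns one DROP TABLE statement per table older than the cutoff.
def purge_statements(table_names, cutoff)
  table_names
    .select { |name| (day = table_day(name)) && day < cutoff }
    .map { |name| "DROP TABLE `#{name}`" }
end

tables = ["ts_72_13221_1h", "ts_72_13100_1h", "accounts"]
puts purge_statements(tables, Date.new(2013, 8, 1))
# prints DROP TABLE `ts_72_13100_1h`
```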
12. GROWING PAINS
http://www.flickr.com/photos/aigle_dore/5626285743/
13. MULTIPLE POINTS
OF FAILURE
• Single shard slows down
• App servers wait for response
• DB connection pool becomes full
• Site goes down
http://www.flickr.com/photos/boston_public_library/8204384670/
14. SHARDGUARD
• Monitor all databases
• Identify shard status:
• Bad? Mark as “wedged”
• Good? Clear “wedged” flag
• ShardData checks status!
http://www.flickr.com/photos/mac_filko/5486980804/
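The "wedged" flag idea can be sketched as a small status registry that callers consult before touching a shard. `ShardStatus` and its methods are hypothetical; in the talk a separate monitor sets and clears the flag, which is done by hand here.

```ruby
# Hypothetical sketch of a "wedged" flag registry. In the talk a separate
# monitor sets the flag; here the caller sets it by hand.
class ShardStatus
  class Wedged < StandardError; end

  def initialize
    @wedged = {}
  end

  def mark_wedged(shard)
    @wedged[shard] = true
  end

  def clear(shard)
    @wedged.delete(shard)
  end

  def wedged?(shard)
    @wedged.key?(shard)
  end

  # Runs the block only if the shard is healthy; otherwise fails fast
  # instead of tying up a connection waiting on a slow database.
  def with_shard(shard)
    raise Wedged, "shard #{shard} is wedged" if wedged?(shard)
    yield shard
  end
end

status = ShardStatus.new
status.mark_wedged(4)
puts status.with_shard(2) { |s| "queried shard #{s}" } # prints queried shard 2
begin
  status.with_shard(4) { |s| "queried shard #{s}" }
rescue ShardStatus::Wedged => e
  puts e.message # prints shard 4 is wedged
end
```

Failing fast keeps one slow shard from filling the connection pool for every other shard.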
15. STABILITY AND
PERFORMANCE
• Degraded performance
• New Accounts => Shard 9!
• Old accounts remain as-is
http://www.flickr.com/photos/ejpphoto/7823027272/
16. DATA COLLECTION
• Rails isn’t great for data collection
• Ruby isn’t great either…
• Rewritten in Java using Jetty
http://www.flickr.com/photos/autograt/224540606/
18. INSERT INTO
(SELECT …
• Select rows and re-process
• Cache last hour in Java’s Heap
• Write a journal and post-process it
http://www.flickr.com/photos/esoteric_13/4741001804/
19. READ / WRITE
PROBLEM
• Sequential Inserts
• Batched in 5k chunks
• Optimize for Throughput
• Must complete < 1 minute
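Batching in 5k chunks might look roughly like this. The column names and the integer-only values are assumptions made to keep the sketch short; `batched_inserts` is a hypothetical helper, and the production collector described in the talk is written in Java.

```ruby
# Splits rows into fixed-size batches and renders one multi-row INSERT per
# batch. Table and column names are hypothetical; values are assumed to be
# integers, so no quoting or escaping is shown.
BATCH_SIZE = 5_000

def batched_inserts(table, rows, batch_size = BATCH_SIZE)
  rows.each_slice(batch_size).map do |batch|
    values = batch.map { |metric_id, value| "(#{Integer(metric_id)}, #{Integer(value)})" }
    "INSERT INTO `#{table}` (metric_id, value) VALUES #{values.join(', ')}"
  end
end

rows = (1..12_000).map { |i| [i, i * 2] }
statements = batched_inserts("ts_72_13221_1h", rows)
puts statements.length # prints 3
```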
20. READ / WRITE
PROBLEM
• Scattered Reads
• Optimized for Latency
• Unique Covering Indexes
21. MOVE TO
HARDWARE
• Instant performance!
• Just add…
• Datacenter - Chicago, US
• Servers - Dell
• Storage - Direct Attached
• Time - About 6 months
http://www.flickr.com/photos/zebble/9621007/
23. THE GREAT
EXPANSE
• MD1200s support 12 disks
• Add four more!
• Online RAID expansion
http://www.flickr.com/photos/aigle_dore/5853807037/
24. #FAIL
• “On-line” expansion, not so much
• Added second 4 disk RAID 5
• LVM Concatenation for space
http://www.flickr.com/photos/fireflythegreat/2845637227/
25. NEED MORE
CAPACITY
• Tight on disk space
• Performance not an issue
• New Accounts => Shard 10!
• Old Accounts as-is
http://www.flickr.com/photos/seandreilinger/6289721616/
26.
27. SHARD PITFALLS
http://www.flickr.com/photos/21206761@N00/469110140/
28. MIGRATION
PROBLEM
• Accounts cannot move
• Not all tables have the shard key
• Rails defaults to auto-increment IDs
• Massive primary key collisions
• Punt and move the metrics
http://www.flickr.com/photos/tzafrir/125380911/
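Why auto-increment IDs collide when rows move between shards, as a toy illustration. The rows and account numbers below are made up.

```ruby
# Each shard hands out auto-increment ids independently, so the same id
# exists on several shards. Row contents here are made up.
shard_a = { 1 => "note for account 72", 2 => "note for account 90" }
shard_b = { 1 => "note for account 15", 2 => "note for account 33", 3 => "note for account 41" }

# Ids that would collide if shard_b's rows were copied into shard_a as-is.
colliding_ids = shard_a.keys & shard_b.keys
puts colliding_ids.inspect # prints [1, 2]
```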
29. BREAKING UP IS
HARD TO DO
• Agent Databases
• Metadata / Notes / Errors
• Timeslice Databases
• Time-series metric data
• 1 Minute and 1 Hour resolution
http://www.flickr.com/photos/rsepulveda/4275236049/
30.
31. RESOURCE POOLS
• Distributed by Shard Key
• Distribution can CHANGE
• Lookup table, not hash
• Data can be MOVED
http://www.flickr.com/photos/dclark3996/4971906528/
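A small comparison of modulo hashing with a lookup table when a ninth shard is added. The account IDs and the initial assignment are illustrative assumptions, not the talk's data.

```ruby
# Compares modulo hashing with an explicit lookup table when a shard is
# added. Account ids and shard counts are illustrative only.
accounts = (1..1_000).to_a

# Modulo hashing: the shard is derived from the id, so changing the shard
# count silently reassigns most accounts.
moved_by_hash = accounts.count { |id| id % 8 != id % 9 }

# Lookup table: the shard is stored per account, so adding shard 9 changes
# nothing until an account is explicitly assigned or moved.
lookup = accounts.to_h { |id| [id, (id % 8) + 1] }
before = lookup.dup
lookup[1_001] = 9 # a new account goes to the new shard
lookup[42] = 9    # one existing account is moved on purpose
moved_by_lookup = before.count { |id, shard| lookup[id] != shard }

puts moved_by_hash   # prints 889
puts moved_by_lookup # prints 1
```

With the lookup table, only the accounts someone chooses to move ever change shards.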
32. BACKUPS
• Custom mysqldump wrapper
• Based on business need
• Backup per table
• Ignore tables to be purged
http://www.flickr.com/photos/usdagov/6896218334/
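The custom wrapper itself is not shown in the slides. As a rough sketch of the per-table idea, the following builds one `mysqldump` invocation per table and skips anything on a purge list. The database name, table names, and the use of `--single-transaction` are assumptions; `backup_commands` is a hypothetical helper.

```ruby
# Builds one mysqldump command per table, skipping tables already slated
# for purging. The database name, table names, and the --single-transaction
# choice are assumptions; the talk's wrapper is not shown in the slides.
def backup_commands(database, tables, purge_list)
  (tables - purge_list).map do |table|
    ["mysqldump", "--single-transaction", database, table]
  end
end

commands = backup_commands(
  "timeslices",
  ["ts_72_13221_1h", "ts_72_13100_1h"],
  ["ts_72_13100_1h"]
)
commands.each { |cmd| puts cmd.join(" ") }
# prints mysqldump --single-transaction timeslices ts_72_13221_1h
```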
34. SSD REVOLUTION
• 600GB Intel 320 SSDs
• Dell MD1220 Direct Attached shelf
• Disks are no longer the bottleneck
• Inserts in Read-optimized order
are “fast enough”
35. YOU CAN USE SSD
WITH DATABASES
• 6 of 420 drives RMA’d
• March 2012 to Aug 2013
• Average 180TB lifetime writes
• 91% wear remaining
http://www.flickr.com/photos/joeshlabotnik/3584172834/
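Back-of-envelope arithmetic on those figures. The projection assumes wear scales linearly with bytes written, which is a simplification and not something the slides claim.

```ruby
# Back-of-envelope figures from the slide's numbers.
# Assumes wear scales linearly with bytes written, which is a
# simplification and not a claim made in the talk.
drives_total = 420
drives_rmad = 6
avg_written_tb = 180
wear_remaining_pct = 91

rma_rate_pct = (drives_rmad * 100.0 / drives_total).round(2)
wear_used_pct = 100 - wear_remaining_pct
projected_total_tb = avg_written_tb * 100 / wear_used_pct

puts rma_rate_pct       # prints 1.43
puts projected_total_tb # prints 2000
```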
36. REDUNDANT ARRAY
OF EXPENSIVE DISKS
• Rebuilds under load > 4 hours
• Migrated to RAID 60
• 2 x 12 disk span
• Ditch the Hot-spares
http://www.flickr.com/photos/mbk/27640225/
38. SHARDGUARD
PART DEUX
• Protect all the things!
• Kill UI queries over 75 seconds
• Kill background queries over 1 hour
• Yes, all of them
• No really, kill them, now
http://www.flickr.com/photos/chiky/7194089194/
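The kill rules can be sketched as a filter over process-list style rows. The `:kind` field is hypothetical, since the slides do not say how UI and background queries are told apart; `kill_statements` is likewise an illustrative helper.

```ruby
# Picks queries to kill from process-list style rows.
# The :kind field is hypothetical: how UI and background queries are told
# apart is not stated in the slides.
LIMITS = { ui: 75, background: 3_600 }.freeze # seconds

def kill_statements(processes)
  processes
    .select { |row| row[:seconds] > LIMITS.fetch(row[:kind]) }
    .map { |row| "KILL #{row[:id]}" }
end

processes = [
  { id: 11, kind: :ui, seconds: 80 },
  { id: 12, kind: :ui, seconds: 10 },
  { id: 13, kind: :background, seconds: 4_000 },
  { id: 14, kind: :background, seconds: 600 }
]
puts kill_statements(processes)
# prints KILL 11 and KILL 13, one per line
```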
39. IF YOU DON’T
BELIEVE ME…
• Delayed Job
• Long running background query
• InnoDB History List Traversal
40. TO INFINITY AND BEYOND
http://www.flickr.com/photos/temma2/1149223191/