2016-JAN-28 -- High Performance Production Databases on Ceph
1. Medallia © Copyright 2015. Confidential. 1
High-Performance Production
Databases on Ceph
2. Medallia © Copyright 2016. 2
At Medallia, we collect, analyze, and display terabytes of structured &
unstructured feedback for our multibillion dollar clients in real time.
And what's more: we have a lot of fun doing it.
I've been at Medallia since 2010, growing from 70 to 700 employees.
Who are you?
Hi, I'm Thorvald, Architect @ Medallia
3. Medallia © Copyright 2015. Confidential. 3
AGENDA
1. The Dream
2. Networking/Storage Mobility
3. Provisioning/Orchestration
4. Demo!
5. Real-world performance
6. Challenges and next steps
4. Medallia © Copyright 2015. Confidential. 4
Tech Industry Speak for "Last Year"
• New version of our analytics engine
– Dream: Horizontally super-scalable! 1000s of servers!
• Reality: Peeking at Production
– 100s of servers
– .. that had individual names
– .. almost, but not quite, entirely unlike each other
– .. manual service placement
.. and server placement
– "Don't touch it"
A long, long time ago...
5. Medallia © Copyright 2015. Confidential. 5
• Skip 2-3 generations and go direct to "next gen"
– Microservices, Containers, <insert buzz-word here>
• Proof-of-Concept using 40GbE, Ceph, Docker
– Resilient enough that it's a problem to test resiliency
– Performant enough to replace dedicated servers
• Can we run everything on this new infrastructure?
Rapid Evolution Time!
Jump into the future
6. Medallia © Copyright 2016. 6
Design Goals
Keep it SIMPLE
• Commodity Components & Supported Open Standards
• Fully automated provisioning and reinstall
• Cheap & Scalable
• Immutable Servers
– No service that is tied to specific hardware
– Every component must be able to run anywhere
– Redundancy at App Layer
– Self-Healing
Commodity Products -- No Special Machines
7. Medallia © Copyright 2016. 7
Standard Rack -- Unified and Scalable
22 x Compute Node: Linux (Ubuntu), 2x Intel E5-v3, 256 GB Memory, 40GbE Network, 100 GB SSD
3 x Networking: Linux (Cumulus), 1x Intel Atom 64-bit, 8 GB Memory, 32x 40GbE Network
8 x Storage Node: Linux (Ubuntu), 1x Intel E5-v3, 64 GB Memory, 40GbE Network, 8x 800 GB SSD, PCIe NVRAM
8. Medallia © Copyright 2015. Confidential. 8
Where do you draw the line?
• Application in relocatable Container?
• Load-balancer in relocatable Container?
• DNS server in relocatable Container?
• Database in relocatable Container?
Challenge
Everything in Containers
12. Medallia © Copyright 2016. 12
Route Propagation
• Open Shortest Path First
– Propagated Link State Database
– Supported by every vendor
• Computes network paths with Dijkstra's algorithm
• Moving 30,000 routes: ~1 second
• BGP works just as well; OSPF auto-configures more easily
OSPF
Fully relocated IP address
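The relocation trick on this slide amounts to announcing the service's address as a host route wherever it currently lives. A hedged sketch in FRR/Quagga-style syntax (the slide names no routing daemon, and all addresses here are invented): put the relocatable /32 on the loopback and let OSPF flood the new location.

```
! frr.conf fragment -- daemon choice and addresses are assumptions
! the relocatable service IP lives on the loopback
interface lo
 ip address 10.20.30.40/32
!
router ospf
 ospf router-id 192.0.2.11
 network 10.20.30.40/32 area 0
 network 192.0.2.0/24 area 0
```

When the service moves, the new host announces the same /32 and neighbors reconverge; the ~1 second for 30,000 routes quoted above is the cost of that flood-and-Dijkstra cycle.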
14. Medallia © Copyright 2016. 14
• Docker images are ephemeral
• Persistent volumes to the rescue!
– Which work on your local machine only
• Proprietary Solutions needed for HA
– iSCSI (Large Storage Vendor)
– NFS (Large Storage Vendor, and.. performance?)
– pNFS (... right)
• Scale up, but not out
• SLA? 4-hour hardware support is not good enough!
Storage Mobility
Where did the filesystem go?
15. Medallia © Copyright 2016. 15
• No need to communicate with metadata servers in hot path
• Clean design; we understand enough to go fix problems ourselves
• Need more capacity?
– Add servers!
• Need more aggregate performance?
– Add servers!
• Need more single-node performance?
– Get creative!
Ceph
Short Version
17. Medallia © Copyright 2016. 17
What happens when the server for your monitor dies?
• It's "interesting" to switch Ceph monitor IPs. So don't.
– The monitors are services; each gets a unique IP.
• If the machine hosting a monitor dies, start the same monitor somewhere else with the same IP.
– It'll clone data from the other monitors
• Not automated (somewhat high fubar potential)
Relocatable Infrastructure
Relocatable Monitors
19. Medallia © Copyright 2016. 19
• Pre-OS Linux + initramfs from PXE+HTTP
• Unlocks self-encrypting drives (Data-at-Rest encryption)
– Key never known by runtime OS
• Check state:
– Update firmware? Unify BIOS version and config?
– Install OS?
– Boot OS?
• Completely uniform machines -- no half-installed, half-forgotten state.
Remote Boot
Always boot from PXE
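A "pre-OS Linux + initramfs from PXE+HTTP" chain-load can be sketched as a minimal iPXE script; the host name and paths below are placeholders, not from the slide.

```
#!ipxe
dhcp
kernel http://boot.internal/preos/vmlinuz quiet
initrd http://boot.internal/preos/initramfs.img
boot
```

The pre-OS environment then unlocks the drives and decides whether to flash firmware, reinstall, or boot the installed OS, so every power cycle reconverges the machine to a known state.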
20. Medallia © Copyright 2016. 20
Apache Aurora/Mesos
[Diagram: three Mesos Masters, fronted by Aurora Schedulers (alongside Hadoop and Storm schedulers), coordinate via Zookeeper to place jobs across Mesos slaves NODE-1 through NODE-8 (a mix of 12/32-CPU, 128/256-GB machines).]
Create New Job: docker-image medallia/service1, resources 2*CPU 1*GB, instances 3
"Program against your datacenter like it's a single pool of resources"
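The "Create New Job" example could be declared in Aurora's Python-based DSL roughly as follows. This is a sketch evaluated by the aurora client, not standalone Python, and the cluster, role, and cmdline values are invented; only the image, resources, and instance count come from the slide.

```
# Hypothetical .aurora file for the slide's example job
run = Process(name = 'run', cmdline = './run-service1')   # cmdline assumed
task = Task(
  name = 'service1',
  processes = [run],
  resources = Resources(cpu = 2, ram = 1*GB, disk = 1*GB)  # disk assumed
)
jobs = [Service(
  cluster = 'dc1',          # assumed cluster name
  role = 'svc',             # assumed role
  environment = 'prod',
  name = 'service1',
  instances = 3,
  container = Docker(image = 'medallia/service1'),
  task = task
)]
```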
27. Medallia © Copyright 2015. Confidential. 27
• SSDs: up to 100k 4k random-write IOPS!
– If you have an "IO pipeline"
• Real-world:
– Read: Databases don't have an IO depth of 64. It's 1.
– Read index, process, seek to correct index, read, process..
– Write: Databases want each and every transaction to be acknowledged by the storage layer
– Full round-trip down to the storage layer
• Dedicated DB servers have a LOT of buffer cache
– 24x800GB SSD = $15k. 512 GB RAM = $4k.
Real-World vs Synthetic IO
Latency, not IOPS or bandwidth!
28. Medallia © Copyright 2015. Confidential. 28
• We have two types of tables
– "A few GB"
– "A few TB"
• Application does heavy caching; few read requests
• DB Containers have plenty of memory; most tables sit in buffer cache
• If a user actually modifies something, there's a transaction...
What performance matters for DB?
fdatasync() is bottleneck
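The fdatasync() round-trip the slide singles out can be felt directly with a few lines of Python; the file path, block size, and counts here are arbitrary illustration, not Medallia's workload.

```python
import os
import tempfile
import time

def timed_fdatasync_writes(path, block=8192, blocks=100):
    """Write `blocks` blocks of `block` bytes, then fdatasync() once.

    The fdatasync() is the full round-trip to stable storage that a
    database pays to get a transaction acknowledged.
    """
    buf = b"\0" * block
    t0 = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, buf)
        os.fdatasync(fd)  # durability barrier: wait for the storage layer
    finally:
        os.close(fd)
    return time.perf_counter() - t0

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        elapsed = timed_fdatasync_writes(f.name)
        print(f"100 x 8 KB writes + fdatasync: {elapsed * 1000:.2f} ms")
```

On buffered storage the writes themselves are nearly free; almost all of the measured time is the single fdatasync(), which is why latency, not IOPS, is the number that matters here.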
29. Medallia © Copyright 2015. Confidential. 29
3 Ways to Mount

FUSE:          Easy!  Slow!      Mixed read-write: ~640 iops
KRBD:          Easy!  Fast...er  Mixed read-write: ~1550 iops  (no fancy image features)
iSCSI tgt rbd: Hard!  Slow!      Mixed read-write: ~600 iops
30. Medallia © Copyright 2015. Confidential. 30
Something that resembles PG
• Can (and do) use pgbench, but the pgbench workload and our real workload differ.
• Observe production IO pattern, replicate with fio
– Once something gives good results in fio, apply it to the real DB
• Allow buffer cache
– Yes, you have it on in production
• IO depth = 1, 8 jobs, 8 KB blocks
• fdatasync() every 100th block
• Very large files, semi-random access
• PG doesn't use fancy IO, so neither does our benchmark
"Realistic" testing with FIO
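The bullets above map almost directly onto a fio job file. A sketch; the file size and read/write mix are assumptions, since the slide states only the depth, job count, block size, and fsync cadence:

```ini
[pg-like]
rw=randrw          ; semi-random mixed read-write
rwmixwrite=50      ; assumed mix, not stated on the slide
bs=8k              ; 8 KB blocks
iodepth=1          ; databases issue one IO at a time
numjobs=8          ; 8 concurrent jobs
fdatasync=100      ; fdatasync() every 100th write
size=100g          ; "very large files" -- size assumed
buffered=1         ; allow buffer cache, as in production
```

Run with `fio pg-like.fio`; once a storage configuration scores well here, promote it to a test against the real database.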
33. Medallia © Copyright 2016. 33
Fun With Locking
• Switch is rebooted
• Aurora detects compute node dead
– Restarts job somewhere else
• New location mounts Ext4 filesystem
• Switch finishes rebooting
• Old job, still running, now writes to the mounted filesystem
• "How to repair a broken ext4 filesystem with a critical database"
Ext4 on RBD
Test all failure scenarios
34. Medallia © Copyright 2016. 34
• On map: "rbd lock add <image> …"
– If no success, then:
– "rbd status <image>": Check for watcher, 3 times, 15s apart
– If found: ABORT, ABORT!
– "ceph osd blacklist add <previous lock holder>"
– Steal lock
• On unmap: rbd lock remove
• On reboot: "ceph osd blacklist rm <self>"
Workaround
Modified RBD wrapper; /bin/sh to the rescue!
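The decision flow in the wrapper can be exercised without a cluster. A sketch in Python (not Medallia's actual /bin/sh wrapper): `run` stands in for shelling out to the rbd/ceph CLI, returning True on success, or, for `rbd status`, True if a live watcher is reported; the lock name and `<...>` placeholders are illustrative.

```python
import time

def acquire_image_lock(run, image, retries=3, wait=15):
    """Take the advisory lock on `image`, stealing it only if the
    previous holder shows no watcher after repeated checks."""
    if run(f"rbd lock add {image} lock1"):
        return "locked"                    # image was free
    for _ in range(retries):
        if run(f"rbd status {image}"):     # previous holder still alive?
            return "abort"                 # ABORT, ABORT!
        time.sleep(wait)
    # No watcher seen: fence the previous holder, then take the lock.
    run("ceph osd blacklist add <previous lock holder>")
    run(f"rbd lock remove {image} lock1 <previous locker>")
    run(f"rbd lock add {image} lock1")
    return "stolen"
```

The blacklist step is what prevents the ext4 corruption scenario on the previous slide: even if the old job wakes up, the cluster refuses its writes before the new holder mounts.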
35. Medallia © Copyright 2015. Confidential. 35
Great, we beat legacy hardware... Or did we?
• Legacy hardware has better write latency below the 90th percentile, worse above it. Higher average write IOPS.
• We want no compromise on performance
• Currently rolling out PMC NV1616 NVRAM for the Ceph write journal
– Single storage-server test very promising.
– Large-scale test ready in 2 weeks
• Experimenting with RoCE v2: RDMA over UDP
• Will post results to the Ceph mailing list
Make it faster!
36. Medallia © Copyright 2016. 36
Try this out!
Available now:
⢠Docker w/ Storage and Networking
⢠Aurora
Coming soon:
⢠DCIB
github.com/medallia