Ceph Day Chicago - Ceph at work at Bloomberg
1. CEPH AT WORK
IN BLOOMBERG
Object Store, RBD and OpenStack
August 18, 2015
By: Chris Jones
Copyright 2015 Bloomberg L.P.
2. BLOOMBERG
30 Years in under 30 Seconds
● Subscriber base financial information provider (Bloomberg Terminal)
● Online, TV, Print, Real-time streaming information
● Offices and customers in every major financial market and institution
worldwide
3. BLOOMBERG
Primary product - Information
● Bloomberg Terminal
− Over 60,000 features/functions. For example, the ability to
track oil tankers in real time via satellite feeds
● Most important internal feature to me is the “Soup
List.”
− Note: Exact numbers are not specified. Contact
media relations for specifics and other important
information.
6. CLOUD INFRASTRUCTURE GROUP
Primary customers
– Developers
– Product Groups
● Many different development groups throughout our organization
● Many thousands of developers throughout our organization
● Every one of them wants and needs resources
7. CLOUD INFRASTRUCTURE GROUP
Resource Problems
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
8. HCP “THEOREM”
All distributed storage and cloud computing systems fall under what I call
the HCP "Theorem" (Hard, Complex, Painful). Unlike the CAP Theorem, where
you can have Consistency or Availability but not both, with HCP you are
guaranteed to get at least two, if not all three, in a distributed scalable
system. The question is: how do you lessen or remove the parts of this
endless cycle?
10. HOW DID WE SOLVE IT OR DID WE?
We focused on the “sweet spot”
● Hard
− Open Source products with strong community support
− We looked for compute, networking and storage that scaled
− Engaged Security and Networking teams
● Complex
− Automation – Chef, Ansible. Everything must be able to be rebuilt from
source control (Git). No manual steps
− Engaged Security and Networking teams
● Painful
− Created converged architecture (compute/storage). In theory it looked like
it would fit in the sweet spot but in reality it created more pain
− Still working to get our developers to treat their resources as Cattle vs.
Pets – NO Pets Policy!
− Talent
− Engaged Security and Networking teams
● Sweet spot
− Ceph – Object Store and RBD Block/Volume
− OpenStack (not all projects)
12. USE IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – RBD (Block/Volumes)
● OpenStack
─ Compute, Keystone, Cinder, Glance…
─ Ephemeral storage (new)
● Object Store is becoming one of the most popular items
● OpenStack Compute instances with Ceph-backed block volumes are very popular
● We are introducing ephemeral compute storage
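For context on how those Ceph-backed volumes look at the storage layer: a Cinder volume on a Ceph backend is ultimately just an RBD image in a pool. Below is a minimal sketch with the librbd Python bindings; the pool name "volumes", the image name, and the size are illustrative, and it assumes a reachable cluster with the default /etc/ceph/ceph.conf.

```python
import rados
import rbd

# Connect with the default config/keyring (assumption: an admin client works here).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('volumes')  # hypothetical pool backing Cinder
    try:
        # Create a 10 GiB image, roughly what the Cinder RBD driver does
        # when a user asks for a new volume.
        rbd.RBD().create(ioctx, 'volume-demo', 10 * 1024 ** 3)
        image = rbd.Image(ioctx, 'volume-demo')
        print('created image of %d bytes' % image.size())
        image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```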
14. INTUITIVE OR COUNTERINTUITIVE
Completely Converged Architecture
● OpenStack and Ceph
− Reduced footprint
− Scalability
− Attempt to reduce “Hard” and “Complex” and
eliminate “Painful”
● Controller (Head) Nodes
− Ceph Mon, Ceph OSD, RGW
− Nova, Cinder, MySQL, RabbitMQ, etc
● Side Effects (Sometimes you fail)
− Had to increase pain tolerance
− Initial automation did get easier (reduced
“Hard”) but “Complex” increased along with
“Pain”
− Made it more painful to balance loads
16. OSD BANDWIDTH – ATTEMPT TO BETTER IT
Chart: OSD bandwidth after renicing the OSD daemons (higher is better)
17. OSD LATENCY – ATTEMPT TO BETTER IT
Chart: OSD latency after renicing the OSD daemons (lower is better)
NOTE: Chart mislabeled – the left axis should read milliseconds (ms) but shows seconds
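The "renice" experiment in the two charts above comes down to raising the CPU scheduling priority of every ceph-osd process on a node. A minimal sketch of that idea follows; it assumes a Linux host, root privileges, and Python 3.3+ for os.setpriority, and the niceness value is illustrative rather than the one we settled on.

```python
import os

NICENESS = -10  # illustrative: more negative means higher scheduling priority

# Walk /proc and renice every process whose command name is ceph-osd.
for entry in os.listdir('/proc'):
    if not entry.isdigit():
        continue
    try:
        with open('/proc/%s/comm' % entry) as f:
            name = f.read().strip()
    except OSError:
        continue  # process exited while we were scanning
    if name == 'ceph-osd':
        os.setpriority(os.PRIO_PROCESS, int(entry), NICENESS)
        print('reniced ceph-osd pid %s to %d' % (entry, NICENESS))
```

In practice the same effect can be had from init scripts or the process manager; the point is only that the OSD daemons get more CPU time relative to the co-located OpenStack services.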
18. LESSON LEARNED? - BETTER SOLUTION?
Semi-Converged Architecture - POD
● OpenStack and Ceph
− “Complex” increases but “Hard” and “Painful”
decrease. “Painful” could be gone but we are talking
about OpenStack too
● Controller (Head) Nodes
− Nova, Cinder, MySQL, RabbitMQ, etc. split and
balanced better
− More purpose built but easily provisioned as needed
● Ceph Nodes
− Split the Object Store out of the OpenStack cluster so it can scale
more easily
− Dedicated Ceph Mons
− Dedicated Ceph OSDs
− Dedicated RGW – Replaced Apache with Civetweb
− Much better performance and maintenance
21. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Diagrams: Ceph vs. Ephemeral storage paths
22. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the clone sketch below)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph with lower latency
● Can provide fairly large volumes for cheap
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (the tradeoff for reliability)
● Higher latency due to Ceph being network based instead of local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive in a RAID 0 set fails, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD/LVM mode without RAID, performance was not as good as Ceph's
● Less important: with RAID your drives need to be the same size or you lose capacity
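To make the COW-provisioning advantage concrete, here is roughly how a clone is taken from a protected snapshot of a base image with the librbd Python bindings. Pool and image names are made up, and the parent is assumed to be a format-2 image with layering enabled.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')    # hypothetical pool holding base images
volumes = cluster.open_ioctx('volumes')  # hypothetical pool for new volumes

# One-time setup on the parent: snapshot it and protect the snapshot.
parent = rbd.Image(images, 'base-image')
parent.create_snap('golden')
parent.protect_snap('golden')
parent.close()

# The provisioning step itself: a copy-on-write clone. It completes almost
# instantly because no data is copied up front.
rbd.RBD().clone(images, 'base-image', 'golden', volumes, 'volume-from-image',
                features=rbd.RBD_FEATURE_LAYERING)

volumes.close()
images.close()
cluster.shutdown()
```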
23. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
                                  EPHEMERAL      CEPH
Block write bandwidth (MB/s)       1,094.02    642.15
Block read bandwidth (MB/s)        1,826.43    639.47
Character read bandwidth (MB/s)        4.93      4.31
Character write bandwidth (MB/s)       0.83      0.75
Block write latency (ms)              9.502    37.096
Block read latency (ms)               8.121     4.941
Character read latency (ms)           2.395     3.322
Character write latency (ms)         11.052    13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph
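As a quick sanity check, the ratios implied by the table above (nothing here beyond the published numbers):

```python
# Ratios from the table above. Bandwidth: higher is better, so ephemeral/Ceph;
# latency: lower is better, so the slower side over the faster side.
print(round(1094.02 / 642.15, 2))  # block write bandwidth: ephemeral ~1.7x Ceph
print(round(1826.43 / 639.47, 2))  # block read bandwidth: ephemeral ~2.9x Ceph
print(round(37.096 / 9.502, 2))    # block write latency: Ceph ~3.9x ephemeral
print(round(8.121 / 4.941, 2))     # block read latency: ephemeral ~1.6x Ceph
```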
24. OBJECT STORE STACK (SINGLE RACK)
Small single-purpose (lab or similar) cluster/rack – Red Hat 7.1
● Rack = Cluster
● Smaller cluster – the number of storage nodes can be a "short stack"
● 1 TOR and 1 Rack Mgt Node
● 3 Ceph Mon Nodes (No OSDs)
● Up to 14 Ceph OSD nodes (depends on size)
● 2x or 3x Replication depending on need (3x default)
● 1 RGW (coexists with a Mon or OSD node)
● 10g Cluster interface
● 10g Public interface
● 1g Management interface
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for Journals with 12:1 ratio
− Choose based on tolerance level and failure domain for specific use case
− ~1PB of raw space – ~330TB usable at 3x replication (depends on drives); see the rough arithmetic below
Rack diagram: TOR/IPMI node, 3 Mon nodes, storage nodes
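The raw/usable figures above follow from simple arithmetic. Assumptions: 14 OSD nodes fully populated with 12 x 6TB drives and 3x replication; this ignores journal space, filesystem overhead, and the headroom you keep so OSDs don't fill up.

```python
osd_nodes = 14
drives_per_node = 12
drive_tb = 6
replication = 3

raw_tb = osd_nodes * drives_per_node * drive_tb  # 1008 TB, i.e. ~1 PB raw
usable_tb = raw_tb / float(replication)          # ~336 TB before overhead
print(raw_tb, usable_tb)
```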
25. OBJECT STORE STACK (3 RACK CLUSTER)
Diagram: three racks behind spine switches and load balancers; each rack has a TOR (leaf) switch, 1 Mon/RGW node, and storage nodes
26. OBJECT STORE STACK (3 RACK CLUSTER)
Standard cluster is 3 or more racks
● Min of 3 Racks = Cluster
● 1 TOR and 1 Rack Mgt Node
● 1 Ceph Mon node per rack (No OSDs)
● Up to 15 Ceph OSD nodes (depends on
size) per rack
● 1 RGW (dedicated Node)
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition
on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals
with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for
Journals with 12:1 ratio
− Choose based on tolerance level and failure
domain for specific use case
Rack diagram: TOR/IPMI node, 1 Mon/RGW node, storage nodes
27. OBJECT STORE STACK
Standard configuration
● Min of 3 Racks = Cluster
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters larger than 3 racks: we need
to keep an odd number of Mons, so some racks may not have one. On larger
clusters we try to keep racks and Mons in different power zones
● We have developed a healthy "Pain" tolerance. We can survive an entire rack
going down, but what we mainly see are drive failures and the occasional node failure.
● Min 1 RGW (dedicated Node) per rack (may want more)
● Hardware load balancers to RGWs with redundancy
● OSD Nodes (lower density nodes) – we run both options below, and we are
actively looking at new hardware and drive options
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
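From a client's point of view, everything behind the hardware load balancers is simply an S3-compatible endpoint. A minimal sketch with the classic boto S3 bindings; the endpoint, credentials, and bucket name are placeholders for whatever your RGW/load balancer setup exposes.

```python
import boto
import boto.s3.connection

# Placeholder endpoint: the VIP your load balancers present for the RGWs.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='objectstore.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from RGW')
print([b.name for b in conn.get_all_buckets()])
```

Switching to the production endpoint and TLS is just a matter of the host and is_secure settings; the application code does not care that Ceph is underneath.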
28. AUTOMATION
All of what we do only happens because of automation
● Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for
orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/chef-bcpc
● Ceph specific options
− Ceph Deploy: https://github.com/ceph/ceph-deploy
− Ceph Ansible: https://github.com/ceph/ceph-ansible
− Ceph Chef: https://github.com/ceph/ceph-cookbook
● Our bootstrap server also acts as the Chef server for each cluster
29. TESTING
Testing is critical. We use different strategies for the different parts of
OpenStack and Ceph we test
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use this more in our
DevOps pipeline
− Rally – Can’t do distributed testing but we use it to test bottlenecks in OpenStack itself
● Ceph
− RADOS Bench
− COS Bench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− CeTune
− Bonnie++
− FIO
● Ceph – RGW
− JMeter – need to test load at scale; it takes a cloud to test a cloud
● A lot of the time you find it's your network, load balancers, etc.
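Most of the tools above are run from the shell, but the same kind of quick smoke test can be sketched with the librados Python bindings: write a batch of objects, time it, and clean up. The pool name, object count, and object size are arbitrary; this is a sanity check, not a replacement for rados bench or CBT.

```python
import time
import rados

POOL = 'bench-test'                  # hypothetical pool
COUNT = 100
PAYLOAD = b'x' * (4 * 1024 * 1024)   # 4 MiB objects, similar to rados bench defaults

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

start = time.time()
for i in range(COUNT):
    ioctx.write_full('bench_obj_%d' % i, PAYLOAD)
elapsed = time.time() - start
print('wrote %d MiB in %.1fs (%.1f MB/s)' % (COUNT * 4, elapsed, COUNT * 4 / elapsed))

# Clean up so the test pool does not fill with junk.
for i in range(COUNT):
    ioctx.remove_object('bench_obj_%d' % i)

ioctx.close()
cluster.shutdown()
```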
31. CEPH USE CASE DEMAND – GROWING!
Diagram: Ceph at the center, surrounded by use cases: OpenStack, Object,
Immutable, Real-time*, Big Data*?
*Possible use cases if performance is enhanced
32. WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
● Containers and PaaS
− We are currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− Improved GoCD and/or Jenkins strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− Erasure coding
− Performance improvements – Ceph Hackathon showed very promising improvements
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (already possible) with better key management
− Purpose-built pools for specific use cases (e.g., lower density but blazingly fast hot-swappable NVMe SSDs)
− Possible RGW Caching. External pulls come only from CDN
34. ADDITIONAL RESOURCES
● Chris Jones: cjones303@bloomberg.net
● Twitter: @hanschrisjones, @iqstack, @cloudm2
● BCPC: https://github.com/bloomberg/chef-bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● Ceph Hackathon: http://pad.ceph.com/p/hackathon_2015-08
● *Soon – a pure Ceph Object Store (COS) repo will be added to Bloomberg's
GitHub
− This will have no OpenStack and only be Object Store (RGW – Rados
Gateway), no block devices (RBD)
● Other repos (automation, new projects, etc.):
− IQStack: https://github.com/iqstack - managed by me (disclosure)
− Personal: https://github.com/cloudm2 - me
− Ansible: https://github.com/ceph/ceph-ansible
− Chef: https://github.com/ceph/ceph-cookbook - this one is going through a
major overhaul and also managed by me for Ceph