Bloomberg's Chris Jones and Chris Morgan joined Red Hat Storage Day New York on January 19, 2016, to explain how Red Hat Ceph Storage helps the financial giant tackle its data storage challenges.
Ceph at Work in Bloomberg: Object Store, RBD and OpenStack
1. CEPH AT WORK IN BLOOMBERG
Object Store, RBD and OpenStack
January 19, 2016
By: Chris Jones & Chris Morgan
2. BLOOMBERG
30 Years in under 30 Seconds
● Subscriber-based financial provider (Bloomberg Terminal)
● Online, TV, print, and real-time streaming information
● Offices and customers in every major financial market and institution worldwide
3. BLOOMBERG
Primary product – Information
● Bloomberg Terminal
− Approximately 60,000 features/functions. For example, the ability to track oil tankers in real time via satellite feeds
− Note: Exact numbers are not specified. Contact media relations for specifics and other important information.
5. CLOUD INFRASTRUCTURE GROUP
Primary customers
– Developers
– Product Groups
● Many different development groups throughout our organization
● Currently about 3,000 R&D developers
● Every one of them wants and needs resources
6. CLOUD INFRASTRUCTURE GROUP
Resource Challenges
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
7. USER BASE (EXAMPLES)
Resources and Use cases
● Multiple Data Centers
− Each DC contains *many* Network Tiers, including a DMZ for public-facing Bloomberg assets
− There is at least one Ceph/OpenStack Cluster per Network Tier
● Developer Community Supported
− Public facing Bloomberg products
− Machine learning backend for smart apps
− Compliance-based resources
− Use cases continue to climb as Devs need more storage and compute capacity
9. USED IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – Block/Volume
● OpenStack
─ Different flavors of compute
─ Ephemeral storage
● The Object Store is becoming one of the most popular items (see the sketch below)
● OpenStack compute with Ceph-backed block store volumes is very popular
● We introduced ephemeral compute storage
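To make the Object Store item concrete: RGW speaks the S3 API, so a developer can talk to it with a stock S3 client. A minimal sketch using the boto library of the era; the endpoint, credentials, and bucket name are placeholder assumptions, not Bloomberg's actual values.

    import boto
    import boto.s3.connection

    # Connect to a Ceph RGW endpoint over its S3-compatible API
    # (host, port, and keys are placeholders).
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='rgw.example.com',
        port=8080,
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Create a bucket, then store and read back one object.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from RGW')
    print(key.get_contents_as_string())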
11. SUPER HYPER-CONVERGED STACK
(Original) Converged Architecture Rack Layout
● 3 Head Nodes (Controller Nodes)
− Ceph Monitor
− Ceph OSD
− OpenStack Controllers (All of them!)
− HAProxy
● 1 Bootstrap Node
− Cobbler (PXE Boot)
− Repos
− Chef Server
− Rally/Tempest
● Remaining Nodes
− Nova Compute
− Ceph OSDs
− RGW – Apache
● Ubuntu
● Shared spine with Hadoop resources
[Rack diagram: bootstrap node at top, head nodes, and the remaining stack of compute/Ceph OSD/RGW/Apache nodes, shown as a sliced view of the rack.]
12. NEW POD ARCHITECTURE
[Diagram, illustrative only and not representative: two PODs, each behind its own TOR. An OpenStack POD runs HAProxy, OS-Nova, OS-Rabbit, and OS-DB nodes; a Ceph POD runs three Ceph Mon nodes and Ceph OSD nodes serving RBD only. Separate bootstrap and monitoring nodes sit alongside. The ephemeral tier ("fast/dangerous") is not Ceph-backed and is exposed through host aggregates & flavors, sketched below.]
A number of large providers have taken similar approaches.
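The "host aggregates & flavors" note is how a non-Ceph-backed ephemeral tier is typically exposed to users: group the hypervisors with local flash into an aggregate and pin a flavor to it. A hedged sketch with python-novaclient; all names, credentials, and metadata keys are illustrative assumptions, and it relies on the scheduler's AggregateInstanceExtraSpecsFilter being enabled.

    from novaclient import client

    # Era-appropriate v2 client; endpoint and credentials are placeholders.
    nova = client.Client('2', 'admin', 'secret', 'admin',
                         'http://keystone.example.com:5000/v2.0')

    # Aggregate grouping the hypervisors that carry local (ephemeral) flash.
    agg = nova.aggregates.create('ephemeral-fast', 'nova')
    nova.aggregates.set_metadata(agg.id, {'ephemeral': 'true'})
    nova.aggregates.add_host(agg.id, 'compute-ephemeral-01')

    # Flavor whose extra spec steers instances onto that aggregate only.
    flavor = nova.flavors.create('m1.ephemeral', ram=8192, vcpus=4,
                                 disk=40, ephemeral=200)
    flavor.set_keys({'aggregate_instance_extra_specs:ephemeral': 'true'})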
14. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
[Diagram: side-by-side comparison of the Ceph-backed and ephemeral storage paths.]
Ephemeral is a new feature option added to address high-IOP applications.
15. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the sketch after this list)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph, with lower latency
● Can provide fairly large volumes cheaply
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (the tradeoff for reliability)
● Higher latency because Ceph is network-based rather than local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive in a RAID 0 set fails, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD/LVM mode without RAID, performance was not as good as Ceph's
● Less important: with RAID, your drives need to be the same size or you lose capacity
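The COW provisioning advantage is visible in a few lines of the python-rbd bindings: snapshot a golden image once, protect the snapshot, and each new instance disk is a near-instant copy-on-write clone. Pool and image names here are assumptions.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('volumes')  # hypothetical pool name

    # Snapshot and protect a golden image so it can be cloned.
    with rbd.Image(ioctx, 'golden-image') as img:
        img.create_snap('base')
        img.protect_snap('base')

    # COW clone: near-instant regardless of image size.
    rbd.RBD().clone(ioctx, 'golden-image', 'base',
                    ioctx, 'instance-0001-disk',
                    features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()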
16. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Metric                             EPHEMERAL      CEPH
Block write bandwidth (MB/s)        1,094.02    642.15
Block read bandwidth (MB/s)         1,826.43    639.47
Character read bandwidth (MB/s)         4.93      4.31
Character write bandwidth (MB/s)        0.83      0.75
Block write latency (ms)               9.502    37.096
Block read latency (ms)                8.121     4.941
Character read latency (ms)            2.395     3.322
Character write latency (ms)          11.052    13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph
Numbers can also increase with additional tuning and different devices
17. CHALLENGES – LESSONS LEARNED
Network
● It’s all about the network.
− Changed MTU from 1500 to 9000 on certain interfaces (Float interface – Storage interface)
− Hardware Load Balancers – keep an eye on performance
● Hardware
− Moving to more commodity-driven hardware
− All flash storage in compute cluster (high cost, good for block and ephemeral)
Costs
● Storage costs are very high in a converged compute cluster for Object Store
Analytics
● Need to know how the cluster is being used
● Need to know whether the TPS meets the SLA
● Test directly against nodes, then layer in network components until you can verify all choke points in the data flow path (see the sketch below)
● Monitor and test, always
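One lightweight way to apply the choke-point advice: time the same request at each hop of the data path, from an RGW node hit directly, through the HAProxy tier, out to the public load balancer. A sketch with hypothetical endpoints; real load testing belongs in JMeter, COSBench, or similar.

    import requests

    # Hypothetical endpoints, one per layer of the data path.
    ENDPOINTS = {
        'rgw-node-direct': 'http://rgw01.example.com:8080/',
        'haproxy-tier':    'http://rgw-vip.example.com/',
        'public-lb':       'http://objects.example.com/',
    }

    # Compare latency hop by hop to locate the choke point.
    for hop, url in ENDPOINTS.items():
        r = requests.get(url, timeout=5)
        print('%-16s %s  %.1f ms' % (hop, r.status_code,
                                     r.elapsed.total_seconds() * 1000))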
19. OBJECT STORE STACK (RACK CONFIG)
Red Hat 7.1
● 1 TOR and 1 Rack Mgt Node
● 3 1U nodes (Mon, RGW, Util)
● 17 2U Ceph OSD nodes
● 2x or 3x Replication depending on need (3x default)
● Secondary RGW (may coexist with OSD Node)
● 10g Cluster interface
● 10g Public interface
● 1 IPMI interface
● OSD Nodes (high density server nodes)
− 6TB HDD x 12 – Journal partitions on SSD
− No RAID1 OS drives – instead we partitioned off a small amount of SSD1 for the OS and swap, with the remainder of SSD1 used for some journals and SSD2 used for the remaining journals (see the sketch below)
− Failure domain is a node
[Rack diagram: TOR/IPMI at top, three 1U nodes, then converged 2U storage nodes filling the rack.]
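A hedged sketch of provisioning that OSD layout with the era-appropriate ceph-disk tool: each HDD becomes an OSD data device, and each call carves a fresh journal partition out of one of the two SSDs. Device names are assumptions.

    import subprocess

    HDDS = ['/dev/sd%s' % c for c in 'abcdefghijkl']  # 12 x 6TB data drives
    SSDS = ['/dev/sdm', '/dev/sdn']  # SSD1 also carries the OS and swap

    # ceph-disk allocates a new journal partition on the SSD per call.
    for i, hdd in enumerate(HDDS):
        journal_ssd = SSDS[i % 2]  # split journals across the two SSDs
        subprocess.check_call(['ceph-disk', 'prepare', hdd, journal_ssd])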
20. OBJECT STORE STACK (ARCHITECTURE)
[Diagram: leaf/spine fabric. Each rack has a TOR leaf switch, one Mon/RGW node, and storage nodes; leaves uplink to redundant spines and hardware load balancers.]
21. OBJECT STORE STACK
Standard configuration (Archive Cluster)
● Min of 3 Racks = Cluster
● OS – Red Hat 7.1
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters of more than 3 racks: we need to keep an odd number of Mons, so some racks may not have one. In larger clusters we try to keep racks & Mons in different power zones
● We have developed a healthy “pain” tolerance. We mainly see drive failures and some node failures.
● Min 1 RGW (dedicated Node) per rack (may want more)
● Hardware load balancers to RGWs with redundancy
● Erasure coded pools (no cache tiers at present – testing). We also use a host failure-domain profile with 8/3 (k/m), sketched after this list
● Near-full and full ratios are .75/.85 respectively
● Index sharding
● Federated (regions/zones)
● All server nodes, no JBOD expansions
● S3 only at present but we do have a few requests for Swift
● Fully AUTOMATED – Chef cookbooks to configure and manage cluster (some Ansible)
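The 8/3 (k/m) host-profile pool mentioned above can be expressed through the ceph CLI roughly as follows. Pre-Luminous releases of the era spell the option ruleset-failure-domain (newer ones use crush-failure-domain); the profile name, pool name, and PG counts are assumptions.

    import subprocess

    def ceph(*args):
        # Thin wrapper so each command reads as one line.
        subprocess.check_call(('ceph',) + args)

    # 8 data + 3 coding chunks; data survives the loss of up to 3 hosts.
    ceph('osd', 'erasure-code-profile', 'set', 'bb-ec-8-3',
         'k=8', 'm=3', 'ruleset-failure-domain=host')

    # Erasure-coded RGW data pool using that profile (PG count illustrative).
    ceph('osd', 'pool', 'create', 'rgw-buckets-data',
         '4096', '4096', 'erasure', 'bb-ec-8-3')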
22. AUTOMATION
All of what we do only happens because of automation
● Company policy – Chef
● The Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/bcpc
● Ceph specific options
− Ceph Chef: https://github.com/ceph/ceph-chef
− Bloomberg Object Store: https://github.com/bloomberg/chef-bcs
− Ceph Deploy: https://github.com/ceph/ceph-deploy
− Ceph Ansible: https://github.com/ceph/ceph-ansible
● Each cluster's bootstrap server is also its Chef server
23. TESTING
Testing is critical. We use different strategies for the different parts of OpenStack and Ceph we test
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use this more in our DevOps pipeline
− Rally – Can’t do distributed testing but we use it to test bottlenecks in OpenStack itself
● Ceph
− RADOS Bench (see the sketch after this list)
− COSBench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− Bonnie++
− FIO
● Ceph – RGW
− JMeter – Need to test load at scale. It takes a cloud to test a cloud
● Much of the time you find it's your network, load balancers, etc.
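Of the tools on this slide, RADOS bench is the quickest to script. A sketch that runs a write phase, a sequential read phase over the same objects, and then cleans up; the pool name and duration are assumptions.

    import subprocess

    POOL, SECONDS = 'bench-test', 60

    # Write phase; keep the objects so the read phase has data to read.
    subprocess.check_call(['rados', 'bench', '-p', POOL, str(SECONDS),
                           'write', '--no-cleanup'])

    # Sequential read phase over the objects just written.
    subprocess.check_call(['rados', 'bench', '-p', POOL, str(SECONDS), 'seq'])

    # Remove the benchmark objects.
    subprocess.check_call(['rados', '-p', POOL, 'cleanup'])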
24. CEPH USE CASE DEMAND – GROWING!
[Diagram: Ceph at the center, with demand growing from OpenStack, Object, Immutable, Real-time*, and Big Data*? use cases.]
*Possible use cases if performance is enhanced
25. WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
− LBaaS, Neutron
● Containers and PaaS
− We're currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− Improved GoCD and/or Jenkins strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− New Block Storage Cluster
− Super Cluster design
− Performance improvements – testing Jewel
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (already possible) but with better key management
− NVMe for Journals and maybe for high IOP block devices
− Cache Tier (need validation tests)
27. ADDITIONAL RESOURCES
● Chris Jones: cjones303@bloomberg.net
− Github: cloudm2
● Chris Morgan: cmorgan2@bloomberg.net
− Github: mihalis68
Cookbooks:
● BCPC: https://github.com/bloomberg/bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● BCS: https://github.com/bloomberg/chef-bcs
● Ceph-Chef: https://github.com/ceph/ceph-chef
The last two repos make up the Ceph Object Store and the full Ceph Chef cookbooks.