Ceph Day Chicago - Ceph at work at Bloomberg
1. CEPH AT WORK
IN BLOOMBERG
Object Store, RBD and OpenStack
August 18, 2015
By: Chris Jones
Copyright 2015 Bloomberg L.P.
2. BLOOMBERG
30 Years in under 30 Seconds
● Subscriber base financial information provider (Bloomberg Terminal)
● Online, TV, Print, Real-time streaming information
● Offices and customers in every major financial market and institution
worldwide
3. BLOOMBERG
Primary product - Information
● Bloomberg Terminal
− Over 60,000 features/functions. For example, the ability to
track oil tankers in real time via satellite feeds
● Most important internal feature to me is the “Soup
List.”
− Note: Exact numbers are not specified. Contact
media relations for specifics and other important
information.
6. CLOUD INFRASTRUCTURE GROUP
Primary customers
– Developers
– Product Groups
● Many different development groups throughout our organization
● Many thousands of developers throughout our organization
● Every one of them wants and needs resources
7. CLOUD INFRASTRUCTURE GROUP
Resource Problems
● Developers
− Development
− Testing
− Automation (Cattle vs. Pets)
● Organizations
− POC
− Products in production
− Automation
● Security/Networking
− Compliance
8. HCP “THEOREM”
All distributed storage and cloud computing systems fall under what I call
the HCP "Theorem" (Hard, Complex, Painful). Unlike the CAP Theorem, where
you can have Consistency or Availability but not both, with HCP you are
guaranteed to get at least two, if not all three, in a distributed scalable
system. The question is: how do you lessen or remove the parts of this
endless cycle?
10. HOW DID WE SOLVE IT OR DID WE?
We focused on the “sweet spot”
● Hard
− Open Source products with strong community support
− We looked for compute, networking and storage that scaled
− Engaged Security and Networking teams
● Complex
− Automation – Chef, Ansible. Everything must be able to be rebuilt from
source control (Git). No manual steps
− Engaged Security and Networking teams
● Painful
− Created converged architecture (compute/storage). In theory it looked like
it would fit in the sweet spot but in reality it created more pain
− Still working to get our developers to treat their resources as Cattle vs.
Pets – NO Pets Policy!
− Talent
− Engaged Security and Networking teams
● Sweet spot
− Ceph – Object Store and RBD Block/Volume
− OpenStack (not all projects)
12. USE IN BLOOMBERG
● Ceph – RGW (Object Store)
● Ceph – RBD (Block/Volumes)
● OpenStack
─ Compute, Keystone, Cinder, Glance…
─ Ephemeral storage (new)
● Object Store is becoming one of the most popular items
● OpenStack Compute instances with Ceph-backed block volumes are very popular
● We are introducing ephemeral compute storage
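For context on how those Ceph-backed volumes look at the storage layer: a Cinder volume on a Ceph backend is ultimately just an RBD image in a pool. Below is a minimal sketch with the librbd Python bindings; the pool name "volumes", the image name, and the size are illustrative, and it assumes a reachable cluster with the default /etc/ceph/ceph.conf.

```python
import rados
import rbd

# Connect with the default config/keyring (assumption: an admin client works here).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('volumes')  # hypothetical pool backing Cinder
    try:
        # Create a 10 GiB image, roughly what the Cinder RBD driver does
        # when a user asks for a new volume.
        rbd.RBD().create(ioctx, 'volume-demo', 10 * 1024 ** 3)
        image = rbd.Image(ioctx, 'volume-demo')
        print('created image of %d bytes' % image.size())
        image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```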
14. INTUITIVE OR COUNTERINTUITIVE
Completely Converged Architecture
● OpenStack and Ceph
− Reduced footprint
− Scalability
− Attempt to reduce “Hard” and “Complex” and
eliminate “Painful”
● Controller (Head) Nodes
− Ceph Mon, Ceph OSD, RGW
− Nova, Cinder, MySQL, RabbitMQ, etc
● Side Effects (Sometimes you fail)
− Had to increase pain tolerance
− Initial automation did get easier (reduced
“Hard”) but “Complex” increased along with
“Pain”
− Made it more painful to balance loads
16. OSD BANDWIDTH – ATTEMPT TO BETTER IT
Chart: OSD bandwidth after renicing the OSD daemons (higher is better)
17. OSD LATENCY – ATTEMPT TO BETTER IT
Chart: OSD latency after renicing the OSD daemons (lower is better)
NOTE: Chart mislabeled – the left axis should read milliseconds (ms) but shows seconds
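The "renice" experiment in the two charts above comes down to raising the CPU scheduling priority of every ceph-osd process on a node. A minimal sketch of that idea follows; it assumes a Linux host, root privileges, and Python 3.3+ for os.setpriority, and the niceness value is illustrative rather than the one we settled on.

```python
import os

NICENESS = -10  # illustrative: more negative means higher scheduling priority

# Walk /proc and renice every process whose command name is ceph-osd.
for entry in os.listdir('/proc'):
    if not entry.isdigit():
        continue
    try:
        with open('/proc/%s/comm' % entry) as f:
            name = f.read().strip()
    except OSError:
        continue  # process exited while we were scanning
    if name == 'ceph-osd':
        os.setpriority(os.PRIO_PROCESS, int(entry), NICENESS)
        print('reniced ceph-osd pid %s to %d' % (entry, NICENESS))
```

In practice the same effect can be had from init scripts or the process manager; the point is only that the OSD daemons get more CPU time relative to the co-located OpenStack services.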
18. LESSON LEARNED? - BETTER SOLUTION?
Semi-Converged Architecture - POD
● OpenStack and Ceph
− “Complex” increases but “Hard” and “Painful”
decrease. “Painful” could be gone but we are talking
about OpenStack too
● Controller (Head) Nodes
− Nova, Cinder, MySQL, RabbitMQ, etc. split and
balanced better
− More purpose built but easily provisioned as needed
● Ceph Nodes
− Split the Object Store out of the OpenStack cluster so it can scale
more easily
− Dedicated Ceph Mons
− Dedicated Ceph OSDs
− Dedicated RGW – Replaced Apache with Civetweb
− Much better performance and maintenance
21. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Diagrams: Ceph vs. Ephemeral storage paths
22. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
Ceph – Advantages
● All data is replicated at least 3 ways across the cluster
● Ceph RBD volumes can be created, attached and detached from any hypervisor
● Very fast provisioning using COW (copy-on-write) images (see the clone sketch below)
● Allows easy instance re-launch in the event of hypervisor failure
● High read performance
Ephemeral – Advantages
● Offers read/write speeds that can be 3-4 times faster than Ceph with lower latency
● Can provide fairly large volumes for cheap
Ceph – Disadvantages
● All writes must be acknowledged by multiple nodes before being considered committed (the tradeoff for reliability)
● Higher latency due to Ceph being network based instead of local
Ephemeral – Disadvantages
● Trades data integrity for speed: if one drive in a RAID 0 set fails, all data on that node is lost
● May be difficult to add more capacity (depends on the type of RAID)
● Running in JBOD/LVM mode without RAID, performance was not as good as Ceph's
● Less important: with RAID your drives need to be the same size or you lose capacity
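To make the COW-provisioning advantage concrete, here is roughly how a clone is taken from a protected snapshot of a base image with the librbd Python bindings. Pool and image names are made up, and the parent is assumed to be a format-2 image with layering enabled.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')    # hypothetical pool holding base images
volumes = cluster.open_ioctx('volumes')  # hypothetical pool for new volumes

# One-time setup on the parent: snapshot it and protect the snapshot.
parent = rbd.Image(images, 'base-image')
parent.create_snap('golden')
parent.protect_snap('golden')
parent.close()

# The provisioning step itself: a copy-on-write clone. It completes almost
# instantly because no data is copied up front.
rbd.RBD().clone(images, 'base-image', 'golden', volumes, 'volume-from-image',
                features=rbd.RBD_FEATURE_LAYERING)

volumes.close()
images.close()
cluster.shutdown()
```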
23. EPHEMERAL VS. CEPH BLOCK STORAGE
Numbers will vary in different environments. Illustrations are simplified.
                                  EPHEMERAL      CEPH
Block write bandwidth (MB/s)       1,094.02    642.15
Block read bandwidth (MB/s)        1,826.43    639.47
Character read bandwidth (MB/s)        4.93      4.31
Character write bandwidth (MB/s)       0.83      0.75
Block write latency (ms)              9.502    37.096
Block read latency (ms)               8.121     4.941
Character read latency (ms)           2.395     3.322
Character write latency (ms)         11.052    13.587
Note: Ephemeral in JBOD/LVM mode is not as fast as Ceph
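As a quick sanity check, the ratios implied by the table above (nothing here beyond the published numbers):

```python
# Ratios from the table above. Bandwidth: higher is better, so ephemeral/Ceph;
# latency: lower is better, so the slower side over the faster side.
print(round(1094.02 / 642.15, 2))  # block write bandwidth: ephemeral ~1.7x Ceph
print(round(1826.43 / 639.47, 2))  # block read bandwidth: ephemeral ~2.9x Ceph
print(round(37.096 / 9.502, 2))    # block write latency: Ceph ~3.9x ephemeral
print(round(8.121 / 4.941, 2))     # block read latency: ephemeral ~1.6x Ceph
```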
24. OBJECT STORE STACK (SINGLE RACK)
Small single-purpose (lab or similar) cluster/rack – Red Hat 7.1
● Rack = Cluster
● Smaller cluster – the number of storage nodes can be a "short stack"
● 1 TOR and 1 Rack Mgt Node
● 3 Ceph Mon Nodes (No OSDs)
● Up to 14 Ceph OSD nodes (depends on size)
● 2x or 3x Replication depending on need (3x default)
● 1 RGW (coexists with a Mon or OSD node)
● 10g Cluster interface
● 10g Public interface
● 1g Management interface
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for Journals with 12:1 ratio
− Choose based on tolerance level and failure domain for specific use case
− ~1PB of raw space – ~330TB usable at 3x replication (depends on drives); see the rough arithmetic below
Rack diagram: TOR/IPMI node, 3 Mon nodes, storage nodes
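The raw/usable figures above follow from simple arithmetic. Assumptions: 14 OSD nodes fully populated with 12 x 6TB drives and 3x replication; this ignores journal space, filesystem overhead, and the headroom you keep so OSDs don't fill up.

```python
osd_nodes = 14
drives_per_node = 12
drive_tb = 6
replication = 3

raw_tb = osd_nodes * drives_per_node * drive_tb  # 1008 TB, i.e. ~1 PB raw
usable_tb = raw_tb / float(replication)          # ~336 TB before overhead
print(raw_tb, usable_tb)
```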
25. OBJECT STORE STACK (3 RACK CLUSTER)
Diagram: three racks behind spine switches and load balancers; each rack has a TOR (leaf) switch, 1 Mon/RGW node, and storage nodes
26. OBJECT STORE STACK (3 RACK CLUSTER)
Standard cluster is 3 or more racks
● Min of 3 Racks = Cluster
● 1 TOR and 1 Rack Mgt Node
● 1 Ceph Mon node per rack (No OSDs)
● Up to 15 Ceph OSD nodes (depends on
size) per rack
● 1 RGW (dedicated Node)
● OSD Nodes (lower density nodes)
− Option 1: 6TB HDD x 12 – Journal partition
on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals
with 5:1 ratio
− Option 3: 6TB HDD x 12 – 1 NVMe SSD for
Journals with 12:1 ratio
− Choose based on tolerance level and failure
domain for specific use case
Rack diagram: TOR/IPMI node, 1 Mon/RGW node, storage nodes
27. OBJECT STORE STACK
Standard configuration
● Min of 3 Racks = Cluster
● Cluster Network: Bonded 10g or higher depending on size of cluster
● Public Network: Bonded 10g for RGW interfaces
● 1 Ceph Mon node per rack, except on clusters larger than 3 racks: we need
to keep an odd number of Mons, so some racks may not have one. On larger
clusters we try to keep racks and Mons in different power zones
● We have developed a healthy "Pain" tolerance. We can survive an entire rack
going down, but what we mainly see are drive failures and the occasional node failure.
● Min 1 RGW (dedicated Node) per rack (may want more)
● Hardware load balancers to RGWs with redundancy
● OSD Nodes (lower density nodes) – we run both options below, and we are
actively looking at new hardware and drive options
− Option 1: 6TB HDD x 12 – Journal partition on HDD
− Option 2: 6TB HDD x 10 – 2 SSD Journals with 5:1 ratio
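From a client's point of view, everything behind the hardware load balancers is simply an S3-compatible endpoint. A minimal sketch with the classic boto S3 bindings; the endpoint, credentials, and bucket name are placeholders for whatever your RGW/load balancer setup exposes.

```python
import boto
import boto.s3.connection

# Placeholder endpoint: the VIP your load balancers present for the RGWs.
conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    host='objectstore.example.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('demo-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('Hello from RGW')
print([b.name for b in conn.get_all_buckets()])
```

Switching to the production endpoint and TLS is just a matter of the host and is_secure settings; the application code does not care that Ceph is underneath.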
28. AUTOMATION
All of what we do only happens because of automation
● Cloud Infrastructure Group uses Chef and Ansible. We use Ansible for
orchestration and maintenance
● Bloomberg Github: https://github.com/bloomberg/chef-bcpc
● Ceph specific options
− Ceph Deploy: https://github.com/ceph/ceph-deploy
− Ceph Ansible: https://github.com/ceph/ceph-ansible
− Ceph Chef: https://github.com/ceph/ceph-cookbook
● Our bootstrap server also acts as the Chef server for each cluster
29. TESTING
Testing is critical. We use different strategies for the different parts of
OpenStack and Ceph we test
● OpenStack
− Tempest – We currently only use this for patches we make. We plan to use this more in our
DevOps pipeline
− Rally – Can’t do distributed testing but we use it to test bottlenecks in OpenStack itself
● Ceph
− RADOS Bench
− COS Bench – Going to try this with CBT
− CBT – Ceph Benchmark Testing
− CeTune
− Bonnie++
− FIO
● Ceph – RGW
− JMeter – need to test load at scale; it takes a cloud to test a cloud
● A lot of the time you find it's your network, load balancers, etc.
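Most of the tools above are run from the shell, but the same kind of quick smoke test can be sketched with the librados Python bindings: write a batch of objects, time it, and clean up. The pool name, object count, and object size are arbitrary; this is a sanity check, not a replacement for rados bench or CBT.

```python
import time
import rados

POOL = 'bench-test'                  # hypothetical pool
COUNT = 100
PAYLOAD = b'x' * (4 * 1024 * 1024)   # 4 MiB objects, similar to rados bench defaults

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

start = time.time()
for i in range(COUNT):
    ioctx.write_full('bench_obj_%d' % i, PAYLOAD)
elapsed = time.time() - start
print('wrote %d MiB in %.1fs (%.1f MB/s)' % (COUNT * 4, elapsed, COUNT * 4 / elapsed))

# Clean up so the test pool does not fill with junk.
for i in range(COUNT):
    ioctx.remove_object('bench_obj_%d' % i)

ioctx.close()
cluster.shutdown()
```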
31. CEPH USE CASE DEMAND – GROWING!
Diagram: Ceph at the center, surrounded by use cases: OpenStack, Object,
Immutable, Real-time*, Big Data*?
*Possible use cases if performance is enhanced
32. WHAT’S NEXT?
Continue to evolve our POD architecture
● OpenStack
− Work on performance improvements and track stats on usage for departments
− Better monitoring
● Containers and PaaS
− We are currently evaluating PaaS software and container strategies
● Better DevOps Pipelining
− Improved GoCD and/or Jenkins strategies
− Continue to enhance automation and re-provisioning
− Add testing to automation
● Ceph
− Erasure coding
− Performance improvements – Ceph Hackathon showed very promising improvements
− RGW Multi-Master (multi-sync) between datacenters
− Enhanced security – encryption at rest (already possible) with better key management
− Purpose-built pools for specific use cases (e.g., lower density but blazingly fast hot-swappable NVMe SSDs)
− Possible RGW Caching. External pulls come only from CDN
34. ADDITIONAL RESOURCES
● Chris Jones: cjones303@bloomberg.net
● Twitter: @hanschrisjones, @iqstack, @cloudm2
● BCPC: https://github.com/bloomberg/chef-bcpc
− Current repo for Bloomberg’s Converged OpenStack and Ceph cluster
● Ceph Hackathon: http://pad.ceph.com/p/hackathon_2015-08
● *Soon – a pure Ceph Object Store (COS) repo will be added to Bloomberg's
GitHub
− This will have no OpenStack and only be Object Store (RGW – Rados
Gateway), no block devices (RBD)
● Other repos (automation, new projects, etc.):
− IQStack: https://github.com/iqstack - managed by me (disclosure)
− Personal: https://github.com/cloudm2 - me
− Ansible: https://github.com/ceph/ceph-ansible
− Chef: https://github.com/ceph/ceph-cookbook - this one is going through a
major overhaul and also managed by me for Ceph