Ceph:
Massively Scalable Distributed Storage
Patrick McGarry
Director Community, Inktank
Who am I?
● Patrick McGarry
● Dir Community, Inktank
● /. → ALU → P4 → Inktank
● @scuttlemonkey
● patrick@inktank.com
Outline
● What is Ceph?
● Ceph History
● How does Ceph work?
● CRUSH Fundamentals
● Applications
● Testing
● Performance
● Moving forward
● Questions
What is Ceph?
Ceph at a glance
● Software (on commodity hardware): Ceph can run on any infrastructure, metal or virtualized, to provide a cheap and powerful storage cluster.
● All-in-1 (object, block, and file): low overhead doesn't mean just hardware, it means people too!
● CRUSH (awesomesauce): infrastructure-aware placement algorithm allows you to do really cool stuff.
● Scale (huge and beyond): designed for exabytes, current implementations in the multi-petabyte. HPC, Big Data, Cloud, raw storage.
Unified storage system
● objects
  ● native
  ● RESTful
● block
  ● thin provisioning, snapshots, cloning
● file
  ● strong consistency, snapshots
Distributed storage system
● data center scale
  ● 10s to 10,000s of machines
  ● terabytes to exabytes
● fault tolerant
  ● no single point of failure
  ● commodity hardware
  ● self-managing, self-healing
Where did Ceph come from?
UCSC research grant
● “Petascale object storage”
  ● DOE: LANL, LLNL, Sandia
● Scalability
● Reliability
● Performance
  ● Raw IO bandwidth, metadata ops/sec
● HPC file system workloads
  ● Thousands of clients writing to same file, directory
Distributed metadata management
● Innovative design
  ● Subtree-based partitioning for locality, efficiency
  ● Dynamically adapt to current workload
  ● Embedded inodes
● Prototype simulator in Java (2004)
● First line of Ceph code
  ● Summer internship at LLNL
  ● High security national lab environment
  ● Could write anything, as long as it was OSS
The rest of Ceph
● RADOS – distributed object storage cluster (2005)
● EBOFS – local object storage (2004/2006)
● CRUSH – hashing for the real world (2005)
● Paxos monitors – cluster consensus (2006)
→ emphasis on consistent, reliable storage
→ scale by pushing intelligence to the edges
→ a different but compelling architecture
Industry black hole
● Many large storage vendors
  ● Proprietary solutions that don't scale well
● Few open source alternatives (2006)
  ● Limited community and architecture (Lustre)
  ● Very limited scale, or
  ● No enterprise feature sets (snapshots, quotas)
● PhD grads all built interesting systems...
  ● ...and then went to work for NetApp, DDN, EMC, Veritas.
  ● They want you, not your project
A different path
● Change the world with open source
  ● Do what Linux did to Solaris, Irix, Ultrix, etc.
  ● What could go wrong?
● License
  ● GPL, BSD...
  ● LGPL: share changes, okay to link to proprietary code
● Avoid unsavory practices
  ● Dual licensing
  ● Copyright assignment
How does Ceph work?
[Architecture overview]
● LIBRADOS (APP): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW (APP): a bucket-based REST gateway, compatible with S3 and Swift
● RBD (HOST/VM): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS (CLIENT): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
Why start with objects?
● more useful than (disk) blocks
  ● variable size
  ● names in a single flat namespace
  ● simple API with rich semantics
● more scalable than files
  ● no hard-to-distribute hierarchy
  ● update semantics do not span objects
  ● workload is trivially parallel
Ceph object model
● pools
  ● independent namespaces or object collections
  ● 1s to 100s
  ● replication level, placement policy
● objects
  ● bazillions
  ● blob of data (bytes to gigabytes)
  ● attributes (e.g., “version=12”; bytes to kilobytes)
  ● key/value bundle (bytes to gigabytes)
(a small librados sketch of this model follows)
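To make the pool/object model concrete, here is a minimal sketch using the Python librados bindings. It is not from the talk: the pool name "data" and the config path are assumptions, and binding details may vary between Ceph releases.

    import rados

    # Connect to the cluster using a local ceph.conf (assumed path).
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool (hypothetical pool name "data").
    ioctx = cluster.open_ioctx('data')

    # An object is a named blob of data in the pool's flat namespace...
    ioctx.write_full('greeting', b'hello ceph')

    # ...plus attributes (xattrs) attached to it.
    ioctx.set_xattr('greeting', 'version', b'12')

    # Read both back.
    print(ioctx.read('greeting'))                   # b'hello ceph'
    print(ioctx.get_xattr('greeting', 'version'))   # b'12'

    ioctx.close()
    cluster.shutdown()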
[Diagram: each OSD runs on top of a local file system (btrfs, xfs, or ext4) on its own disk; a small set of monitors (M) runs alongside the OSDs]
Object Storage Daemons (OSDs)
● 10s to 10,000s in a cluster
● one per disk, SSD, or RAID group, or ...
  ● hardware agnostic
● serve stored objects to clients
● intelligently peer to perform replication and recovery tasks

Monitors (M)
● maintain cluster membership and state
● provide consensus for distributed decision-making
● small, odd number
● do not serve stored objects to clients
Data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated in a dynamic cluster
● must consider physical infrastructure
  ● ceph-osds on hosts in racks in rows in data centers
● three approaches
  ● pick a spot; remember where you put it
  ● pick a spot; write down where you put it
  ● calculate where to put it, where to find it
CRUSH fundamentals
CRUSH
● pseudo-random placement algorithm
  ● fast calculation, no lookup
  ● repeatable, deterministic
  ● statistically uniform distribution
● stable mapping
  ● limited data migration on change
● rule-based configuration
  ● infrastructure topology aware
  ● adjustable replication
  ● allows weighting
[Placement diagram: Mapping, Buckets, Rules, Map]
● objects map to placement groups: hash(object name) % num pg
● placement groups map to OSDs: CRUSH(pg, cluster state, policy)
● the client computes both steps itself, so it can find any object without asking a lookup service
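As a purely conceptual illustration of that two-step mapping (not the real CRUSH algorithm, which walks a weighted topology described by the CRUSH map), the sketch below hashes an object name onto a placement group and then deterministically derives a replica set from the PG id; every function and constant here is made up for illustration.

    import hashlib

    def pg_for_object(name: str, num_pg: int) -> int:
        """Step 1: hash(object name) % num_pg."""
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % num_pg

    def toy_crush(pg: int, osds: list, replicas: int = 3) -> list:
        """Step 2, toy stand-in for CRUSH: deterministically pick
        `replicas` distinct OSDs for a PG. Real CRUSH also honors
        cluster topology, weights, and placement rules."""
        chosen, attempt = [], 0
        while len(chosen) < replicas and attempt < 10 * replicas:
            h = int(hashlib.md5(f"{pg}:{attempt}".encode()).hexdigest(), 16)
            osd = osds[h % len(osds)]
            if osd not in chosen:
                chosen.append(osd)
            attempt += 1
        return chosen

    # Same inputs always give the same answer: no lookup table, and any
    # client with the cluster map can compute placement independently.
    pg = pg_for_object("greeting", num_pg=128)
    print(pg, toy_crush(pg, osds=list(range(12))))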
Applications
[Architecture overview, repeated as a roadmap]
● LIBRADOS (APP): a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW (APP): a bucket-based REST gateway, compatible with S3 and Swift
● RBD (HOST/VM): a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS (CLIENT): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
RADOS Gateway
● REST-based object storage proxy
● uses RADOS to store objects
● API supports buckets, accounting
  ● usage accounting for billing purposes
● compatible with S3, Swift APIs (see the client sketch below)
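Because the gateway speaks the S3 dialect, a stock S3 client can talk to it by pointing at the RGW endpoint. Here is a minimal sketch with boto3; the endpoint URL, bucket name, and credentials are placeholders, not values from the talk.

    import boto3

    # Point a standard S3 client at the RADOS Gateway instead of AWS.
    s3 = boto3.client(
        's3',
        endpoint_url='http://radosgw.example.com:7480',  # placeholder endpoint
        aws_access_key_id='ACCESS_KEY',                  # placeholder key
        aws_secret_access_key='SECRET_KEY',              # placeholder secret
    )

    s3.create_bucket(Bucket='demo-bucket')
    s3.put_object(Bucket='demo-bucket', Key='hello.txt', Body=b'stored in RADOS')

    obj = s3.get_object(Bucket='demo-bucket', Key='hello.txt')
    print(obj['Body'].read())                            # b'stored in RADOS'

The same objects are reachable through the Swift API as well, since both front ends sit on the same RADOS pools.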
RADOS Block Device
● storage of disk images in RADOS
● decouple VM from host
● images striped across entire cluster (pool)
● snapshots
● copy-on-write clones
● support in
  ● Qemu/KVM, Xen
  ● mainline Linux kernel (2.6.34+)
  ● OpenStack, CloudStack
  ● Ganeti, OpenNebula
(a small Python sketch follows)
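For a feel of how an image lives in a pool, here is a small sketch with the Python rados/rbd bindings; the pool and image names are made up, and exact binding details can differ between releases.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # assumed config path
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                      # assumed pool name

    # Create a 1 GiB image; its data is striped across the pool's OSDs.
    rbd.RBD().create(ioctx, 'vm-disk-0', 1024 * 1024 * 1024)

    # Open the image and use it like a block device.
    image = rbd.Image(ioctx, 'vm-disk-0')
    image.write(b'boot sector bytes...', 0)   # write at offset 0
    image.create_snap('clean')                # point-in-time snapshot
    image.close()

    ioctx.close()
    cluster.shutdown()

A QEMU/KVM guest or the kernel rbd driver can then attach the same image by name, which is what lets a VM move between hosts without moving its disk.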
Metadata Server (MDS)
● manages metadata for POSIX shared file system
  ● directory hierarchy
  ● file metadata (size, owner, timestamps)
● stores metadata in RADOS
● does not serve file data to clients
● only required for the shared file system
Multiple protocols, implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)

[Diagram: SMB/CIFS → Samba → libcephfs; NFS → Ganesha → libcephfs; Hadoop → libcephfs; your app → libcephfs; ceph-fuse → fuse; ceph kernel client]
Testing
Package Building
● GitBuilder
  ● Very basic tool, works well though
  ● Push to git → .deb && .rpm for all platforms
  ● Builds packages / tarballs
  ● autosigned
● Pbuilder
  ● Used for major releases
  ● Clean-room type of build
  ● Signed using a different key
  ● As many distros as possible (all Debian, Ubuntu, Fedora 17 & 18, CentOS, SLES, OpenSUSE, tarball)
Teuthology
● Our own custom test framework
● Allocates the machines you define to the cluster you define
● Runs tasks against that environment
  ● Set up Ceph
  ● Run workloads
  ● e.g., pull from the Samba gitbuilder → mount Ceph → run NFS client → mount it → run workload
● Makes it easy to stack things up, map them to hosts, and run
● Suite of tests on our GitHub
● Active community development underway!
Performance
4K Relative Performance [chart]
● All Workloads: QEMU/KVM, BTRFS, or XFS
● Read-Heavy-Only Workloads: Kernel RBD, EXT4, or XFS

128K Relative Performance [chart]
● All Workloads: QEMU/KVM, or XFS
● Read-Heavy-Only Workloads: Kernel RBD, BTRFS, or EXT4

4M Relative Performance [chart]
● All Workloads: QEMU/KVM, or XFS
● Read-Heavy-Only Workloads: Kernel RBD, BTRFS, or EXT4
What does this mean?
● Anecdotal results are great!
● Concrete evidence coming soon
● Performance is complicated
  ● Workload
  ● Hardware (especially network!)
  ● Lots and lots of tuneables
● Mark Nelson (nhm on #ceph) is the performance geek
● Lots of reading on the ceph.com blog
● We welcome testing and comparison help
  ● Anyone want to help with an EMC / NetApp showdown?
Moving forward
Future plans
● Geo-Replication (an ongoing process): while the first pass for disaster recovery is done, we want to get to built-in, world-wide replication.
● Erasure Coding (reception efficiency): currently underway in the community!
● Tiering (headed to dynamic): can already do this in a static pool-based setup. Looking to get to a use-based migration.
● Governance (making it open-er): been talking about it forever. The time is coming!
Get Involved!
● CDS (quarterly online summit): the online summit puts the core devs together with the Ceph community.
● Ceph Day (not just for NYC): more planned, including Santa Clara and London. Keep an eye out: http://inktank.com/cephdays/
● IRC (geek-on-duty): during the week there are times when Ceph experts are available to help. Stop by oftc.net/ceph
● Lists (email makes the world go): our mailing lists are very active; check out ceph.com for details on how to join in!

E-MAIL: patrick@inktank.com
WEBSITE: Ceph.com
SOCIAL: @scuttlemonkey, @ceph, Facebook.com/cephstorage

Questions?
