an intro to ceph and big data
patrick mcgarry – inktank
Big Data Workshop – 27 JUN 2013
what is ceph?
 distributed storage system
− reliable system built with unreliable components
− fault tolerant, no SPoF
 commodity hardware
− expensive arrays, controllers, specialized networks not required
 large scale (10s to 10,000s of nodes)
− heterogeneous hardware (no fork-lift upgrades)
− incremental expansion (or contraction)
 dynamic cluster
what is ceph?
 unified storage platform
− scalable object + compute storage platform
− RESTful object storage (e.g., S3, Swift)
− block storage
− distributed file system
 open source
− LGPL server-side
− client support in mainline Linux kernel
RADOS – the Ceph object store
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP

RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps use LIBRADOS or the RADOSGW REST gateway, hosts/VMs use RBD, and clients use CEPH FS; all of these sit on top of RADOS]
[Diagram: a storage node's disks each carry a local filesystem (btrfs, xfs, ext4, zfs?) with a ceph-osd daemon on top; monitor (M) nodes sit alongside the OSDs]
[Diagram: an object name is hashed to a placement group with hash(object name) % num_pg, and CRUSH(pg, cluster state, policy) maps that placement group onto a set of OSDs; the client performs this calculation itself, so there is no central lookup table]
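To make the placement idea concrete, here is a toy Python sketch. It is not the real CRUSH implementation; the hash choice, pool size, and replica policy are purely illustrative. The point is that an object name maps deterministically to a placement group and the placement group maps deterministically to a set of OSDs, so any client can repeat the calculation without asking a central server.

import hashlib

NUM_PG = 64              # hypothetical pool with 64 placement groups
OSDS = list(range(10))   # hypothetical cluster with 10 OSDs

def pg_for_object(name):
    # hash(object name) % num_pg
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % NUM_PG

def osds_for_pg(pg, replicas=3):
    # stand-in for CRUSH(pg, cluster state, policy): pseudo-random but
    # repeatable, so the same inputs always yield the same OSD set
    h = int(hashlib.md5(("pg-%d" % pg).encode()).hexdigest(), 16)
    start = h % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

pg = pg_for_object("my-object")
print(pg, osds_for_pg(pg))   # same answer on every client, every time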
So what about big data?
 CephFS
 s/HDFS/CephFS/g
 Object Storage
 Key-value store
librados
 direct access to RADOS from applications
 C, C++, Python, PHP, Java, Erlang
 direct access to storage nodes
 no HTTP overhead
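As a rough sketch of what that direct access looks like from Python, using the rados module that ships with Ceph; the pool name 'data' and the config path are assumptions for illustration:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')   # standard client config
cluster.connect()
ioctx = cluster.open_ioctx('data')          # open an I/O context on a pool
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))           # talks straight to the OSDs, no HTTP
ioctx.close()
cluster.shutdown()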
rich librados API
 efficient key/value storage inside an object
 atomic single-object transactions
− update data, attr, keys together
− atomic compare-and-swap
 object-granularity snapshot infrastructure
 inter-client communication via object
 embed code in ceph-osd daemon via plugin API
− arbitrary atomic object mutations, processing
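For example, the key/value (omap) storage looks roughly like this with the Python binding's write-op/read-op interface. A minimal sketch assuming a pool named 'data' already exists; the object name and keys are made up, and the grouped write op is applied as a single transaction on the OSD.

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# store key/value pairs inside the object 'inventory' in one write op
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('sku-1', 'sku-2'), ('12 units', '7 units'))
    ioctx.operate_write_op(op, 'inventory')

# read the keys back
with rados.ReadOpCtx() as op:
    vals, ret = ioctx.get_omap_vals(op, "", "", 10)
    ioctx.operate_read_op(op, 'inventory')
    for key, value in vals:
        print(key, value)

ioctx.close()
cluster.shutdown()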
Data and compute
 RADOS Embedded Object Classes
 Moves compute directly adjacent to data
 C++ by default
 Lua bindings available
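Calling one of these embedded classes from a client is a single librados operation. A hedged sketch, assuming Ceph's bundled example class 'hello' with its 'say_hello' method is built into the OSDs, and that your rados Python binding exposes the Ioctx.execute() wrapper for object-class calls (newer versions do); names and the pool are assumptions:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')         # hypothetical pool name

ioctx.write_full('greeting', b'')          # the class method runs against this object
# invoke class 'hello', method 'say_hello' on the OSD that holds 'greeting';
# the computation happens next to the data, only the result crosses the network
ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'ceph')
print(ret, out)

ioctx.close()
cluster.shutdown()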
die, POSIX, die
 successful exascale architectures will replace or transcend POSIX
− hierarchical model does not distribute
 line between compute and storage will blur
− some processing is data-local, some is not
 fault tolerance will be first-class property of architecture
− for both computation and storage
POSIX – I'm not dead yet!
 CephFS builds POSIX namespace on top of RADOS
− metadata managed by ceph-mds daemons
− stored in objects
 strong consistency, stateful client protocol
− heavy prefetching, embedded inodes
 architected for HPC workloads
− distribute namespace across cluster of MDSs
− mitigate bursty workloads
− adapt distribution as workloads shift over time
[Diagram: clients send metadata requests to a cluster of ceph-mds servers (M) and exchange file data directly with RADOS; one tree, three metadata servers]
DYNAMIC SUBTREE PARTITIONING
recursive accounting
 ceph-mds tracks recursive directory stats
− file sizes
− file and directory counts
− modification time
 efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root root 9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root root 9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1 pg2419992 23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko adm 19G 2011-01-21 12:17 luko
drwx--x--- 1 eest adm 14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2 pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph adm 1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275 596M 2011-01-14 10:06 dallasceph
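The same recursive stats are also exposed as virtual extended attributes on CephFS directories (ceph.dir.rbytes, ceph.dir.rfiles, ceph.dir.rsubdirs, ...), so scripts can read them without walking the tree. A small sketch; the mount point and directory are placeholders, and os.getxattr is Linux-only:

import os

path = '/mnt/ceph/pomceph'   # hypothetical CephFS mount + directory
for attr in ('ceph.dir.rbytes', 'ceph.dir.rfiles', 'ceph.dir.rsubdirs'):
    print(attr, os.getxattr(path, attr).decode())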
snapshots
 snapshot arbitrary subdirectories
 simple interface
− hidden '.snap' directory
− no special tools
$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot
how can you help?
 try ceph and tell us what you think
− http://ceph.com/resources/downloads
 http://ceph.com/resources/mailing-list-irc/
− ask if you need help
 ask your organization to start dedicating resources to the project (http://github.com/ceph)
 find a bug (http://tracker.ceph.com) and fix it
 participate in our ceph developer summit
− http://ceph.com/events/ceph-developer-summit
questions?
thanks
patrick mcgarry
patrick@inktank.com
@scuttlemonkey
http://github.com/ceph
http://ceph.com/
Editor's notes

  1. RADOS is a distributed object store, and it’s the foundation for Ceph. On top of RADOS, the Ceph team has built three applications that allow you to store data and do fantastic things. But before we get into all of that, let’s start at the beginning of the story.
  2. Let's start with RADOS, the Reliable Autonomic Distributed Object Store. In this example, you've got five disks in a computer. You have initialized each disk with a filesystem (btrfs is the right filesystem to use someday, but until it's stable we recommend XFS). On each filesystem, you deploy a Ceph OSD (Object Storage Daemon). That computer, with its five disks and five object storage daemons, becomes a single node in a RADOS cluster. Alongside these nodes are monitor nodes, which keep track of the current state of the cluster and provide users with an entry point into the cluster (although they do not serve any data themselves).
  3. With CRUSH, the data is first split into a certain number of sections. These are called “placement groups”. The number of placement groups is configurable. Then, the CRUSH algorithm runs, having received the latest cluster map and a set of placement rules, and it determines where the placement group belongs in the cluster. This is a pseudo-random calculation, but it’s also repeatable; given the same cluster state and rule set, it will always return the same results.
  4. Each placement group is run through CRUSH and stored in the cluster. Notice how no node has received more than one copy of a placement group, and no two nodes contain the same information? That’s important.
  5. When it comes time to store an object in the cluster (or retrieve one), the client calculates where it belongs.
  6. What happens, though, when a node goes down? The OSDs are always talking to each other (and the monitors), and they know when something is amiss. The third and fifth node on the top row have noticed that the second node on the bottom row is gone, and they are also aware that they have replicas of the missing data.
  7. The OSDs collectively use the CRUSH algorithm to determine how the cluster should look based on its new state, and move the data to where clients running CRUSH expect it to be.
  8. Because of the way placement is calculated instead of centrally controlled, node failures are transparent to clients.
  9. Most people will default to discussions about CephFS when confronted with either Big Data or HPC applications. This can mean using CephFS by itself, or perhaps as a drop-in replacement for HDFS. [NOT READY ARGUMENT] There are a couple of other options, however. You can use librados to talk directly to the object store. One user I know actually plugged Hadoop in at this level, instead of using CephFS. Ceph also has a pretty decent key-value store proof-of-concept done by an intern last year. It's based on a b-tree structure but uses a fixed height of two levels instead of a true tree structure. This draws from both a normal B-Tree and Google BigTable. Would love to see someone do more with it.
  10. I mentioned librados; this is the low-level library that allows you to directly access a RADOS cluster from your application. It has native language bindings for C, C++, Python, etc. This is obviously the fastest way to get at your data and comes with no inherent overhead or translation layer.
  11. For most object systems an object is just a bunch of bytes, maybe some extended attributes. With Ceph you can store a lot more than that. You can store key/value pairs inside an object; think BerkeleyDB or SQL where each object is a logical container. It supports atomic transactions, so you can do things like atomic compare-and-swap: update the bytes and the keys/values in an atomic fashion, and it will be consistently distributed and replicated across a cluster in a safe way. There is snapshotting that will give you per-object snapshots, and inter-client communication for locking and whatnot. The really exciting part about this is the ability to implement your own functionality on the OSD.....
  12. These embedded object classes allow you to send an object method call to the cluster, and it will actually perform that computation without having to pull the data over the network. The downside to using these object classes is the injection of new functionality into the system. A compiled C++ plugin has to be delivered and dynamically loaded into each OSD process. This becomes more complicated if a cluster is composed of multiple architecture targets, and makes it difficult to update functionality on the fly. One approach to addressing these problems is to embed a language runtime within the OSD. Noah Watkins, one of our engineers, tackled this with some Lua bindings, which are available.
  13. One of the more contentious assertions that Sage likes to make is that as we move towards exascale computing and beyond, we'll need to transcend or replace POSIX. The hierarchical model just doesn't scale well beyond a certain level. Future models are going to have to start blurring the line between compute and storage, recognizing when data is local to an operation versus when it has to be gathered from multiple sources. And finally, fault tolerance needs to become a first-class property of these architectures. As we push the scale of our existing architectures, building things like burst buffers to deal with huge checkpoints across millions of cores just doesn't make a whole lot of sense.
  14. Having said all that, there are too many things (both people and code) built using POSIX mentality to ditch it any time soon. CephFS is designed to provide that POSIX layer on top of RADOS. [read slide] Now, as we've said there is certainly some work to be done on CephFS, but I want to share a bit about how it works since it (and similar thinking) will play a big part of Ceph's HPC and Big Data applications going forward.
  15. CephFS adds a metadata server (or MDS) to the list of node types in your Ceph cluster. Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  16. There are multiple MDSs!
  17. So how do you have one tree and multiple servers?
  18. If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  19. When the second one comes along, it will intelligently partition the work by taking a subtree.
  20. When the third MDS arrives, it will attempt to split the tree again.
  21. Same with the fourth.
  22. An MDS can actually take even just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it's called "dynamic subtree partitioning". This is done as a periodic load-balance exchange. The transfer just ships the cache contents between MDSs and lets the clients continue transparently.
  23. CephFS has some neat features that you don't find in most file systems. Because they built the filesystem namespace from the ground up, they were able to build these features into the infrastructure. One of these features is recursive accounting. The MDSs keep track of directory stats for every dir in the file system. For instance, when you do an 'ls -al', the file size shown is actually the total number of bytes stored recursively under that directory. It's the same thing you'd get from 'du', but in real time.
  24. [Provides snapshots] The motivation here is once you start talking about petabytes and exabytes it doesn't make much sense to try to snapshot the entire tree. You need to be able to snapshot different directories and different data sets. You can add and remove snapshots for any directory with standard bash-type commands.
  25. Also, the next Ceph Developer Summit is coming soon, to plan for the Emperor release. Would love to see some blueprints submitted for CephFS work.