managing a distributed
storage system at scale
     sage weil – inktank
      LISA – 12.12.12
outline
●   the scale-out world
●   what is it, what it's for, how it works
●   how you can use it
    ●   librados
    ●   radosgw
    ●   RBD, the ceph block device
    ●   distributed file system
●   how do you deploy and manage it
●   roadmap
●   why we do this, who we are
why should you care about another storage system?
the old reality
●   direct-attached storage
    ●   raw disks
    ●   RAID controller
●   network-attached storage
    ●   a few large arrays
    ●   NFS, CIFS, maybe iSCSI
●   SAN
    ●   expensive enterprise array
●   scalable capacity and performance, to a point
●   poor management scalability
new user demands
●   “cloud”
    ●   many disk volumes, with automated provisioning
    ●   larger data sets
        –   RESTful object storage (e.g., S3, Swift)
        –   NoSQL distributed key/value stores (Cassandra, Riak)
●   “big data”
    ●   distributed processing
    ●   “unstructured data”
●   ...and the more traditional workloads
requirements
●   diverse use-cases
    ●   object storage
    ●   block devices (for VMs) with snapshots, cloning
    ●   shared file system with POSIX, coherent caches
    ●   structured data... files, block devices, or objects?
●   scale
    ●   terabytes, petabytes, exabytes
    ●   heterogeneous hardware
    ●   reliability and fault tolerance
cost
●   near-linear function of size or performance
●   incremental expansion
    ●   no fork-lift upgrades
●   no vendor lock-in
    ●   choice of hardware
    ●   choice of software
●   open
time
●   ease of administration
●   no manual data migration, load balancing
●   painless scaling
    ●   expansion and contraction
    ●   seamless migration


●   devops friendly
being a good devops citizen
●   seamless integration with infrastructure tools
    ●   Chef, Juju, Puppet, Crowbar, or home-grown
    ●   collectd, statsd, graphite, nagios
    ●   OpenStack, CloudStack
●   simple processes for
    ●   expanding or contracting the storage cluster
    ●   responding to failure scenarios: initial healing, or mop-up
●   evolution of “Unix philosophy”
    ●   everything is just a daemon
    ●   CLI, REST APIs, JSON
    ●   solve one problem, and solve it well
what is ceph?
unified storage system
●   objects
    ●   native API
    ●   RESTful
●   block
    ●   thin provisioning, snapshots, cloning
●   file
    ●   strong consistency, snapshots
[Architecture diagram: APPs use LIBRADOS or RADOSGW, HOSTs/VMs use RBD, CLIENTs use CEPH FS – all built on RADOS]

LIBRADOS – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RADOSGW – a bucket-based REST gateway, compatible with S3 and Swift
RBD – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

RADOS – the Ceph object store: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
distributed storage system
●   data center scale
    ●   10s to 10,000s of machines
    ●   terabytes to exabytes
●   fault tolerant
    ●   no single point of failure
    ●   commodity hardware
●   self-managing, self-healing
ceph object model
●   pools
    ●   1s to 100s
    ●   independent object namespaces or collections
    ●   replication level, placement policy
●   objects
    ●   bazillions
    ●   blob of data (bytes to gigabytes)
    ●   attributes (e.g., “version=12”; bytes to kilobytes)
    ●   key/value bundle (bytes to gigabytes)
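As a rough illustration of that model (a conceptual sketch only, not the librados API – the pool and object names below are made up), an object can be pictured as a named blob plus its attributes and key/value bundle inside a pool's flat namespace:

        # conceptual sketch of the ceph object model (illustrative only):
        # a pool is an independent namespace of named objects, and each
        # object carries a data blob, small attributes, and a k/v bundle
        pool = {
            "rbd": {                                # pool name (hypothetical)
                "image.0000000000a1": {             # object name, flat namespace
                    "data": b"\x00" * 4194304,      # blob of data (bytes to gigabytes)
                    "attrs": {"version": "12"},     # attributes (bytes to kilobytes)
                    "omap": {"key1": b"value1"},    # key/value bundle (bytes to gigabytes)
                },
            },
        }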
why start with objects?
●   more useful than (disk) blocks
    ●   names in a simple flat namespace
    ●   variable size
    ●   simple API with rich semantics
●   more scalable than files
    ●   no hard-to-distribute hierarchy
    ●   update semantics do not span objects
    ●   workload is trivially parallel
How do you design a storage system that scales?
[Diagram: one HUMAN, one COMPUTER, many DISKs]
[Diagram: several HUMANs sharing one COMPUTER and its many DISKs]
[Diagram: many HUMANs crowded around a single (COMPUTER) and its DISKs – “actually more like this…”]
[Diagram: HUMANs each responsible for many COMPUTER + DISK pairs]
[Diagram: OSD daemons, each on a local file system (btrfs, xfs, or ext4) on its own DISK, plus monitors (M)]
Monitors (M)
●   maintain cluster membership and state
●   provide consensus for distributed decision-making
●   small, odd number
●   these do not serve stored objects to clients

Object Storage Daemons (OSDs)
●   10s to 10,000s in a cluster
●   one per disk, SSD, or RAID group
    ●   hardware agnostic
●   serve stored objects to clients
●   intelligently peer to perform replication and recovery tasks
[Diagram: one HUMAN administers the whole cluster through the monitors (M)]
data distribution
●   all objects are replicated N times
●   objects are automatically placed, balanced, migrated
    in a dynamic cluster
●   must consider physical infrastructure
    ●   ceph-osds on hosts in racks in rows in data centers

●   three approaches
    ●   pick a spot; remember where you put it
    ●   pick a spot; write down where you put it
    ●   calculate where to put it, where to find it
CRUSH
●   pseudo-random placement algorithm
    ●   fast calculation, no lookup
    ●   repeatable, deterministic
●   statistically uniform distribution
●   stable mapping
    ●   limited data migration on change
●   rule-based configuration
    ●   infrastructure topology aware
    ●   adjustable replication
    ●   allows weighting
        pg = hash(object name) % num_pg

        CRUSH(pg, cluster state, policy) → OSDs

[Diagram: object names hash into placement groups (PGs, shown as bit strings); CRUSH then maps each PG onto a set of OSDs]
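A toy sketch of those two steps (this is a stand-in for CRUSH, not the real algorithm; the cluster map and replica count below are invented for illustration):

        # toy two-step placement: hash object -> PG, then a repeatable
        # pseudo-random PG -> OSDs mapping (NOT real CRUSH, which walks a
        # weighted hierarchy of hosts/racks/rows and applies placement rules)
        import hashlib

        NUM_PG   = 128
        OSDS     = ["osd.%d" % i for i in range(12)]   # pretend cluster map
        REPLICAS = 3

        def object_to_pg(name):
            # step 1: hash the object name into a placement group
            return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PG

        def pg_to_osds(pg):
            # step 2: deterministic, repeatable pick of REPLICAS osds for the pg
            ranked = sorted(OSDS,
                            key=lambda o: hashlib.md5(("%d:%s" % (pg, o)).encode()).hexdigest())
            return ranked[:REPLICAS]

        pg = object_to_pg("my-object")
        print(pg, pg_to_osds(pg))   # every client computes the same answer – no lookup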
RADOS
●   monitors publish osd map that describes cluster state
    ●   ceph-osd node status (up/down, weight, IP)
    ●   CRUSH function specifying desired data distribution
●   object storage daemons (OSDs)
    ●   safely replicate and store objects
    ●   migrate data as the cluster changes over time
    ●   coordinate based on shared view of reality
●   decentralized, distributed approach allows
    ●   massive scales (10,000s of servers or more)
    ●   the illusion of a single copy with consistent behavior
[Diagram: a CLIENT with data to store – which OSDs does it talk to?]
[Architecture diagram repeated, highlighting LIBRADOS]
[Diagram: an APP links LIBRADOS and speaks the native protocol directly to the monitors (M) and OSDs]
librados
●   direct access to RADOS from applications
●   C, C++, Python, PHP, Java, Erlang
●   direct access to storage nodes
●   no HTTP overhead
rich librados API
●   atomic single-object transactions
    ●   update data, attr together
    ●   atomic compare-and-swap
●   efficient key/value storage inside an object
●   object-granularity snapshot infrastructure
●   embed code in ceph-osd daemon via plugin API
●   inter-client communication via object
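For example, a minimal write/read with the python-rados bindings might look like the sketch below (it assumes a reachable cluster and an existing pool named 'data'; the names are illustrative):

        # minimal librados sketch using the python-rados bindings
        import rados

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()                                     # talk to monitors, get maps
        ioctx = cluster.open_ioctx('data')                    # pool = independent namespace
        try:
            ioctx.write_full('greeting', b'hello world')      # object data blob
            ioctx.set_xattr('greeting', 'version', b'12')     # small attribute
            print(ioctx.read('greeting'))                     # -> b'hello world'
        finally:
            ioctx.close()
            cluster.shutdown()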
[Architecture diagram repeated, highlighting RADOSGW]
[Diagram: APPs speak REST to RADOSGW instances, which use LIBRADOS and the native protocol to reach the cluster]
RADOS Gateway
●   REST-based object storage proxy
●   uses RADOS to store objects
●   API supports buckets, accounting
●   usage accounting for billing purposes
●   compatible with S3, Swift APIs
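Because the API is S3-compatible, a stock S3 client can be pointed at the gateway; a sketch with boto (the endpoint, credentials, and bucket name here are hypothetical):

        # S3-compatible access to radosgw with boto (illustrative values)
        import boto
        import boto.s3.connection

        conn = boto.connect_s3(
            aws_access_key_id='ACCESS_KEY',
            aws_secret_access_key='SECRET_KEY',
            host='radosgw.example.com',               # the gateway, not AWS
            is_secure=False,
            calling_format=boto.s3.connection.OrdinaryCallingFormat(),
        )
        bucket = conn.create_bucket('my-bucket')
        key = bucket.new_key('hello.txt')
        key.set_contents_from_string('hello world')   # stored as RADOS objects behind the gateway
        print([k.name for k in bucket.list()])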
[Architecture diagram repeated, highlighting RBD]
[Diagram: COMPUTER + DISK nodes forming the storage cluster]
[Diagram: VMs whose disk images live on the cluster of COMPUTER + DISK nodes]
[Diagram: a VM in a virtualization container (KVM) accesses its image through LIBRBD and LIBRADOS, which talk to the cluster (M)]
[Diagram: two virtualization containers, each with LIBRBD and LIBRADOS, sharing the same cluster – the VM can move between them]
[Diagram: a HOST maps an image with KRBD (kernel module) over LIBRADOS, talking to the cluster (M)]
RADOS Block Device
●   storage of disk images in RADOS
●   decouple VM from host
●   images striped across entire cluster (pool)
●   snapshots
●   copy-on-write clones
●   support in
    ●   Qemu/KVM
    ●   mainline Linux kernel (2.6.39+)
    ●   OpenStack, CloudStack
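A sketch of creating an image and a snapshot with the python-rbd bindings (the pool and image names are made up, and a running cluster is assumed):

        # create a 10 GB RBD image and snapshot it (illustrative sketch)
        import rados
        import rbd

        cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
        cluster.connect()
        ioctx = cluster.open_ioctx('rbd')                    # pool holding the images

        rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)   # striped across the pool

        image = rbd.Image(ioctx, 'vm-disk-1')
        image.write(b'\x00' * 512, 0)                        # write the first sector
        image.create_snap('base')                            # point-in-time snapshot
        image.close()

        print(rbd.RBD().list(ioctx))                         # -> ['vm-disk-1']
        ioctx.close()
        cluster.shutdown()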
[Architecture diagram repeated, highlighting CEPH FS]
[Diagram: a CLIENT exchanges metadata with the metadata servers and reads/writes file data directly to the object store]
Metadata Server (MDS)
●   manages metadata for the POSIX shared file system
    ●   directory hierarchy
    ●   file metadata (size, owner, timestamps)
●   stores metadata in RADOS
●   does not serve file data to clients
●   only required for the shared file system
one tree
three metadata servers
??
DYNAMIC SUBTREE PARTITIONING
recursive accounting
●   ceph-mds tracks recursive directory stats
    ●   file sizes
    ●   file and directory counts
    ●   modification time
●   virtual xattrs present full stats
●   efficient

        $ ls -alSh | head
        total 0
        drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
        drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
        drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
        drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
        drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
        drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
        drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
        drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
        drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
snapshots
●   volume or subvolume snapshots unusable at petabyte scale
    ●   snapshot arbitrary subdirectories
●   simple interface
    ●   hidden '.snap' directory
    ●   no special tools

        $ mkdir foo/.snap/one      # create snapshot
        $ ls foo/.snap
        one
        $ ls foo/bar/.snap
        _one_1099511627776         # parent's snap name is mangled
        $ rm foo/myfile
        $ ls -F foo
        bar/
        $ ls -F foo/.snap/one
        myfile bar/
        $ rmdir foo/.snap/one      # remove snapshot
multiple protocols, implementations
●   Linux kernel client
    ●   mount -t ceph 1.2.3.4:/ /mnt
    ●   export (NFS), Samba (CIFS)
●   ceph-fuse
●   libcephfs.so
    ●   your app
    ●   Samba (CIFS)
    ●   Ganesha (NFS)
    ●   Hadoop (map/reduce)

[Diagram: protocol stacks – NFS via Ganesha and SMB/CIFS via Samba on libcephfs; Hadoop and your app on libcephfs; ceph-fuse via the kernel FUSE interface]
How do you deploy and manage all this stuff?
installation
●   work closely with distros
    ●   packages in Debian, Fedora, Ubuntu, OpenSUSE
●   build releases for all distros, recent releases
    ●   e.g., point to our deb src and apt-get install
●   allow mixed-version clusters
    ●   upgrade each component individually, each host or rack one-by-one
    ●   protocol feature bits, safe data type encoding
●   facilitate testing, bleeding edge
    ●   automatic build system generates packages for all branches in git
    ●   easy to test new code, push hot-fixes
configuration
●   minimize local configuration
    ●   logging, local data paths, tuning options
    ●   cluster state is managed centrally by monitors
●   specify options via
    ●   config file (global or local)
    ●   command line
    ●   adjustment on a running daemon
●   flexible configuration management options
    ●   global synced/copied config file
    ●   generated by management tools
         –   Chef, Juju, Crowbar
●   be flexible
provisioning
●   embrace dynamic nature of the cluster
    ●   disks, hosts, racks may come online at any time
    ●   anything may fail at any time
●   simple scriptable sequences
        'ceph osd create <uuid>'    → allocate osd id
        'ceph-osd --mkfs -i <id>'   → initialize local data dir
        'ceph osd crush …'          → add to crush map
●   identify minimal amount of central coordination
    ●   monitor cluster membership/quorum
●   provide hooks for external tools to do the rest
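A sketch of driving that scriptable sequence from Python (illustrative only; the exact 'ceph osd crush' arguments are elided here just as on the slide):

        # drive the osd provisioning sequence above via subprocess (sketch)
        import subprocess

        def sh(cmd):
            # run a shell command, return stripped stdout
            return subprocess.check_output(cmd, shell=True).decode().strip()

        uuid   = sh("uuidgen")
        osd_id = sh("ceph osd create %s" % uuid)    # allocate osd id
        sh("ceph-osd --mkfs -i %s" % osd_id)        # initialize local data dir
        # sh("ceph osd crush ...")                  # add to crush map (arguments omitted)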
old school HA
●   deploy a pair of servers
●   heartbeat
●   move a floating IP between them
●   clients contact same “server”, which floats

●   cumbersome to configure, deploy
●   the “client/server” model doesn't scale out
make client, protocol cluster-aware
●   clients learn topology of the distributed cluster
    ●   all OSDs, current ips/ports, data distribution map
●   clients talk to the server they want
    ●   servers scale out, client traffic distributes too
●   servers dynamically register with the cluster
    ●   when a daemon starts, it updates its address in the map
    ●   other daemons and clients learn map updates via gossip
●   servers manage heartbeat, replication
    ●   no fragile configuration
    ●   no fail-over pairs
ceph disk management
●   label disks
    ●   GPT partition type (fixed uuid)
●   udev
    ●   generate event when disk is added, on boot
●   'ceph-disk-activate /dev/sdb'
    ●   mount the disk in the appropriate location (/var/lib/ceph/osd/NNN)
    ●   possibly adjust cluster metadata about disk location (host, rack)
●   upstart, sysvinit, …
    ●   start the daemon
    ●   daemon “joins” the cluster, brings itself online
●   no manual per-node configuration
distributed system health
●   ceph-mon collects, aggregates basic system state
    ●   daemon up/down, current IPs
    ●   disk utilization
●   simple hooks query system health
    ●   'ceph health [detail]'
    ●   trivial plugins for nagios, etc.
●   daemon instrumentation
    ●   query state of running daemon
    ●   use external tools to aggregate
         –   collectd, graphite
         –   statsd
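For instance, a trivial nagios-style plugin could just wrap 'ceph health' (a sketch; the exit codes follow the usual nagios convention):

        # trivial nagios-style check wrapping 'ceph health' (sketch)
        import subprocess
        import sys

        out = subprocess.check_output(["ceph", "health"]).decode().strip()
        print(out)
        if out.startswith("HEALTH_OK"):
            sys.exit(0)      # OK
        elif out.startswith("HEALTH_WARN"):
            sys.exit(1)      # WARNING
        else:
            sys.exit(2)      # CRITICAL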
many paths
●   do it yourself
    ●   use ceph interfaces directly
●   Chef
    ●   ceph cookbooks, DreamHost cookbooks
●   Crowbar
●   Juju
●   pre-packaged solutions
    ●   Piston AirFrame, Dell OpenStack-Powered Cloud Solution,
        Suse Cloud, etc.
Project status and roadmap
[Architecture diagram repeated, with maturity labels: LIBRADOS – AWESOME, RADOSGW – AWESOME, RBD – AWESOME, CEPH FS – NEARLY AWESOME, RADOS – AWESOME]
current status
●   argonaut stable release v0.48
    ●   rados, RBD, radosgw
●   bobtail stable release v0.56
    ●   RBD cloning
    ●   improved performance, scaling, failure behavior
    ●   radosgw API, performance improvements
    ●   release in 1-2 weeks
cuttlefish roadmap
●   file system
    ●   pivot in engineering focus
    ●   CIFS (Samba), NFS (Ganesha), Hadoop
●   RBD
    ●   Xen integration, iSCSI
●   radosgw
    ●   multi-site federation, disaster recovery
●   RADOS
    ●   geo-replication, disaster recovery
    ●   ongoing performance improvements
why we do this
●   limited options for scalable open source storage
●   proprietary solutions
    ●   expensive
    ●   don't scale (well or out)
    ●   marry hardware and software
●   users hungry for alternatives
    ●   scalability
    ●   cost
    ●   features
    ●   open, interoperable
licensing
    <yawn>
●   promote adoption
●   enable community development
●   prevent ceph from becoming proprietary
●   allow organic commercialization
LGPLv2
●   “copyleft”
    ●   free distribution
    ●   allow derivative works
    ●   changes you distribute/sell must be shared
●   ok to link to proprietary code
    ●   allow proprietary products to include and build on ceph
    ●   does not allow proprietary derivatives of ceph
fragmented copyright
●   we do not require copyright assignment from
    contributors
    ●   no single person or entity owns all of ceph
    ●   no single entity can make ceph proprietary
●   strong community
    ●   many players make ceph a safe technology bet
    ●   project can outlive any single business
why it's important
●   ceph is an ingredient
    ●   we need to play nice in a larger ecosystem
    ●   community will be key to success
●   truly open source solutions are disruptive
    ●   frictionless integration with projects, platforms, tools
    ●   freedom to innovate on protocols
    ●   leverage community testing, development resources
    ●   open collaboration is an efficient way to build technology
who we are
●   Ceph created at UC Santa Cruz (2004-2007)
●   developed by DreamHost (2008-2011)
●   supported by Inktank (2012)
    ●   Los Angeles, Sunnyvale, San Francisco, remote
●   growing user and developer community
    ●   Linux distros, users, cloud stacks, SIs, OEMs


                       http://ceph.com/
thanks




sage weil
sage@inktank.com   http://github.com/ceph
@liewegas          http://ceph.com/
why we like btrfs
●   pervasive checksumming
●   snapshots, copy-on-write
●   efficient metadata (xattrs)
●   inline data for small files
●   transparent compression
●   integrated volume management
    ●   software RAID, mirroring, error recovery
    ●   SSD-aware
●   online fsck
●   active development community

Weitere ähnliche Inhalte

Ähnlich wie Ceph LISA'12 Presentation

Storage Developer Conference - 09/19/2012
Storage Developer Conference - 09/19/2012Storage Developer Conference - 09/19/2012
Storage Developer Conference - 09/19/2012Ceph Community
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationCeph Community
 
New Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondNew Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondOpenStack Foundation
 
Ceph Day NYC: The Future of CephFS
Ceph Day NYC: The Future of CephFSCeph Day NYC: The Future of CephFS
Ceph Day NYC: The Future of CephFSCeph Community
 
Storing VMs with Cinder and Ceph RBD.pdf
Storing VMs with Cinder and Ceph RBD.pdfStoring VMs with Cinder and Ceph RBD.pdf
Storing VMs with Cinder and Ceph RBD.pdfOpenStack Foundation
 
Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turkbuildacloud
 
London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSCeph Community
 
Webinar - Getting Started With Ceph
Webinar - Getting Started With CephWebinar - Getting Started With Ceph
Webinar - Getting Started With CephCeph Community
 
NAVER Ceph Storage on ssd for Container
NAVER Ceph Storage on ssd for ContainerNAVER Ceph Storage on ssd for Container
NAVER Ceph Storage on ssd for ContainerJangseon Ryu
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewMarcel Hergaarden
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017 Karan Singh
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Storage with Ceph (OSDC 2013)
Storage with Ceph (OSDC 2013)Storage with Ceph (OSDC 2013)
Storage with Ceph (OSDC 2013)hastexo
 
Openstack with ceph
Openstack with cephOpenstack with ceph
Openstack with cephIan Colle
 
Inside Triton, July 2015
Inside Triton, July 2015Inside Triton, July 2015
Inside Triton, July 2015Casey Bisson
 

Ähnlich wie Ceph LISA'12 Presentation (20)

Storage Developer Conference - 09/19/2012
Storage Developer Conference - 09/19/2012Storage Developer Conference - 09/19/2012
Storage Developer Conference - 09/19/2012
 
Block Storage For VMs With Ceph
Block Storage For VMs With CephBlock Storage For VMs With Ceph
Block Storage For VMs With Ceph
 
XenSummit - 08/28/2012
XenSummit - 08/28/2012XenSummit - 08/28/2012
XenSummit - 08/28/2012
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph Replication
 
New Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and BeyondNew Features for Ceph with Cinder and Beyond
New Features for Ceph with Cinder and Beyond
 
Ceph Day NYC: The Future of CephFS
Ceph Day NYC: The Future of CephFSCeph Day NYC: The Future of CephFS
Ceph Day NYC: The Future of CephFS
 
Storing VMs with Cinder and Ceph RBD.pdf
Storing VMs with Cinder and Ceph RBD.pdfStoring VMs with Cinder and Ceph RBD.pdf
Storing VMs with Cinder and Ceph RBD.pdf
 
Ceph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross TurkCeph Intro and Architectural Overview by Ross Turk
Ceph Intro and Architectural Overview by Ross Turk
 
London Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFSLondon Ceph Day: The Future of CephFS
London Ceph Day: The Future of CephFS
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Webinar - Getting Started With Ceph
Webinar - Getting Started With CephWebinar - Getting Started With Ceph
Webinar - Getting Started With Ceph
 
NAVER Ceph Storage on ssd for Container
NAVER Ceph Storage on ssd for ContainerNAVER Ceph Storage on ssd for Container
NAVER Ceph Storage on ssd for Container
 
Red Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) OverviewRed Hat Storage 2014 - Product(s) Overview
Red Hat Storage 2014 - Product(s) Overview
 
Ceph Introduction 2017
Ceph Introduction 2017  Ceph Introduction 2017
Ceph Introduction 2017
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Storage with Ceph (OSDC 2013)
Storage with Ceph (OSDC 2013)Storage with Ceph (OSDC 2013)
Storage with Ceph (OSDC 2013)
 
Openstack with ceph
Openstack with cephOpenstack with ceph
Openstack with ceph
 
librados
libradoslibrados
librados
 
librados
libradoslibrados
librados
 
Inside Triton, July 2015
Inside Triton, July 2015Inside Triton, July 2015
Inside Triton, July 2015
 

Kürzlich hochgeladen

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Kürzlich hochgeladen (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Ceph LISA'12 Presentation

  • 1. managing a distributed storage system at scale sage weil – inktank LISA – 12.12.12
  • 2. outline ● the scale-out world ● what is it, what it's for, how it works ● how you can use it ● librados ● radosgw ● RBD, the ceph block device ● distributed file system ● how do you deploy and manage it ● roadmap ● why we do this, who we are
  • 3. why should you care about another storage system?
  • 4. the old reality ● direct-attached storage ● raw disks ● RAID controller ● network-attached storage ● a few large arrays ● NFS, CIFS, maybe iSCSI ● SAN ● expensive enterprise array ● scalable capacity and performance, to a point ● poor management scalability
  • 5. new user demands ● “cloud” ● many disk volumes, with automated provisioning ● larger data sets – RESTful object storage (e.g., S3, Swift) – NoSQL distributed key/value stores (Cassandra, Riak) ● “big data” ● distributed processing ● “unstructured data” ● ...and the more traditional workloads
  • 6. requirements ● diverse use-cases ● object storage ● block devices (for VMs) with snapshots, cloning ● shared file system with POSIX, coherent caches ● structured data... files, block devices, or objects? ● scale ● terabytes, petabytes, exabytes ● heterogeneous hardware ● reliability and fault tolerance
  • 7. cost ● near-linear function of size or performance ● incremental expansion ● no fork-lift upgrades ● no vendor lock-in ● choice of hardware ● choice of software ● open
  • 8. time ● ease of administration ● no manual data migration, load balancing ● painless scaling ● expansion and contraction ● seamless migration ● devops friendly
  • 9. being a good devops citizen ● seamless integration with infrastructure tools ● Chef, Juju, Puppet, Crowbar, or home-grown ● collectd, statsd, graphite, nagios ● OpenStack, CloudStack ● simple processes for ● expanding or contracting the storage cluster ● responding to failure scenarios: initial healing, or mop-up ● evolution of “Unix philosophy” ● everything is just daemon ● CLI, REST APIs, JSON ● solve one problem, and solve it well
  • 11. unified storage system ● objects ● native API ● RESTful ● block ● thin provisioning, snapshots, cloning ● file ● strong consistency, snapshots
  • 12. APP APP APP APP HOST/VM HOST/VM CLIENT CLIENT RADOSGW RADOSGW RBD RBD CEPH FS CEPH FS LIBRADOS LIBRADOS A bucket-based A bucket-based A reliable and A reliable and A POSIX-compliant A POSIX-compliant A library allowing A library allowing REST gateway, REST gateway, fully-distributed fully-distributed distributed file distributed file apps to directly apps to directly compatible with S3 compatible with S3 block device, with aa block device, with system, with aa system, with access RADOS, access RADOS, and Swift and Swift Linux kernel client Linux kernel client Linux kernel client Linux kernel client with support for with support for and aaQEMU/KVM and QEMU/KVM and support for and support for C, C++, Java, C, C++, Java, driver driver FUSE FUSE Python, Ruby, Python, Ruby, and PHP and PHP RADOS ––the Ceph object store RADOS the Ceph object store A reliable, autonomous, distributed object store comprised of self-healing, self-managing, A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes intelligent storage nodes
  • 13. distributed storage system ● data center scale ● 10s to 10,000s of machines ● terabytes to exabytes ● fault tolerant ● no single point of failure ● commodity hardware ● self-managing, self-healing
  • 14. ceph object model ● pools ● 1s to 100s ● independent object namespaces or collections ● replication level, placement policy ● objects ● bazillions ● blob of data (bytes to gigabytes) ● attributes (e.g., “version=12”; bytes to kilobytes) ● key/value bundle (bytes to gigabytes)
  • 15. why start with objects? ● more useful than (disk) blocks ● names in a simple flat namespace ● variable size ● simple API with rich semantics ● more scalable than files ● no hard-to-distribute hierarchy ● update semantics do not span objects ● workload is trivially parallel
  • 16. How do you design a storage system that scales?
• 17. [diagram: one HUMAN and one COMPUTER attached to a handful of DISKs]
• 18. [diagram: more HUMANs and many more DISKs, still behind a single COMPUTER]
• 19. [diagram: a crowd of HUMANs around one (COMPUTER) and its DISKs; “(actually more like this…)”]
• 20. [diagram: many COMPUTER+DISK pairs, with only a few HUMANs to manage them]
• 21. [diagram: OSD daemons, each on a local file system (btrfs, xfs, or ext4) on its own DISK, plus monitors (M)]
• 22. Monitors (M) ● maintain cluster membership and state ● provide consensus for distributed decision-making ● small, odd number ● do not serve stored objects to clients. Object Storage Daemons (OSDs) ● 10s to 10,000s in a cluster ● one per disk, SSD, or RAID group ● hardware agnostic ● serve stored objects to clients ● intelligently peer to perform replication and recovery tasks
  • 23. HUMAN M M M
  • 24. data distribution ● all objects are replicated N times ● objects are automatically placed, balanced, migrated in a dynamic cluster ● must consider physical infrastructure ● ceph-osds on hosts in racks in rows in data centers ● three approaches ● pick a spot; remember where you put it ● pick a spot; write down where you put it ● calculate where to put it, where to find it
  • 25. CRUSH ● pseudo-random placement algorithm ● fast calculation, no lookup ● repeatable, deterministic ● statistically uniform distribution ● stable mapping ● limited data migration on change ● rule-based configuration ● infrastructure topology aware ● adjustable replication ● allows weighting
• 26. [diagram: objects hashed into placement groups, placement groups mapped onto OSDs] hash(object name) % num_pg → placement group; CRUSH(pg, cluster state, policy) → ordered list of OSDs (see the sketch below)
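To make the “calculate, don't look up” idea concrete, here is a toy Python sketch of the same two-step mapping. It is not the real CRUSH algorithm (no topology, bucket types, or weights); the pool size, OSD names, and replica count are made up for illustration.

    import hashlib

    PG_NUM = 64                                   # placement groups in the pool (made up)
    OSDS = ['osd.%d' % i for i in range(12)]      # pretend cluster map
    REPLICAS = 3

    def object_to_pg(name):
        # hash(object name) % num_pg
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return h % PG_NUM

    def pg_to_osds(pg, osds, replicas):
        # deterministic, lookup-free stand-in for CRUSH(pg, cluster state, policy):
        # rank OSDs by a hash of (pg, osd) and take the first N distinct ones.
        # real CRUSH additionally honors infrastructure topology and weighting.
        ranked = sorted(osds,
                        key=lambda o: hashlib.md5(('%d-%s' % (pg, o)).encode()).hexdigest())
        return ranked[:replicas]

    pg = object_to_pg('myobject')
    print('myobject -> pg %d -> %s' % (pg, pg_to_osds(pg, OSDS, REPLICAS)))

Any client holding the same map computes the same answer, which is why no central lookup table or broker is needed.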
• 28. RADOS ● monitors publish the osd map that describes cluster state – ceph-osd node status (up/down, weight, IP) – the CRUSH function specifying the desired data distribution ● object storage daemons (OSDs) ● safely replicate and store objects ● migrate data as the cluster changes over time ● coordinate based on a shared view of reality ● decentralized, distributed approach allows ● massive scales (10,000s of servers or more) ● the illusion of a single copy with consistent behavior
• 32. [diagram: a CLIENT facing the cluster, with question marks: how does it get at the data?]
• 33. [architecture diagram repeated, with LIBRADOS highlighted]
• 34. [diagram: an APP linked against LIBRADOS speaking the native protocol directly to the OSDs and monitors (M)]
• 35. librados ● direct access to RADOS from applications ● C, C++, Python, PHP, Java, Erlang ● direct access to storage nodes ● no HTTP overhead
  • 36. rich librados API ● atomic single-object transactions ● update data, attr together ● atomic compare-and-swap ● efficient key/value storage inside an object ● object-granularity snapshot infrastructure ● embed code in ceph-osd daemon via plugin API ● inter-client communication via object
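As a concrete taste of the API, a minimal sketch using the Python rados bindings. It assumes a reachable cluster, a readable /etc/ceph/ceph.conf plus keyring, and an existing pool named 'data'; the pool and object names are just examples.

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    ioctx = cluster.open_ioctx('data')              # one pool = one object namespace
    ioctx.write_full('greeting', b'hello ceph')     # object data (a blob of bytes)
    ioctx.set_xattr('greeting', 'version', b'12')   # small attribute stored with the object
    print(ioctx.read('greeting'))
    print(ioctx.get_xattr('greeting', 'version'))

    ioctx.close()
    cluster.shutdown()

Each call goes straight to the OSDs responsible for the object; there is no intermediary server in the data path.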
• 37. [architecture diagram repeated, with RADOSGW highlighted]
• 38. [diagram: APPs speak REST to RADOSGW instances, which use LIBRADOS to talk natively to the OSDs and monitors (M)]
  • 39. RADOS Gateway ● REST-based object storage proxy ● uses RADOS to store objects ● API supports buckets, accounting ● usage accounting for billing purposes ● compatible with S3, Swift APIs
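Because the gateway speaks the S3 dialect, stock S3 clients work against it. A minimal sketch with boto; the gateway hostname, port, and credentials below are placeholders for whatever your radosgw deployment uses.

    import boto
    import boto.s3.connection

    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',             # placeholder credentials
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',                 # placeholder gateway endpoint
        port=80,
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    bucket = conn.create_bucket('my-bucket')        # same calls you would issue against S3
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('hello from radosgw')
    for obj in bucket.list():
        print('%s (%d bytes)' % (obj.name, obj.size))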
• 40. [architecture diagram repeated, with RBD highlighted]
• 41. [diagram: many COMPUTER+DISK nodes]
• 42. [diagram: the same COMPUTER+DISK nodes, now hosting VMs]
• 43. [diagram: a VM inside a virtualization container (KVM); the host links LIBRBD and LIBRADOS to reach the OSDs and monitors (M)]
• 44. [diagram: a VM and two CONTAINERs, each linking LIBRBD and LIBRADOS, in front of the OSDs and monitors (M)]
• 45. [diagram: a HOST using KRBD (kernel module) with LIBRADOS to reach the OSDs and monitors (M)]
  • 46. RADOS Block Device ● storage of disk images in RADOS ● decouple VM from host ● images striped across entire cluster (pool) ● snapshots ● copy-on-write clones ● support in ● Qemu/KVM ● mainline Linux kernel (2.6.39+) ● OpenStack, CloudStack
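A minimal sketch of those operations through the Python rbd bindings, assuming a pool named 'rbd' and a client allowed to create images in it; the image and snapshot names are examples.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    rbd.RBD().create(ioctx, 'vm-disk-1', 10 * 1024**3)   # 10 GiB image, thin provisioned

    image = rbd.Image(ioctx, 'vm-disk-1')
    image.write(b'bootstrap data', 0)                    # write at offset 0
    image.create_snap('golden')                          # point-in-time snapshot
    print('%d bytes, snaps: %s'
          % (image.size(), [s['name'] for s in image.list_snaps()]))
    image.close()

    ioctx.close()
    cluster.shutdown()

The same image can then be attached to a guest via the QEMU/KVM driver or the kernel module; the data itself is striped over RADOS objects across the cluster.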
• 47. [architecture diagram repeated, with CEPH FS highlighted]
• 48. [diagram: a CLIENT sends metadata operations down one path and file data directly to the OSDs; monitors (M) shown]
• 50. Metadata Server (MDS) ● manages metadata for the POSIX shared file system ● directory hierarchy ● file metadata (size, owner, timestamps) ● stores metadata in RADOS ● does not serve file data to clients ● only required for the shared file system
• 57. recursive accounting ● ceph-mds tracks recursive directory stats ● file sizes ● file and directory counts ● modification time ● virtual xattrs present full stats ● efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
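The recursive stats are exposed as virtual extended attributes on each directory, so scripts can read them without walking the tree. A small sketch (Python 3 on Linux; the mount point below is an assumption):

    import os

    MNT = '/mnt/ceph/somedir'                               # assumed kernel-client mount point

    rbytes   = int(os.getxattr(MNT, 'ceph.dir.rbytes'))     # recursive total size in bytes
    rentries = int(os.getxattr(MNT, 'ceph.dir.rentries'))   # recursive file + directory count
    print('%s: %d entries, %d bytes' % (MNT, rentries, rbytes))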
• 58. snapshots ● volume or subvolume snapshots unusable at petabyte scale ● snapshot arbitrary subdirectories ● simple interface ● hidden '.snap' directory ● no special tools
$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one          # remove snapshot
• 59. multiple protocols, implementations ● Linux kernel client ● mount -t ceph 1.2.3.4:/ /mnt ● export (NFS), Samba (SMB/CIFS) ● ceph-fuse ● libcephfs.so ● your app ● Samba (CIFS) ● Ganesha (NFS) ● Hadoop (map/reduce) [diagram: the kernel client, ceph-fuse, and libcephfs consumers (your app, Samba, Ganesha, Hadoop) as parallel paths into the file system]
  • 60. How do you deploy and manage all this stuff?
• 61. installation ● work closely with distros ● packages in Debian, Fedora, Ubuntu, OpenSUSE ● build releases for all distros, recent releases ● e.g., point to our deb src and apt-get install ● allow mixed-version clusters ● upgrade each component individually, each host or rack one-by-one ● protocol feature bits, safe data type encoding ● facilitate testing, bleeding edge ● automatic build system generates packages for all branches in git ● easy to test new code, push hot-fixes
  • 62. configuration ● minimize local configuration ● logging, local data paths, tuning options ● cluster state is managed centrally by monitors ● specify option via ● config file (global or local) ● command line ● adjusted for running daemon ● flexible configuration management options ● global synced/copied config file ● generated by management tools – Chef, Juju, Crowbar ● be flexible
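A minimal sketch of what that local configuration can look like; hostnames, addresses, and IDs below are placeholders, and real deployments usually generate this file with Chef, Juju, or similar rather than writing it by hand:

    [global]
        auth supported = cephx

    [osd]
        osd journal size = 1000        ; MB, local tuning only

    [mon.a]
        host = mon-host-1
        mon addr = 10.0.0.1:6789

    [osd.0]
        host = osd-host-1

Everything else (membership, maps, data placement) lives with the monitors, not in this file.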
• 63. provisioning ● embrace dynamic nature of the cluster ● disks, hosts, racks may come online at any time ● anything may fail at any time ● simple scriptable sequences
'ceph osd create <uuid>'      → allocate osd id
'ceph-osd --mkfs -i <id>'     → initialize local data dir
'ceph osd crush …'            → add to crush map
● identify minimal amount of central coordination ● monitor cluster membership/quorum ● provide hooks for external tools to do the rest
• 64. old school HA ● deploy a pair of servers ● heartbeat ● move a floating IP between them ● clients contact the same “server”, which floats ● cumbersome to configure, deploy ● the “client/server” model doesn't scale out
• 65. make client, protocol cluster-aware ● clients learn topology of the distributed cluster ● all OSDs, current IPs/ports, data distribution map ● clients talk to the server they want ● servers scale out, client traffic distributes too ● servers dynamically register with the cluster ● when a daemon starts, it updates its address in the map ● other daemons and clients learn map updates via gossip ● servers manage heartbeat, replication ● no fragile configuration ● no fail-over pairs
  • 66. ceph disk management ● label disks ● GPT partition type (fixed uuid) ● udev ● generate event when disk is added, on boot ● 'ceph-disk-activate /dev/sdb' ● mount the disk in the appropriate location (/var/lib/ceph/osd/NNN) ● possibly adjust cluster metadata about disk location (host, rack) ● upstart, sysvinit, … ● start the daemon ● daemon “joins” the cluster, brings itself online ● no manual per-node configuration
• 67. distributed system health ● ceph-mon collects, aggregates basic system state ● daemon up/down, current IPs ● disk utilization ● simple hooks query system health ● 'ceph health [detail]' ● trivial plugins for nagios, etc. ● daemon instrumentation ● query state of running daemon ● use external tools to aggregate – collectd, graphite – statsd
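A trivial probe in that spirit, shelling out to the ceph CLI; it assumes the ceph client is installed on the monitoring host and can reach the monitors:

    import subprocess

    def ceph_ok():
        # 'ceph health' prints HEALTH_OK, HEALTH_WARN ..., or HEALTH_ERR ...
        out = subprocess.check_output(['ceph', 'health']).decode().strip()
        return out.startswith('HEALTH_OK'), out

    ok, detail = ceph_ok()
    print('%s - %s' % ('OK' if ok else 'PROBLEM', detail))

Mapping the boolean onto nagios exit codes (0 for OK, 2 for critical) is all a basic check needs.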
  • 68. many paths ● do it yourself ● use ceph interfaces directly ● Chef ● ceph cookbooks, DreamHost cookbooks ● Crowbar ● Juju ● pre-packaged solutions ● Piston AirFrame, Dell OpenStack-Powered Cloud Solution, Suse Cloud, etc.
• 70. [architecture diagram repeated, annotated with maturity: LIBRADOS, RADOSGW, RBD, and RADOS labeled AWESOME; CEPH FS labeled NEARLY AWESOME]
  • 71. current status ● argonaut stable release v0.48 ● rados, RBD, radosgw ● bobtail stable release v0.56 ● RBD cloning ● improved performance, scaling, failure behavior ● radosgw API, performance improvements ● release in 1-2 weeks
• 72. cuttlefish roadmap ● file system ● pivot in engineering focus ● CIFS (Samba), NFS (Ganesha), Hadoop ● RBD ● Xen integration, iSCSI ● radosgw ● multi-site federation, disaster recovery ● RADOS ● geo-replication, disaster recovery ● ongoing performance improvements
  • 73. why we do this ● limited options for scalable open source storage ● proprietary solutions ● expensive ● don't scale (well or out) ● marry hardware and software ● users hungry for alternatives ● scalability ● cost ● features ● open, interoperable
  • 74. licensing <yawn> ● promote adoption ● enable community development ● prevent ceph from becoming proprietary ● allow organic commercialization
• 75. LGPLv2 ● “copyleft” ● free distribution ● allow derivative works ● changes you distribute/sell must be shared ● ok to link to proprietary code ● allow proprietary products to include and build on ceph ● does not allow proprietary derivatives of ceph
  • 76. fragmented copyright ● we do not require copyright assignment from contributors ● no single person or entity owns all of ceph ● no single entity can make ceph proprietary ● strong community ● many players make ceph a safe technology bet ● project can outlive any single business
• 77. why it's important ● ceph is an ingredient ● we need to play nice in a larger ecosystem ● community will be key to success ● truly open source solutions are disruptive ● frictionless integration with projects, platforms, tools ● freedom to innovate on protocols ● leverage community testing, development resources ● open collaboration is an efficient way to build technology
  • 78. who we are ● Ceph created at UC Santa Cruz (2004-2007) ● developed by DreamHost (2008-2011) ● supported by Inktank (2012) ● Los Angeles, Sunnyvale, San Francisco, remote ● growing user and developer community ● Linux distros, users, cloud stacks, SIs, OEMs http://ceph.com/
  • 79. thanks sage weil sage@inktank.com http://github.com/ceph @liewegas http://ceph.com/
  • 81. why we like btrfs ● pervasive checksumming ● snapshots, copy-on-write ● efficient metadata (xattrs) ● inline data for small files ● transparent compression ● integrated volume management ● software RAID, mirroring, error recovery ● SSD-aware ● online fsck ● active development community