INFINISTORE

Scalable Open Source Storage
CEPH, LIO, OPENARCHIVE
OPENARCHIVE at a glance
   Start of development in 2000
   100 man-years of development
   Scalable archive solution from 1 TB up to several PB
   Common code base for Windows and Linux
   HSM approach – data gets migrated to tapes or disks
   Filesystem interface (NTFS, POSIX) simplifies integration
   Support for all kinds of SCSI backend devices (FC, iSCSI)
   API only necessary for special purposes
   > 350 Customers worldwide
   References: Bundesarchiv, Charite, many DMS vendors
Current limitations of OPENARCHIVE

   Number of files and performance is limited by native
    filesystems (ext3/ext4, NTFS)
   Archiving and recalling of many files in parallel is slow
   Cluster file systems are not supported
   Performance is not appropriate for HPC environments
   Scalability might not be appropriate for cloud
    environments
High performance SCSI target: LIO
   LIO is the new standard SCSI target in Linux since 2.6.38
   LIO is developed and supported by RisingTide Systems
   Completely kernel based SCSI target engine
   Support for cluster systems (PR, ALUA)
   Fabric modules for:
    ●   iSCSI (TCP/IP)
    ●   FCoE
    ●   FC (Qlogic)
    ●   Infiniband (SRP)
   Transparent block-level caching for SSDs
   High-speed iSCSI initiator (multipath I/O for performance and HA)
   RTSadmin for easy administration
LIO performance

   [Benchmark chart] iSCSI numbers with a standard Debian client; the RTS iSCSI
   Initiator plus kernel patches give much higher results.
   Workload: 2 processes, 25% read – 75% write
Next generation cluster filesystem: CEPH
   Scalable storage system
    ●   1 to 1000s of nodes
    ●   Gigabytes to exabytes
   Reliable
    ●   No single points of failure
    ●   All data is replicated
    ●   Self-healing
   Self-managing
    ●   Automatically (re)distributes stored data
   Developed by Sage Weil (Dreamhost) based on his Ph.D. thesis
    ●   Upstream in Linux since 2.6.34 (client), 2.6.37 (RBD)
CEPH design goals

   Avoid traditional system designs
    ●   Single server bottlenecks, points of failure
    ●   Symmetric shared disk (SAN)
   Avoid manual workload partition
    ●   Data sets, usage grow over time
    ●   Data migration is tedious
   Avoid "passive" storage devices
    ●   Storage servers have CPUs; use them
Object storage
   Objects
    ●   Alphanumeric name
    ●   Data blob (bytes to gigabytes)
    ●   Named attributes (foo=bar)
   Object pools
    ●   Separate flat namespace
   Cluster of servers store all objects
    ●   RADOS: Reliable autonomic distributed object store
   Low-level storage infrastructure
    ●   librados, radosgw
    ●   RBD, Ceph distributed file system
CEPH data placement
   Files striped over objects
    ●   4 MB objects by default
   Objects mapped to placement groups (PGs)
    ●   pgid = hash(object) & mask
   PGs mapped to sets of OSDs
    ●   crush(cluster, rule, pgid) = [osd2, osd3]
    ●   ~100 PGs per node
    ●   Pseudo-random, statistically uniform distribution
   [Diagram: File → Objects → PGs → OSDs (grouped by failure domain); see the sketch below]
   Fast – O(log n) calculation, no lookups
   Reliable – replicas span failure domains
   Stable – adding/removing OSDs moves few PGs
CEPH storage servers

   Ceph storage nodes (OSDs)
    ●   cosd object storage daemon
    ●   btrfs volume of one or more disks
   Actively collaborate with peers
    ●   Replicate data (n times–admin can choose)
    ●   Consistently apply updates
    ●   Detect node failures
    ●   Migrate data
Object placement
   OSDs act intelligently – every OSD knows where objects are located
    ●   Coordinate writes with replica peers
    ●   Copy or migrate objects to proper location
   OSD map completely specifies data placement
    ●   OSD cluster membership and state (up/down etc.)
    ●   CRUSH function mapping objects → PGs → OSDs
   cosd will "peer" on startup or map change
    ●   Contact other replicas of PGs they store
    ●   Ensure PG contents are in sync, and stored on the correct nodes
   Identical, robust process for any map change
    ●   Node failure
    ●   Cluster expansion/contraction
    ●   Change in replication level
   Self-healing, self-managing
Why btrfs?

   Featureful
    ●   Copy on write, snapshots, checksumming, multi-device
   Leverage internal transactions, snapshots
    ●   OSDs need consistency points for sane recovery
   Hooks into copy-on-write infrastructure
    ●   Clone data content between files (objects)
   ext[34] can also work...
    ●   Inefficient snapshots, journaling
Object storage interface: librados
   Direct, parallel access to entire OSD cluster
   PaaS, SaaS applications
    ●   When objects are more appropriate than files
   C, C++, Python, Ruby, Java, PHP bindings
               rados_pool_t pool;

               rados_connect(...);
               rados_open_pool("mydata", &pool);

               rados_write(pool, "foo", 0, buf1, buflen);
               rados_read(pool, "bar", 0, buf2, buflen);
               rados_exec(pool, "baz", "class", "method",
                          inbuf, inlen, outbuf, outlen);

               rados_snap_create(pool, "newsnap");
               rados_set_snap(pool, "oldsnap");
               rados_read(pool, "bar", 0, buf2, buflen); /* old! */

               rados_close_pool(pool);
               rados_deinitialize();
Object storage interface: radosgw

   Proxy: no direct client access to storage nodes
   HTTP RESTful gateway
    ●   S3 and Swift protocols


   [Diagram: clients → HTTP → radosgw → Ceph object store]
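Since radosgw speaks plain HTTP, any REST client can talk to it. Below is a minimal libcurl sketch that fetches one object; the gateway hostname, bucket, and object name are made up for illustration, and the S3/Swift authentication headers a real deployment requires are omitted.

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        /* Hypothetical gateway endpoint and bucket -- adjust for a real setup,
         * and add S3 (AWS signature) or Swift auth headers, omitted here. */
        const char *url = "http://radosgw.example.com/mybucket/hello.txt";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl)
            return 1;

        curl_easy_setopt(curl, CURLOPT_URL, url);   /* GET the object over HTTP   */
        CURLcode rc = curl_easy_perform(curl);      /* body is written to stdout  */
        if (rc != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return rc == CURLE_OK ? 0 : 1;
    }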
RADOS block device: RBD

   Virtual disk image striped over objects
    ●   Reliable shared storage
    ●   Centralized management
    ●   VM migration between hosts
   Thinly provisioned
    ●   Consume disk only when image is written to
   Per-image snapshots
   Layering (WIP)
    ●   Copy-on-write overlay over existing 'gold' image
    ●   Fast creation or migration
RBD: RADOS block device
   Native Qemu/KVM (and libvirt) support
            $ qemu­img create ­f rbd rbd:mypool/myimage 10G
            $ qemu­system­x86_64 ­­drive format=rbd,file=rbd:mypool/myimage

   Linux kernel driver (2.6.37+)
            $ echo ”1.2.3.4 name=admin mypool myimage” > /sys/bus/rbd/add
            $ mke2fs ­j /dev/rbd0
            $ mount /dev/rbd0 /mnt


   Simple administration
    ●   CLI, librbd
            $ rbd create foo ­­size 20G
            $ rbd list
            foo
            $ rbd snap create ­­snap=asdf foo
            $ rbd resize foo ­­size=40G
            $ rbd snap create ­­snap=qwer foo
            $ rbd snap ls foo
            2       asdf    20971520
            3       qwer    41943040
Object classes

   Start with basic object methods
    ●   {read, write, zero} extent; truncate
    ●   {get, set, remove} attribute
    ●   delete
   Dynamically loadable object classes
    ●   Implement new methods based on existing ones
    ●   e.g. "calculate SHA1 hash", "rotate image", "invert matrix", etc.
   Moves computation to data
    ●   Avoid read/modify/write cycle over the network
    ●   e.g., MDS uses simple key/value methods to update objects
        containing directory content
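To illustrate the "move computation to the data" idea, here is a deliberately hypothetical sketch: the struct and helper below are not the Ceph object-class API, just an in-memory model showing that a method scans the object where it lives and only a small result (an attribute) needs to travel back to the caller, e.g. via rados_exec().

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical in-memory "object", standing in for data stored on an OSD.
     * This is NOT the Ceph object-class API; types and calls are illustrative. */
    struct object {
        char    name[64];
        uint8_t data[4096];
        size_t  len;
        char    attr_val[32];     /* one named attribute, e.g. a stored checksum */
    };

    /* A "method" that runs next to the data: it scans the object locally and
     * records only a small result as an attribute.  Invoked remotely, only the
     * method name and the tiny result cross the network, avoiding a full
     * read/modify/write cycle by the client. */
    static void method_checksum(struct object *obj)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < obj->len; i++)
            sum = sum * 31u + obj->data[i];
        snprintf(obj->attr_val, sizeof obj->attr_val, "%08x", (unsigned)sum);
    }

    int main(void)
    {
        struct object obj = { .name = "baz", .len = sizeof obj.data };
        memset(obj.data, 0x5a, obj.len);

        method_checksum(&obj);
        printf("object %s: checksum attribute = %s\n", obj.name, obj.attr_val);
        return 0;
    }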
POSIX filesystem

   Create file system hierarchy on top of objects
   Cluster of cmds daemons
    ●   No local storage – all metadata stored in objects
    ●   Lots of RAM – functions as a large, distributed,
        coherent cache arbitrating file system access
   Dynamic cluster
    ●   New daemons can be started up dynamically
    ●   Automagically load balanced
POSIX example
   [Diagram: Client talks to the MDS cluster for metadata; file I/O goes directly to the object store]
   fd=open("/foo/bar", O_RDONLY)
    ●   Client: requests open from MDS
    ●   MDS: reads directory /foo from object store
    ●   MDS: issues capability for file content
   read(fd, buf, 1024)
    ●   Client: reads data from object store
   close(fd)
    ●   Client: relinquishes capability to MDS
    ●   MDS out of I/O path
    ●   Object locations are well known – calculated from object name
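From the application's point of view, the exchange above is just ordinary POSIX I/O. A minimal sketch follows; the /mnt/ceph mount point matches the kernel-client example later in the deck, and the file path is assumed.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[1024];

        /* open(): the client asks the MDS cluster for a capability */
        int fd = open("/mnt/ceph/foo/bar", O_RDONLY);   /* mount point and path assumed */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* read(): data comes straight from the OSDs; the MDS is not involved */
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0)
            perror("read");
        else
            printf("read %zd bytes\n", n);

        /* close(): the client relinquishes its capability back to the MDS */
        close(fd);
        return 0;
    }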
Dynamic subtree partitioning
   [Diagram: directory tree (Root) partitioned across MDS 0–4;
    a busy directory is fragmented across many MDSs]
   Scalable
    ●   Arbitrarily partition metadata, 10s-100s of nodes
   Adaptive
    ●   Move work from busy to idle servers
    ●   Replicate popular metadata on multiple nodes
Workload adaptation
   Extreme shifts in workload result in redistribution of metadata across cluster
    ●   Metadata initially managed by mds0 is migrated

    [Diagram: metadata distribution for two workloads – many directories vs. the same directory]
Metadata scaling
   Up to 128 MDS nodes, and 250,000 metadata ops/second
    ●   I/O rates of potentially many terabytes/second
    ●   File systems containing many petabytes of data
Recursive accounting
   Subtree-based usage accounting
    ●   Recursive file, directory, byte counts, mtime
          $ ls -alSh | head
          total 0
          drwxr-xr-x 1 root            root      9.7T 2011-02-04 15:51 .
          drwxr-xr-x 1 root            root      9.7T 2010-12-16 15:06 ..
          drwxr-xr-x 1 pomceph         pg4194980 9.6T 2011-02-24 08:25 pomceph
          drwxr-xr-x 1 mcg_test1       pg2419992  23G 2011-02-02 08:57 mcg_test1
          drwx--x--- 1 luko            adm        19G 2011-01-21 12:17 luko
          drwx--x--- 1 eest            adm        14G 2011-02-04 16:29 eest
          drwxr-xr-x 1 mcg_test2       pg2419992 3.0G 2011-02-02 09:34 mcg_test2
          drwx--x--- 1 fuzyceph        adm       1.5G 2011-01-18 10:46 fuzyceph
          drwxr-xr-x 1 dallasceph      pg275     596M 2011-01-14 10:06 dallasceph
          $ getfattr -d -m ceph. pomceph
          # file: pomceph
          ceph.dir.entries="39"
          ceph.dir.files="37"
          ceph.dir.rbytes="10550153946827"
          ceph.dir.rctime="1298565125.590930000"
          ceph.dir.rentries="2454401"
          ceph.dir.rfiles="1585288"
          ceph.dir.rsubdirs="869113"
          ceph.dir.subdirs="2"
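The same recursive counters can be read programmatically with the standard Linux xattr calls. A small sketch using the directory and attribute name from the listing above (run from the parent directory so the relative path resolves):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>     /* Linux getxattr(2) */

    int main(void)
    {
        char val[64];

        /* Recursive byte count of the subtree, maintained by the MDS.
         * "pomceph" is the directory from the listing above. */
        ssize_t n = getxattr("pomceph", "ceph.dir.rbytes", val, sizeof val - 1);
        if (n < 0) {
            perror("getxattr");
            return 1;
        }
        val[n] = '\0';
        printf("ceph.dir.rbytes = %s\n", val);
        return 0;
    }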
Fine-grained snapshots
   Snapshot arbitrary directory subtrees
    ●   Volume or subvolume granularity cumbersome at petabyte scale
   Simple interface
         $ mkdir foo/.snap/one    # create snapshot
         $ ls foo/.snap
         one
         $ ls foo/bar/.snap
         _one_1099511627776       # parent's snap name is mangled
         $ rm foo/myfile
          $ ls -F foo
         bar/
         $ ls foo/.snap/one
         myfile  bar/
         $ rmdir foo/.snap/one    # remove snapshot



   Efficient storage
    ●   Leverages copy-on-write at storage layer (btrfs)
POSIX filesystem client

   POSIX; strong consistency semantics
    ●   Processes on different hosts interact as if on same host
    ●   Client maintains consistent data/metadata caches
   Linux kernel client
           # modprobe ceph
           # mount -t ceph 10.3.14.95:/ /mnt/ceph
           # df -h /mnt/ceph
           Filesystem            Size  Used Avail Use% Mounted on
           10.3.14.95:/           95T   29T   66T  31% /mnt/ceph

   Userspace client
    ●   cfuse FUSE-based client
    ●   libceph library (ceph_open(), etc.)
    ●   Hadoop, Hypertable client modules (libceph)
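For programs that should not go through a mount at all, the userspace library mentioned above (ceph_open() etc.) can be used directly. The sketch below follows the later libcephfs naming (ceph_create/ceph_mount/ceph_open/...); the header path and exact signatures in the libceph version referenced on this slide may differ, so treat the details as assumptions.

    #include <stdio.h>
    #include <fcntl.h>
    #include <cephfs/libcephfs.h>   /* header name as in later libcephfs releases */

    int main(void)
    {
        struct ceph_mount_info *cmount;
        char buf[1024];

        if (ceph_create(&cmount, NULL) < 0)          /* NULL: default client id    */
            return 1;
        ceph_conf_read_file(cmount, NULL);           /* read the default ceph.conf */
        if (ceph_mount(cmount, "/") < 0) {           /* mount the root of the FS   */
            fprintf(stderr, "mount failed\n");
            ceph_shutdown(cmount);
            return 1;
        }

        int fd = ceph_open(cmount, "/foo/bar", O_RDONLY, 0);
        if (fd >= 0) {
            int n = ceph_read(cmount, fd, buf, sizeof buf, 0);   /* offset 0 */
            printf("read %d bytes\n", n);
            ceph_close(cmount, fd);
        }

        ceph_unmount(cmount);
        ceph_shutdown(cmount);
        return 0;
    }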
Infinistore: CEPH, LIO, OPENARCHIVE

   CEPH implements complete POSIX semantics
   OpenArchive plugs into VFS layer of Linux
   Integrated setup of CEPH, LIO and
    OpenArchive provides:
    ●   Scalable cluster filesystem
    ●   Scalable HSM for cloud and HPC setups
    ●   Scalable, highly available SAN storage (iSCSI, FC, IB)
Architecture

   [Diagram: four access paths layered on a common CEPH cluster (CEPH MDS + CEPH OSD):
    ●   Archive: OpenArchive on top of the CEPH filesystem
    ●   ClusterFS: CEPH filesystem
    ●   SAN: block-level initiators → LIO targets backed by RBD
    ●   Object store: S3/Swift via RADOSGW]
Contact

Thomas Uhl

Cell: +49 170 7917711


thomas.uhl@graudata.com
tlu@risingtidesystems.com


www.twitter.com/tuhl
de.wikipedia.org/wiki/Thomas_Uhl
