USENIX 2009

   ZFS Tutorial
Richard.Elling@RichardElling.com
Agenda
 ●   Overview
 ●   Foundations
 ●   Pooled Storage Layer
 ●   Transactional Object Layer
 ●   Commands
        –   zpool
        –   zfs
 ●   Sharing
 ●   Properties
 ●   More goodies
 ●   Performance
 ●   Wrap


June 13, 2009                 © 2009 Richard Elling        2
History

●   Announced September 14, 2004
●   Integration history
     –   SXCE b27 (November 2005)
     –   FreeBSD (April 2007)
     –   Mac OSX Leopard (~ June 2007)
     –   OpenSolaris 2008.05
     –   Solaris 10 6/06 (June 2006)
     –   Linux FUSE (summer 2006)
     –   greenBytes ZFS+ (September 2008)
●   More than 45 patents, contributed to the CDDL Patents Common



June 13, 2009                   © 2009 Richard Elling              3
Brief List of Features
●   Future-proof
●   Cutting-edge data integrity
●   High performance
●   Simplified administration
●   Eliminates need for volume managers
●   Reduced costs
●   Compatibility with POSIX file system & block devices
●   Self-healing

●   “No silent data corruption ever”
●   “Mind-boggling scalability”
●   “Breathtaking speed”
●   “Near zero administration”
●   “Radical new architecture”
●   “Greatly simplifies support issues”
●   “RAIDZ saves money”

                                                          Marketing: 2 drink minimum

June 13, 2009                     © 2009 Richard Elling                               4
ZFS Design Goals

●   Figure out why storage has gotten so complicated
●   Blow away 20+ years of obsolete assumptions
●   Gotta replace UFS
●   Design an integrated system from scratch
●   End the suffering




June 13, 2009                  © 2009 Richard Elling     5
Limits

   2^48 — Number of entries in any individual directory
   2^56 — Number of attributes of a file [1]
   2^56 — Number of files in a directory [1]
   16 EiB (2^64 bytes) — Maximum size of a file system
   16 EiB — Maximum size of a single file
   16 EiB — Maximum size of any attribute
   2^64 — Number of devices in any pool
   2^64 — Number of pools in a system
   2^64 — Number of file systems in a pool
   2^64 — Number of snapshots of any file system
   256 ZiB (2^78 bytes) — Maximum size of any pool
   [1] actually constrained to 2^48 for the number of files in a ZFS file system


June 13, 2009                      © 2009 Richard Elling                          6
Sidetrack: Understanding Builds

●   Build is often referenced when speaking of feature/bug integration
●   Short-hand notation: b#
●   OpenSolaris and SXCE are based on NV (Solaris Nevada)
●   ZFS development is done for NV
     –   Bi-weekly build cycle
     –   Schedule at http://opensolaris.org/os/community/on/schedule/
●   ZFS is ported to Solaris 10 and other OSes




June 13, 2009                    © 2009 Richard Elling                   7
Foundations


June 13, 2009      © 2009 Richard Elling   8
Overhead View of a Pool

                [Diagram: a pool contains its configuration information plus
                datasets – file systems and volumes]




June 13, 2009                      © 2009 Richard Elling                  9
Layer View

Consumers:        raw | swap | dump | iSCSI | ??          ZFS | NFS | CIFS | ??

Interface:        ZFS Volume Emulator (Zvol)    ZFS POSIX Layer (ZPL)    pNFS | Lustre | ??

                           Transactional Object Layer

                               Pooled Storage Layer

                               Block Device Driver

Devices:          HDD | SSD | iSCSI | ??



June 13, 2009                        © 2009 Richard Elling                                10
Source Code Structure
 User:
                File system & device consumers, GUI (via JNI), management tools (via libzfs)

 Kernel:
                Interface Layer               ZPL | ZVol | /dev/zfs
                Transactional Object Layer    ZIL | ZAP | Traversal | DMU | DSL
                Pooled Storage Layer          ARC | ZIO | VDEV | Configuration

June 13, 2009                             © 2009 Richard Elling                          11
Acronyms
 ●   ARC – Adaptive Replacement Cache
 ●   DMU – Data Management Unit
 ●   DSL – Dataset and Snapshot Layer
 ●   JNI – Java Native Interface
 ●   ZPL – ZFS POSIX Layer (traditional file system interface)
 ●   VDEV – Virtual Device layer
 ●   ZAP – ZFS Attribute Processor
 ●   ZIL – ZFS Intent Log
 ●   ZIO – ZFS I/O layer
 ●   Zvol – ZFS volume (raw/cooked block device interface)




June 13, 2009                  © 2009 Richard Elling                      12
nvlists

●   name=value pairs
●   libnvpair(3LIB)
●   Allows ZFS capabilities to change without changing the physical on-
    disk format
●   Data stored is XDR encoded
●   A good thing, used often




June 13, 2009                  © 2009 Richard Elling                      13
Versioning
 ●   Features can be added and identified by nvlist entries
 ●   Changes in pool or dataset versions do not change the physical on-disk
     format (!)
        –    they do change nvlist parameters
 ●   Older versions can be used
        –    might see warning messages, but harmless
 ●   Available versions and features can be easily viewed
        –    zpool upgrade -v
        –    zfs upgrade -v
 ●   Online references
        –    zpool: www.opensolaris.org/os/community/zfs/version/N
        –    zfs: www.opensolaris.org/os/community/zfs/version/zpl/N

            Don't confuse zpool and zfs versions
June 13, 2009                      © 2009 Richard Elling                 14
zpool versions
VER    DESCRIPTION
---    --------------------------------------------------------
 1     Initial ZFS version
 2     Ditto blocks (replicated metadata)
 3     Hot spares and double parity RAID-Z
 4     zpool history
 5     Compression using the gzip algorithm
 6     bootfs pool property
 7     Separate intent log devices
 8     Delegated administration
 9     refquota and refreservation properties
 10    Cache devices
 11    Improved scrub performance
 12    Snapshot properties
 13    snapused property
 14    passthrough-x aclinherit support
 15    user and group quotas
 16    COMSTAR support
June 13, 2009                     © 2009 Richard Elling                15
zfs versions
VER    DESCRIPTION
---    --------------------------------------------------------
 1     Initial ZFS filesystem version
 2     Enhanced directory entries
 3     Case insensitive and File system unique identifier (FUID)
 4     user and group quotas




June 13, 2009                   © 2009 Richard Elling              16
Copy on Write
                [Diagram, four stages:]
                1. Initial block tree
                2. COW some data
                3. COW metadata
                4. Update Uberblocks & free




June 13, 2009                           © 2009 Richard Elling                                 17
COW Notes
 ●   COW works on blocks, not files
 ●   ZFS reserves 32 MBytes or
     1/64 of pool size
        –   COWs need some free
             space to remove files
        –   need space for ZIL
 ●   For fixed-record size workloads
     “fragmentation” and “poor
     performance” can occur if the
     recordsize is not matched
 ●   Spatial distribution is good
     fodder for performance
     speculation
        –   affects HDDs
        –   moot for SSDs
June 13, 2009                       © 2009 Richard Elling           18
Pooled Storage Layer

   [Layer View diagram repeated – see “Layer View” above; this section covers the
   Pooled Storage Layer]




June 13, 2009                       © 2009 Richard Elling                                19
vdevs – Virtual Devices
                                  Logical vdevs

                 root vdev
                   top-level vdev, children[0]: mirror
                     leaf vdev, children[0]: type=disk
                     leaf vdev, children[1]: type=disk
                   top-level vdev, children[1]: mirror
                     leaf vdev, children[0]: type=disk
                     leaf vdev, children[1]: type=disk

                              Physical or leaf vdevs

June 13, 2009                            © 2009 Richard Elling                           20
vdev Labels

●   vdev labels != disk labels
●   4 labels written to every physical vdev
●   Label size = 256kBytes
●   Two-stage update process
      –   write label0 & label2
      –   check for errors
      –   write label1 & label3


    On-disk layout (device of size N):
    label0 (0–256k), label1 (256k–512k), boot block (512k–4M), ... ,
    label2 (N-512k), label3 (N-256k)


June 13, 2009                            © 2009 Richard Elling                             21
vdev Label Contents

    On-disk layout (device of size N):
    label0 (0–256k), label1 (256k–512k), boot block (512k–4M), ... ,
    label2 (N-512k), label3 (N-256k)

    Inside each 256k label:
    blank (0–8k), boot header (8k–16k), name=value pairs (16k–128k),
    128-slot uberblock array (128k–256k)




June 13, 2009                                 © 2009 Richard Elling                                  22
Observing Labels
# zdb -l /dev/rdsk/c0t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    name='rpool'
    state=0
    txg=13152
    pool_guid=17111649328928073943
    hostid=8781271
    hostname=''
    top_guid=11960061581853893368
    guid=11960061581853893368
    vdev_tree
        type='disk'
        id=0
        guid=11960061581853893368
        path='/dev/dsk/c0t0d0s0'
        devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a'
        phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a'
        whole_disk=0
        metaslab_array=24
        metaslab_shift=30
        ashift=9
        asize=157945167872
        is_log=0




June 13, 2009                           © 2009 Richard Elling            23
Uberblocks

●   1 kByte
●   Stored in 128-entry circular queue
●   Only one uberblock is active at any time
     –   highest transaction group number
     –   correct SHA-256 checksum
●   Stored in machine's native format
     –   A magic number is used to determine endian format when
           imported
●   Contains pointer to MOS




June 13, 2009                   © 2009 Richard Elling             24
MOS – Meta Object Set

●   Only one MOS per pool
●   Contains object directory pointers
     –   root_dataset – references all top-level datasets in the pool
     –   config – nvlist describing the pool configuration
     –   sync_bplist – list of block pointers which need to be freed during
           the next transaction




June 13, 2009                     © 2009 Richard Elling                       25
Block Pointers
 ●   blkptr_t structure
 ●   128 bytes
 ●   contents:
        –   3x data virtual address (DVA)
        –   endianness
        –   level of indirection
        –   DMU object type
        –   checksum function
        –   compression function
        –   physical size
        –   logical size
        –   birth txg
        –   fill count
        –   checksum (256 bits)
June 13, 2009                      © 2009 Richard Elling                26
DVA – Data Virtual Address

●   Contains
     –   vdev id
     –   offset in sectors
     –   grid (future)
     –   allocated size
     –   gang block indicator
●   Physical block address = (offset << 9) + 4 MBytes




June 13, 2009                   © 2009 Richard Elling   27
Gang Blocks

●   Gang blocks contain block pointers
●   Used when space requested is not available in a contiguous block
●   512 bytes
●   self checksummed
●   contains 3 block pointers




June 13, 2009                   © 2009 Richard Elling                  28
To fsck or not to fsck
●   fsck was created to fix known inconsistencies in file system metadata
      –   UFS is not transactional
      –   metadata inconsistencies must be reconciled
      –   does NOT repair data – how could it?
●   ZFS doesn't need fsck, as-is
      –   all on-disk changes are transactional
      –   COW means previously existing, consistent metadata is not
           overwritten
      –   ZFS can repair itself
                ●   metadata is at least dual-redundant
                ●   data can also be redundant
●   Reality check – this does not mean that ZFS is not susceptible to
    corruption
      –   nor is any other file system
June 13, 2009                           © 2009 Richard Elling               29
VDEV


June 13, 2009   © 2009 Richard Elling   30
Dynamic Striping
 ●   RAID-0
        –   SNIA definition: fixed-length sequences of virtual disk data
             addresses are mapped to sequences of member disk
             addresses in a regular rotating pattern
 ●   Dynamic Stripe
        –   Data is dynamically mapped to member disks
        –   No fixed-length sequences
        –   Allocate up to ~1 MByte/vdev before changing vdev
        –   vdevs can be different size
        –   Good combination of the concatenation feature with RAID-0
             performance




June 13, 2009                      © 2009 Richard Elling                   31
Dynamic Striping

    RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes




    ZFS Dynamic Stripe recordsize = 128 kBytes




                 Total write size = 2816 kBytes

June 13, 2009                © 2009 Richard Elling               32
Mirroring
 ●   Straightforward: put N copies of the data on N vdevs
 ●   Unlike RAID-1
        –   No 1:1 mapping at the block level
        –   vdev labels are still at beginning and end
        –   vdevs can be of different size
                ●   effective space is that of smallest vdev
 ●   Arbitration: ZFS does not blindly trust either side of mirror
        –   Most recent, correct view of data wins
        –   Checksums validate data




June 13, 2009                           © 2009 Richard Elling           33
Mirroring




June 13, 2009   © 2009 Richard Elling           34
Dynamic vdev Replacement

●    zpool replace poolname vdev [vdev]
●    Today, replacing vdev must be same size or larger (as measured by
     blocks)
●    Replacing all vdevs in a top-level vdev with larger vdevs results in
     (automatic?) top-level vdev resizing




    [Diagram: a 10G + 15G mirror provides 10G of space; replacing the 10G disk
    with a 20G disk grows it to a 15G mirror; replacing the 15G disk with a 20G
    disk grows it to a 20G mirror]




June 13, 2009                      © 2009 Richard Elling                              35
RAIDZ
 ●   RAID-5
        –   Parity check data is distributed across the RAID array's disks
        –   Must read/modify/write when data is smaller than stripe width
 ●   RAIDZ
        –   Dynamic data placement
        –   Parity added as needed
        –   Writes are full-stripe writes
        –   No read/modify/write (write hole)
 ●   Arbitration: ZFS does not blindly trust any device
        –   Does not rely on disk reporting read error
        –   Checksums validate data
        –   If checksum fails, read parity

            Space used is dependent on how used
June 13, 2009                       © 2009 Richard Elling                    36
RAID-5 vs RAIDZ

                DiskA   DiskB     DiskC      DiskD      DiskE
                 D0:0    D0:1      D0:2       D0:3       P0
    RAID-5       P1      D1:0      D1:1       D1:2       D1:3
                 D2:3    P2        D2:0       D2:1       D2:2
                 D3:2    D3:3       P3        D3:0       D3:1

                DiskA   DiskB    DiskC       DiskD      DiskE
                 P0     D0:0      D0:1       D0:2       D0:3
    RAIDZ        P1     D1:0      D1:1       P2:0       D2:0
                D2:1    D2:2      D2:3       P2:1       D2:4
                D2:5




June 13, 2009                   © 2009 Richard Elling           37
RAID-5 Write Hole
 ●   Occurs when data to be written is smaller than stripe size
 ●   Must read unallocated columns to recalculate the parity, or
     read/modify/write the parity
 ●   Read/modify/write is risky for consistency
        –   Multiple disks
        –   Reading independently
        –   Writing independently
        –   System failure before all writes are complete to media could result
             in data loss
 ●   Effects can be hidden from host using RAID array with nonvolatile
     write cache, but extra I/O cannot be hidden from disks




June 13, 2009                       © 2009 Richard Elling                    38
RAIDZ2
 ●   RAIDZ2 = double parity RAIDZ
        –   Can recover data if any 2 leaf vdevs fail
 ●   Sorta like RAID-6
        –   Parity 1: XOR
        –   Parity 2: another Reed-Solomon syndrome
 ●   More computationally expensive than RAIDZ
 ●   Arbitration: ZFS does not blindly trust any device
        –   Does not rely on disk reporting read error
        –   Checksums validate data
        –   If data not valid, read parity
        –   If data still not valid, read other parity

            Space used is dependent on how used

June 13, 2009                        © 2009 Richard Elling            39
Evaluating Data Retention

●   MTTDL = Mean Time To Data Loss
●   Note: MTBF is not constant in the real world, but assuming it is keeps the math simple
●   MTTDL[1] is a simple MTTDL model
●   No parity (single vdev, striping, RAID-0)
     –   MTTDL[1] = MTBF / N
●   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     –   MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
●   Double Parity (3-way mirror, RAIDZ2, RAID-6)
     –   MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
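●   Worked example, with assumed numbers (illustrative only): MTBF = 1,000,000
    hours, N = 8 disks, MTTR = 24 hours
     –   No parity: MTTDL[1] = 1,000,000 / 8 = 125,000 hours (~14 years)
     –   Single parity: MTTDL[1] = (10^6)^2 / (8 * 7 * 24) ≈ 7.4 x 10^8 hours
          (~85,000 years)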




June 13, 2009                    © 2009 Richard Elling                    40
Another MTTDL Model
 ●   MTTDL[1] model doesn't take unrecoverable reads into account
 ●   But unrecoverable reads (UER) are becoming the dominant failure
     mode
        –   UER is specified as errors per bits read
        –   More bits = higher probability of loss per vdev
 ●   MTTDL[2] model considers UER




June 13, 2009                      © 2009 Richard Elling               41
Why Worry about UER?

●   Richard's study
     –   3,684 hosts with 12,204 LUNs
     –   11.5% of all LUNs reported read errors
●   Bairavasundaram et al., FAST08
    www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
     –   1.53M LUNs over 41 months
     –   RAID reconstruction discovers 8% of checksum mismatches
     –   4% of disks studied developed checksum errors over 17 months




June 13, 2009                   © 2009 Richard Elling                   42
Why Worry about UER?

●   RAID array study




June 13, 2009            © 2009 Richard Elling   43
MTTDL[2] Model

●   Probability that a reconstruction will fail
     –   Precon_fail = (N-1) * size / UER
●   Model doesn't work for non-parity schemes (single vdev, striping,
    RAID-0)
●   Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
     –   MTTDL[2] = MTBF / (N * Precon_fail)
●   Double Parity (3-way mirror, RAIDZ2, RAID-6)
     –   MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
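●   Worked example, with assumed numbers (illustrative only): N = 8 disks of
    1 TByte (8 x 10^12 bits each), UER = 1 error per 10^14 bits read,
    MTBF = 1,000,000 hours
     –   Precon_fail = 7 * 8 x 10^12 / 10^14 = 0.56
     –   Single parity: MTTDL[2] = 10^6 / (8 * 0.56) ≈ 223,000 hours (~25 years),
          far lower than the corresponding MTTDL[1] value – this is why UER matters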




June 13, 2009                      © 2009 Richard Elling                44
Practical View of MTTDL[1]




June 13, 2009        © 2009 Richard Elling   45
MTTDL Models: Mirror




June 13, 2009    © 2009 Richard Elling   46
MTTDL Models: RAIDZ2




June 13, 2009     © 2009 Richard Elling   47
Ditto Blocks

●   Recall that each blkptr_t contains 3 DVAs
●   Allows up to 3 physical copies of the data



       ZFS copies parameter   Data copies                 Metadata copies
       default                1                           2
       copies=2               2                           3
       copies=3               3                           3




June 13, 2009                     © 2009 Richard Elling                      48
Copies

●   Dataset property used to indicate how many copies (aka ditto blocks)
    of data are desired
     –   Write all copies
     –   Read any copy
     –   Recover corrupted read from a copy
●   By default
     –   data copies=1
     –   metadata copies = data copies + 1, up to a maximum of 3
●   Not a replacement for mirroring
●   Easier to describe in pictures...
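●   Example, with a hypothetical dataset name:

  # zfs set copies=2 tank/important
  # zfs get copies tank/important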



June 13, 2009                     © 2009 Richard Elling                    49
Copies in Pictures




June 13, 2009   © 2009 Richard Elling      50
Copies in Pictures




June 13, 2009   © 2009 Richard Elling      51
ZIO – ZFS I/O Layer


June 13, 2009          © 2009 Richard Elling   52
ZIO Framework
 ●   All physical disk I/O goes through ZIO Framework
 ●   Translates DVAs into Logical Block Address (LBA) on leaf vdevs
        –   Keeps free space maps (spacemap)
        –   If contiguous space is not available:
                ●   Allocate smaller blocks (the gang)
                ●   Allocate gang block, pointing to the gang
 ●   Implemented as multi-stage pipeline
        –   Allows extensions to be added fairly easily
 ●   Handles I/O errors




June 13, 2009                          © 2009 Richard Elling          53
SpaceMap from Space




June 13, 2009    © 2009 Richard Elling   54
ZIO Write Pipeline
ZIO write pipeline stages (gang activity elided, for clarity):

   open
     → compress, if savings > 12.5%
     → encrypt
     → generate checksum
     → allocate DVA
     → vdev I/O: start → done → assess
   done
June 13, 2009                          © 2009 Richard Elling                         55
ZIO Read Pipeline
ZIO read pipeline stages (gang activity elided, for clarity):

   open
     → vdev I/O: start → done → assess
     → verify checksum
     → decrypt
     → decompress
   done
June 13, 2009                      © 2009 Richard Elling                     56
VDEV – Virtual Device Subsystem
 ●   Where mirrors, RAIDZ, and RAIDZ2 are implemented
        –   Surprisingly few lines of code needed to implement RAID
 ●   Leaf vdev (physical device) I/O management
        –   Number of outstanding iops
        –   Read-ahead cache
 ●   Priority scheduling

                                                   Name          Priority
                                                   NOW               0
                                                   SYNC_READ         0
                                                   SYNC_WRITE        0
                                                   FREE              0
                                                   CACHE_FILL        0
                                                   LOG_WRITE         0
                                                   ASYNC_READ        4
                                                   ASYNC_WRITE       4
                                                   RESILVER         10
                                                   SCRUB            20

June 13, 2009                     © 2009 Richard Elling                     57
ARC – Adaptive
                Replacement Cache


June 13, 2009         © 2009 Richard Elling   58
Object Cache
 ●   UFS uses page cache managed by the virtual memory system
 ●   ZFS does not use the page cache, except for mmap'ed files
 ●   ZFS uses an Adaptive Replacement Cache (ARC)
 ●   ARC used by DMU to cache DVA data objects
 ●   Only one ARC per system, but caching policy can be changed on a
     per-dataset basis
 ●   Seems to work much better than page cache ever did for UFS




June 13, 2009                 © 2009 Richard Elling                    59
Traditional Cache

●   Works well when data being accessed was recently added
●   Doesn't work so well when frequently accessed data is evicted
                [Diagram: misses cause an insert at the MRU end of the cache;
                the oldest entry is evicted from the LRU end. Dynamic caches can
                change size by either not evicting or aggressively evicting.]


June 13, 2009                         © 2009 Richard Elling                          60
ARC – Adaptive Replacement
                                          Cache
                [Diagram: two caches share the available size. Misses insert at
                the MRU end of the “recent” cache; a hit moves an entry to the
                MRU end of the “frequent” cache. The oldest single-use entry is
                evicted from the recent cache, and the oldest multiply-accessed
                entry from the frequent cache. Evictions and dynamic resizing
                need to choose the best cache to evict from (shrink).]

June 13, 2009                             © 2009 Richard Elling                               61
ZFS ARC – Adaptive Replacement
            Cache with Locked Pages
                [Diagram: same structure as the ARC above, with two differences –
                locked pages cannot be evicted, and the hit path is qualified by
                whether the hit occurs within 62 ms.]

      ZFS ARC handles mixed-size pages

June 13, 2009                            © 2009 Richard Elling          62
ARC Directory
 ●   Each ARC directory entry contains arc_buf_hdr structs
        –   Info about the entry
        –   Pointer to the entry
 ●   Directory entries have a size of ~200 bytes each
 ●   ZFS block size is dynamic, 512 bytes – 128 kBytes
 ●   Disks are large
 ●   Suppose we use a Seagate LP 2 TByte disk for the L2ARC
        –   Disk has 3,907,029,168 512 byte sectors, guaranteed
        –   Workload uses 8 kByte fixed record size
        –   RAM needed for arc_buf_hdr entries
                ●   Need = 3,907,029,168 * 200 / 16 = 45 GBytes
 ●   Don't underestimate the RAM needed for large L2ARCs


June 13, 2009                         © 2009 Richard Elling               63
L2ARC – Level 2 ARC

●   ARC evictions are sent to the cache vdev
●   ARC directory remains in memory
●   Works well when the cache vdev is optimized for fast reads
     –   lower latency than pool disks
     –   inexpensive way to “increase memory”
●   Content considered volatile, no ZFS data protection allowed
●   Monitor usage with zpool iostat

                [Diagram: data evicted from the ARC is written to one or more
                “cache” vdevs]
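●   Example – adding a cache device to a hypothetical pool named tank
    (device name is illustrative):

  # zpool add tank cache c4t0d0
  # zpool iostat -v tank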



June 13, 2009                    © 2009 Richard Elling               64
ARC Tips
 ●   In general, it seems to work well for most workloads
 ●   ARC size will vary, based on usage
 ●   Internals tracked by kstats in Solaris
        –    Use memory_throttle_count to observe pressure to evict
 ●   Can limit at boot time
        –    Solaris – set zfs:zfs_arc_max in /etc/system
 ●   Performance
        –    Prior to b107, L2ARC fill rate was limited to 8 MBytes/s

            L2ARC keeps its directory in kernel memory
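            Example – capping the ARC at 4 GBytes on Solaris (value is
            illustrative); add to /etc/system and reboot:

               set zfs:zfs_arc_max = 0x100000000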




June 13, 2009                       © 2009 Richard Elling               65
Transactional Object
                      Layer


June 13, 2009          © 2009 Richard Elling   66
                                       Source Code Structure
                 [Source Code Structure diagram repeated – see the earlier
                 “Source Code Structure” slide]

June 13, 2009                                © 2009 Richard Elling                          67
ZAP – ZFS Attribute Processor
●   Module sits on top of DMU
●   Important component for managing everything
●   Operates on ZAP objects
      –   Contain name=value pairs
●   FatZAP
      –   Flexible architecture for storing large numbers of attributes
●   MicroZAP
      –   Lightweight version of fatzap
      –   Uses 1 block
      –   All name=value pairs must fit in block
      –   Names <= 50 chars (including NULL terminator)
      –   Values are type uint64_t

June 13, 2009                      © 2009 Richard Elling                  68
DMU – Data Management Layer

●   Datasets issue transactions to the DMU
●   Transaction-based object model
●   Transactions are
     –   Atomic
     –   Grouped (txg = transaction group)
●   Responsible for on-disk data
●   ZFS Attribute Processor (ZAP)
●   Dataset and Snapshot Layer (DSL)
●   ZFS Intent Log (ZIL)




June 13, 2009                   © 2009 Richard Elling   69
Transaction Engine
 ●   Manages physical I/O
 ●   Transactions grouped into transaction group (txg)
        –    txg updates
        –    All-or-nothing
        –    Commit interval
                ●   Older versions: 5 seconds (zfs_
                ●   Now: 30 seconds max, dynamically scale based on time required
                     to commit txg
 ●   Delay committing data to physical storage
        –    Improves performance
        –    A bad thing for sync workloads – hence the ZFS Intent Log (ZIL)


            30 second delay could impact failure detection time

June 13, 2009                          © 2009 Richard Elling                        70
ZIL – ZFS Intent Log

●   DMU is transactional, and likes to group I/O into transactions for later
    commits, but still needs to handle “write it now” desire of sync writers
     –   NFS
     –   Databases
●   If I/O < 32 kBytes
     –   write it (now) to ZIL (allocated from pool)
     –   write it later as part of the txg commit
●   If I/O > 32 kBytes, write it to pool now
     –   Should be faster for large, sequential writes
●   Never read, except at import (eg reboot), when transactions may
    need to be rolled forward


June 13, 2009                      © 2009 Richard Elling                       71
Separate Logs (slogs)

●   ZIL competes with pool for iops
     –   Applications will wait for sync writes to be on nonvolatile media
     –   Very noticeable on HDD JBODs
●   Put ZIL on separate vdev, outside of pool
     –   ZIL writes tend to be sequential
     –   No competition with pool for iops
     –   Downside: slog device required to be operational at import
●   10x or more performance improvements possible
     –   Better if using write-optimized SSD or non-volatile write cache on
          RAID array
●   Use zilstat to observe ZIL activity
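●   Example – adding a mirrored slog to a hypothetical pool named tank
    (device names are illustrative):

  # zpool add tank log mirror c3t0d0 c3t1d0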


June 13, 2009                     © 2009 Richard Elling                       72
DSL – Dataset and
                 Snapshot Layer


June 13, 2009         © 2009 Richard Elling   73
                                                                 Copy on Write
                 [Copy on Write diagram repeated – see the earlier “Copy on Write”
                 slide: 1. initial block tree, 2. COW some data, 3. COW metadata,
                 4. update uberblocks & free]




June 13, 2009                            © 2009 Richard Elling                                 74
zfs snapshot
 ●   Create a read-only, point-in-time window into the dataset (file system
     or Zvol)
 ●   Computationally free, because of COW architecture
 ●   Very handy feature
        –   Patching/upgrades
        –   Basis for Time Slider
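 ●   Example, with hypothetical dataset names:

  # zfs snapshot tank/home@before-patch
  # zfs snapshot -r tank@nightly          (recursive: snapshot all descendants)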




June 13, 2009                       © 2009 Richard Elling                 75
Snapshot

          Snapshot tree root                           Current tree root




 ●   Create a snapshot by not freeing COWed blocks
 ●   Snapshot creation is fast and easy
 ●   Number of snapshots determined by use – no hardwired limit
 ●   Recursive snapshots also possible

June 13, 2009                  © 2009 Richard Elling                       76
Clones

●   Snapshots are read-only
●   Clones are read-write based upon a snapshot
●   Child depends on parent
     –   Cannot destroy parent without destroying all children
     –   Can promote children to be parents
●   Good ideas
     –   OS upgrades
     –   Change control
     –   Replication
                ●   zones
                ●   virtual disks


June 13, 2009                       © 2009 Richard Elling         77
zfs clone
 ●   Create a read-write file system from a read-only snapshot
 ●   Used extensively for OpenSolaris upgrades


                 [Diagram: several “OS rev1” boot environments, each with an
                 “OS rev1” snapshot; one snapshot is cloned, the clone is
                 upgraded to “OS rev2”, and the boot manager can select either
                 environment]

          Origin snapshot cannot be destroyed, if clone exists
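          Example – the pattern above, using hypothetical boot environment names:

          # zfs snapshot rpool/ROOT/b104@today
          # zfs clone rpool/ROOT/b104@today rpool/ROOT/b105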
June 13, 2009                     © 2009 Richard Elling                    78
zfs promote


                 [Diagram: before – rpool/ROOT/b104 is the parent, with snapshot
                 rpool/ROOT/b104@today and clone rpool/ROOT/b105. After
                 “zfs promote rpool/ROOT/b105” – rpool/ROOT/b105 is the parent,
                 the snapshot becomes rpool/ROOT/b105@today, and rpool/ROOT/b104
                 is now the clone]




June 13, 2009                     © 2009 Richard Elling                 79
zfs rollback


                 [Diagram: rpool/ROOT/b104 with snapshot rpool/ROOT/b104@today;
                 “zfs rollback” returns the file system to the state captured in
                 the snapshot]
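          Example, using the dataset names from the diagram:

          # zfs rollback rpool/ROOT/b104@today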




June 13, 2009            © 2009 Richard Elling                           80
Commands


June 13, 2009     © 2009 Richard Elling   81
zpool(1m)

   [Layer View diagram repeated – see “Layer View” above]




June 13, 2009                       © 2009 Richard Elling                                82
Dataset & Snapshot Layer
 ●   Object
        –   Allocated storage
        –   dnode describes collection of blocks
 ●   Object Set
        –   Group of related objects
 ●   Dataset
        –   Snapmap: snapshot relationships
        –   Space usage
 ●   Dataset directory
        –   Childmap: dataset relationships
        –   Properties

                 [Diagram: a dataset directory contains a childmap, properties,
                 and a dataset; the dataset contains a snapmap and an object set;
                 the object set contains objects]
June 13, 2009                     © 2009 Richard Elling                               83
zpool create
 ●   zpool create poolname vdev-configuration
        –    vdev-configuration examples
                ●   mirror c0t0d0 c3t6d0
                ●   mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
                ●   mirror disk1s0 disk2s0 cache disk4s0 log disk5
                ●   raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
 ●   Solaris
        –    Additional checks to see if disk/slice overlaps or is currently in use
        –    Whole disks are given EFI labels
 ●   Can set initial pool or dataset properties
 ●   By default, creates a file system with the same name
        –    poolname pool → /poolname file system
            People get confused by a file system with same
            name as the pool
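  Example – creating a mirrored pool, with hypothetical device names:

  # zpool create tank mirror c0t0d0 c1t0d0
  # zfs list tank                  (the tank file system was created automatically)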
June 13, 2009                          © 2009 Richard Elling                     84
zpool destroy
 ●   Destroy the pool and all datasets therein
 ●   zpool destroy poolname
 ●   Can (try to) force with “-f”
 ●   There is no “are you sure?” prompt – if you weren't sure, you would
     not have typed “destroy”




          zpool destroy is destructive... really! Use with caution!
June 13, 2009                       © 2009 Richard Elling                  85
zpool add
 ●   Adds a device to the pool as a top-level vdev
 ●   zpool add poolname vdev-configuration
 ●   vdev-configuration can be any combination also used for zpool create
 ●   Complains if the added vdev-configuration would cause a different
     data protection scheme than is already in use – use “-f” to override
 ●   Good idea: try with “-n” flag first – will show final configuration without
     actually performing the add




          Do not add a device which is in use as a quorum device
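  Example – adding a second mirror to a hypothetical pool named tank:

  # zpool add -n tank mirror c2t0d0 c2t1d0     (dry run)
  # zpool add tank mirror c2t0d0 c2t1d0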
June 13, 2009                    © 2009 Richard Elling                         86
zpool remove
 ●   Remove a top-level vdev from the pool
 ●   zpool remove poolname vdev
 ●   Today, you can only remove the following vdevs:
        –    cache
        –    hot spare
 ●   An RFE is open to allow removal of other top-level vdevs




            Don't confuse “remove” with “detach”
June 13, 2009                      © 2009 Richard Elling              87
zpool attach
 ●   Attach a vdev as a mirror to an existing vdev
 ●   zpool attach poolname existing-vdev vdev
 ●   Attaching vdev must be the same size or larger than the existing vdev
 ●   Note: today this is not available for RAIDZ or RAIDZ2 vdevs

                             vdev Configurations
                        ok   simple vdev → mirror
                        ok   mirror
                        ok   log → mirrored log
                       no    RAIDZ
                       no    RAIDZ2



          “Same size” literally means the same number of blocks.
          Beware that many “same size” disks have different
          number of available blocks.
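  Example – converting a single-disk vdev into a 2-way mirror (hypothetical names):

  # zpool attach tank c0t0d0 c1t0d0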
June 13, 2009                     © 2009 Richard Elling                 88
zpool detach
 ●   Detach a vdev from a mirror
 ●   zpool detach poolname vdev
 ●   A resilvering vdev will wait until resilvering is complete




June 13, 2009                    © 2009 Richard Elling              89
zpool replace
 ●   Replaces an existing vdev with a new vdev
 ●   zpool replace poolname existing-vdev vdev
 ●   Effectively, a shorthand for “zpool attach” followed by “zpool detach”
 ●   Attaching vdev must be the same size or larger than the existing vdev
 ●   Works for any top-level vdev-configuration, including RAIDZ and
     RAIDZ2
                            vdev Configurations
                       ok   simple vdev
                       ok   mirror
                       ok   log
                       ok   RAIDZ
                       ok   RAIDZ2

          “Same size” literally means the same number of blocks.
          Beware that many “same size” disks have different
          number of available blocks.
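  Example, with hypothetical names:

  # zpool replace tank c1t3d0 c2t3d0      (replace with a different disk)
  # zpool replace tank c1t3d0             (same slot, after swapping the physical disk)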
June 13, 2009                        © 2009 Richard Elling                    90
zpool import
 ●   Import a pool and mount all mountable datasets
 ●   Import a specific pool
      – zpool import poolname
      – zpool import GUID
 ●   Scan LUNs for pools which may be imported
      – zpool import
 ●   Can set options, such as alternate root directory or other properties




          Beware of zpool.cache interactions

          Beware of artifacts, especially partial artifacts
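  Example:

  # zpool import                  (scan for importable pools)
  # zpool import tank             (import the pool named tank)
  # zpool import -R /a tank       (import with an alternate root)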

June 13, 2009                      © 2009 Richard Elling                     91
zpool export
 ●   Unmount datasets and export the pool
 ●   zpool export poolname
 ●   Removes pool entry from zpool.cache




June 13, 2009                 © 2009 Richard Elling              92
zpool upgrade
 ●   Display current versions
      – zpool upgrade
 ●   View available upgrade versions, with features, but don't actually
     upgrade
      – zpool upgrade -v
 ●   Upgrade pool to latest version
      – zpool upgrade poolname
 ●   Upgrade pool to specific version
      – zpool upgrade -V version poolname




          Once you upgrade, there is no downgrade

June 13, 2009                   © 2009 Richard Elling                     93
zpool history
 ●   Show history of changes made to the pool

  # zpool history rpool
  History for 'rpool':
  2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o
  cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
  2009-03-04.07:29:47 zfs set canmount=noauto rpool
  2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
  2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
  2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
  2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
  2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
  2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
  2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
  2009-03-04.07:29:51 zfs set canmount=on rpool
  2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
  2009-03-04.07:29:51 zfs create rpool/export/home
  2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
  2009-03-04.00:21:42 zpool export rpool
  2009-03-04.08:47:08 zpool set bootfs=rpool rpool
  2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
  2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
  2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
  ...




June 13, 2009                    © 2009 Richard Elling                        94
zpool status
 ●   Shows the status of the current pools, including their configuration
 ●   Important troubleshooting step

 # zpool status
 …
   pool: stuff
   state: ONLINE
 status: The pool is formatted using an older on-disk format. The pool can
          still be used, but some features are unavailable.
 action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
          pool will no longer be accessible on older software versions.
   scrub: none requested
 config:

                NAME           STATE    READ WRITE CKSUM
                stuff          ONLINE      0     0     0
                  mirror       ONLINE      0     0     0
                    c0t2d0s0   ONLINE      0     0     0
                    c0t0d0s7   ONLINE      0     0     0
 errors: No known data errors



          Understanding status output error messages can be tricky

June 13, 2009                           © 2009 Richard Elling                95
zpool clear
 ●   Clears device errors
 ●   Clears device error counters
 ●   Improves sysadmin sanity and reduces sweating




June 13, 2009                  © 2009 Richard Elling             96
zpool iostat
 ●   Show pool physical I/O activity, in an iostat-like manner
 ●   Solaris: fsstat will show I/O activity looking into a ZFS file system
 ●   Especially useful for showing slog activity

                # zpool iostat -v
                                  capacity       operations        bandwidth
                pool            used avail      read write        read write
                ------------   ----- -----     ----- -----       ----- -----
                rpool          16.5G   131G        0      0      1.16K 2.80K
                  c0t0d0s0     16.5G   131G        0      0      1.16K 2.80K
                ------------   ----- -----     ----- -----       ----- -----
                stuff           135G 14.4G         0      5      2.09K 27.3K
                  mirror        135G 14.4G         0      5      2.09K 27.3K
                    c0t2d0s0       -       -       0      3      1.25K 27.5K
                    c0t0d0s7       -       -       0      2      1.27K 27.5K
                ------------   ----- -----     ----- -----       ----- -----




          Unlike iostat, does not show latency

June 13, 2009                            © 2009 Richard Elling                   97
zpool scrub
 ●   Manually starts scrub
      – zpool scrub poolname
 ●   Scrubbing performed in background
 ●   Use zpool status to track scrub progress
 ●   Stop scrub
      – zpool scrub -s poolname




          Estimated scrub completion time improves over time

June 13, 2009                   © 2009 Richard Elling                 98
zfs(1m)
 ●    Manages file systems (ZPL) and Zvols
 ●    Can proxy to other, related commands
        –    iSCSI, NFS, CIFS




   [Layer View diagram repeated – see “Layer View” above]




June 13, 2009                         © 2009 Richard Elling                           99
zfs create, destroy

●   By default, a file system with the same name as the pool is created by
    zpool create
●   Name format is: pool/name[/name ...]
●   File system
     –   zfs create fs-name
     –   zfs destroy fs-name
●   Zvol
     –   zfs create -V size vol-name
     –   zfs destroy vol-name
●   Parameters can be set at create time
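●   Example, with hypothetical names:

  # zfs create -o compression=on tank/home
  # zfs create -V 10g tank/vol1            (a 10 GByte Zvol)
  # zfs destroy tank/vol1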



June 13, 2009                    © 2009 Richard Elling                   100
zfs mount, unmount

●   Note: mount point is a file system parameter
     –   zfs get mountpoint fs-name
●   Rarely used subcommand (!)
●   Display mounted file systems
     –   zfs mount
●   Mount a file system
     –   zfs mount fs-name
     –   zfs mount -a
●   Unmount
     –   zfs unmount fs-name
     –   zfs unmount -a

June 13, 2009                   © 2009 Richard Elling   101
zfs list
 ●   List mounted datasets
 ●   Old versions: listed everything
 ●   New versions: do not list snapshots
 ●   Examples
        –   zfs list
        –   zfs list -t snapshot
        –   zfs list -H -o name




June 13, 2009                      © 2009 Richard Elling          102
zfs send, receive
 ●   Send
        –   send a snapshot to stdout
        –   data is decompressed
 ●   Receive
        –   receive a snapshot from stdin
        –   receiving file system parameters apply (compression, etc.)
 ●   Can incrementally send snapshots in time order
 ●   Handy way to replicate dataset snapshots
 ●   NOT a replacement for traditional backup solutions
        –   All-or-nothing design per snapshot
        –   In general, does not send files (!)
        –   Today, no per-file management

         Send streams from b35 (or older) no longer supported after b89
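         Example – replicating snapshots to another host (hypothetical names):

         # zfs send tank/home@snap1 | ssh host2 zfs receive backup/home
         # zfs send -i tank/home@snap1 tank/home@snap2 | ssh host2 zfs receive backup/home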
June 13, 2009                       © 2009 Richard Elling                 103
zfs rename
 ●   Renames a file system, volume, or snapshot
        –   zfs rename export/home/relling export/home/richard




June 13, 2009                 © 2009 Richard Elling              104
zfs upgrade
 ●   Display current versions
      – zfs upgrade
 ●   View available upgrade versions, with features, but don't actually
     upgrade
      – zfs upgrade -v
 ●   Upgrade dataset to latest version
      – zfs upgrade dataset
 ●   Upgrade dataset to specific version
      – zfs upgrade -V version dataset




          Once you upgrade, there is no downgrade

June 13, 2009                   © 2009 Richard Elling                     105
Sharing


June 13, 2009    © 2009 Richard Elling   106
Sharing

●   zfs share dataset
●   Type of sharing set by parameters
     –   shareiscsi = [on | off]
     –   sharenfs = [on | off | options]
     –   sharesmb = [on | off | options]
●   Shortcut to manage sharing
     –   Uses external services (nfsd, COMSTAR, etc)
     –   Importing pool will also share
●   May vary by OS
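●   Example, with hypothetical dataset names:

  # zfs set sharenfs=on tank/home
  # zfs set sharesmb=on tank/shared
  # zfs share -a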




June 13, 2009                      © 2009 Richard Elling         107
NFS
 ●   ZFS file systems work as expected
        –   use ACLs based on NFSv4 ACLs
 ●   Parallel NFS, aka pNFS, aka NFSv4.1
        –   Still a work-in-progress
        –   http://opensolaris.org/os/project/nfsv41/
        –   zfs create -t pnfsdata mypnfsdata

                [Diagram: a pNFS client talks to a pNFS metadata server and to
                pNFS data servers; each data server stores its data in a
                pnfsdata dataset in its own pool]




June 13, 2009                       © 2009 Richard Elling                      108
CIFS

●   UID mapping
●   casesensitivity parameter
     –   Good idea, set when file system is created
     –   zfs create -o casesensitivity=insensitive mypool/Shared
●   Shadow Copies for Shared Folders (VSS) supported
     –   CIFS clients cannot create shadow remotely (yet)




                    CIFS features vary by OS, Samba, etc.


June 13, 2009                    © 2009 Richard Elling                109
iSCSI
 ●   SCSI over IP
 ●   Block-level protocol
 ●   Uses Zvols as storage
 ●   Solaris has 2 iSCSI target implementations
        –   shareiscsi enables old, klunky iSCSI target
        –   To use COMSTAR, enable using itadm(1m)
        –   b116 adds COMSTAR support (zpool version 16)
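 ●   Example – exporting a Zvol with the older target (hypothetical names):

  # zfs create -V 20g tank/iscsivol
  # zfs set shareiscsi=on tank/iscsivol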




June 13, 2009                     © 2009 Richard Elling        110
Properties


June 13, 2009     © 2009 Richard Elling   111
Properties
 ●   Properties are stored in an nvlist
 ●   By default, are inherited
 ●   Some properties are common to all datasets, but a specific dataset
     type may have additional properties
 ●   Easily set or retrieved via scripts
 ●   In general, properties affect future file system activity




          zpool get doesn't script as nicely as zfs get




June 13, 2009                      © 2009 Richard Elling                  112
User-defined Properties
 ●   Names
        –   Must include colon ':'
        –   Can contain lower case alphanumerics or “+” “.” “_”
        –   Max length = 256 characters
        –   By convention, module:property
                ●   com.sun:auto-snapshot
 ●   Values
        –   Max length = 1024 characters
 ●   Examples
        –   com.sun:auto-snapshot=true
        –   com.richardelling:important_files=true




June 13, 2009                        © 2009 Richard Elling        113
set & get properties
 ●   Set
        –   zfs set compression=on export/home/relling
 ●   Get
        –   zfs get compression export/home/relling
 ●   Reset to inherited value
      – zfs inherit compression export/home/relling
 ●   Clear user-defined parameter
      – zfs inherit com.sun:auto-snapshot
           export/home/relling




June 13, 2009               © 2009 Richard Elling        114
Pool Properties
       Property   Change?                          Brief Description
  altroot                    Alternate root directory (ala chroot)
  autoreplace                vdev replacement policy
  available       readonly   Available storage space
  bootfs                     Default bootable dataset for root pool
  cachefile                  Cache file to use other than /etc/zfs/zpool.cache
  capacity        readonly   Percent of pool space used
  delegation                 Master pool delegation switch
  failmode                   Catastrophic pool failure policy
  guid            readonly   Unique identifier
  health          readonly   Current health of the pool
  listsnapshots              zfs list policy
  size            readonly   Total size of pool
  used            readonly   Amount of space used
  version         readonly   Current on-disk version

June 13, 2009                     © 2009 Richard Elling                          115
Common Dataset Properties
       Property    Change?                          Brief Description
  available        readonly   Space available to dataset & children
  checksum                    Checksum algorithm
  compression                 Compression algorithm
  compressratio    readonly   Compression ratio – logical size:referenced physical
  copies                      Number of copies of user data
  creation         readonly   Dataset creation time
  origin           readonly   For clones, origin snapshot
  primarycache                ARC caching policy
  readonly                    Is dataset in readonly mode?
  referenced       readonly   Size of data accessible by this dataset
  refreservation              Max space guaranteed to a dataset, not including descendants (snapshots & clones)
  reservation                 Minimum space guaranteed to dataset, including descendants


June 13, 2009                      © 2009 Richard Elling                             116
Common Dataset Properties
          Property       Change?                        Brief Description
  secondarycache                    L2ARC caching policy
  type                   readonly   Type of dataset (filesystem, snapshot, volume)
  used                   readonly   Sum of usedby* (see below)
  usedbychildren         readonly   Space used by descendants
  usedbydataset          readonly   Space used by dataset
  usedbyrefreservation   readonly   Space used by a refreservation for this dataset
  usedbysnapshots        readonly   Space used by all snapshots of this dataset
  zoned                  readonly   Is dataset added to non-global zone (Solaris)




June 13, 2009                       © 2009 Richard Elling                             117
File System Dataset Properties
      Property     Change?                         Brief Description
 aclinherit                   ACL inheritance policy, when files or directories are created
 aclmode                      ACL modification policy, when chmod is used
 atime                        Disable access time metadata updates
 canmount                     Mount policy
 casesensitivity   creation   Filename matching algorithm
 devices                      Device opening policy for dataset
 exec                         File execution policy for dataset
 mounted           readonly   Is file system currently mounted?
 nbmand            export/import   File system should be mounted with non-blocking mandatory locks (CIFS client feature)
 normalization     creation   Unicode normalization of file names for matching




June 13, 2009                      © 2009 Richard Elling                              118
File System Dataset Properties
      Property    Change?                        Brief Description
 quota                      Max space dataset and descendants can consume
 recordsize                 Suggested maximum block size for files
 refquota                    Max space dataset can consume, not including descendants
 setuid                     setuid mode policy
 sharenfs                   NFS sharing options
 sharesmb                   CIFS sharing options
 snapdir                    Controls whether .zfs directory is hidden
 utf8only                   UTF-8 character file name policy
 vscan                      Virus scan enabled
 xattr                      Extended attributes policy




June 13, 2009                    © 2009 Richard Elling                      119
More Goodies...


June 13, 2009        © 2009 Richard Elling   120
Dataset Space Accounting
 ●   used = usedbydataset + usedbychildren + usedbysnapshots +
     usedbyrefreservation
 ●   Lazy updates, may not be correct until txg commits
 ●   ls and du will show size of allocated files which includes all copies of
     a file
 ●   Shorthand report available

  $ zfs list -o space
  NAME                  AVAIL    USED   USEDSNAP       USEDDS   USEDREFRESERV   USEDCHILD
  rpool                  126G   18.3G          0        35.5K               0       18.3G
  rpool/ROOT             126G   15.3G          0          18K               0       15.3G
  rpool/ROOT/snv_106     126G   86.1M          0        86.1M               0           0
  rpool/ROOT/snv_b108    126G   15.2G      5.89G        9.28G               0           0
  rpool/dump             126G   1.00G          0        1.00G               0           0
  rpool/export           126G     37K          0          19K               0         18K
  rpool/export/home      126G     18K          0          18K               0           0
  rpool/swap             128G      2G          0         193M           1.81G           0




June 13, 2009                      © 2009 Richard Elling                               121
zfs vs zpool Space Accounting
 ●   zfs list != zpool list
 ●   zfs list shows space used by the dataset plus space for internal
     accounting
 ●   zpool list shows physical space available to the pool
 ●   For simple pools and mirrors, they are nearly the same
 ●   For RAIDZ or RAIDZ2, zpool list includes the space consumed by parity, so it reports more than zfs list



          Users will be confused about reported space available




June 13, 2009                    © 2009 Richard Elling                      122
Testing

●   ztest
●   fstest




June 13, 2009   © 2009 Richard Elling         123
Accessing Snapshots
 ●   By default, snapshots are accessible in .zfs directory
 ●   Visibility of .zfs directory is tunable via snapdir property
        –   Don't really want find to find the .zfs directory
 ●   Windows CIFS clients can see snapshots as Shadow Copies for
     Shared Folders (VSS)


                # zfs snapshot rpool/export/home/relling@20090415
                # ls -a /export/home/relling
                …
                .Xsession
                .xsession-errors
                # ls /export/home/relling/.zfs
                shares    snapshot
                # ls /export/home/relling/.zfs/snapshot
                20090415
                # ls /export/home/relling/.zfs/snapshot/20090415
                Desktop Documents Downloads Public




June 13, 2009                       © 2009 Richard Elling           124
Resilver & Scrub
 ●   Can be read iops bound
 ●   Resilver can also be bandwidth bound to the resilvering device
 ●   Both work at lower I/O scheduling priority than normal work, but that
     may not matter for read iops bound devices
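
     For example, a scrub is started and its progress checked with (pool name hypothetical):
        # zpool scrub mypool
        # zpool status mypool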




June 13, 2009                   © 2009 Richard Elling                    125
Time-based Resilvering
 ●   Block pointers contain birth txg
     number
 ●   Resilvering begins with oldest blocks first
 ●   Interrupted resilver will still result in a valid file system view

     [Diagram: block tree with blocks labeled by their birth txg (27, 68, 73);
      resilvering walks the tree from the oldest birth txg to the newest]


June 13, 2009                    © 2009 Richard Elling                              126
Time Slider – Automatic Snapshots
●   Underpinnings for Solaris feature similar to OSX's Time Machine
●   SMF service for managing snapshots
●   SMF properties used to specify policies
      –   Frequency
      –   Number to keep
●   Creates cron jobs
●   GUI tool makes it easy to select individual file systems

                 Service Name         Interval (default)        Keep (default)
           auto-snapshot:frequent    15 minutes             4
           auto-snapshot:hourly      1 hour                 24
           auto-snapshot:daily       1 day                  31
           auto-snapshot:weekly      7 days                 4
           auto-snapshot:monthly     1 month                12
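
            As a sketch, the individual instances are enabled like any other SMF service
            (full FMRI assumed to be svc:/system/filesystem/zfs/auto-snapshot on builds
            with Time Slider):
               # svcadm enable auto-snapshot:frequent
               # svcs -a | grep auto-snapshot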


June 13, 2009                       © 2009 Richard Elling                        127
Nautilus

●   File system views which can go back in time




June 13, 2009                  © 2009 Richard Elling          128
ACL – Access Control List
 ●   Based on NFSv4 ACLs
 ●   Similar to Windows NT ACLs
 ●   Works well with CIFS services
 ●   Supports ACL inheritance
 ●   Change using chmod
 ●   View using ls
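
     A hedged example using Solaris NFSv4 ACL syntax (user and file names are hypothetical):
        # chmod A+user:fred:read_data/write_data:allow /tank/docs/plan.txt
        # ls -V /tank/docs/plan.txt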




June 13, 2009                   © 2009 Richard Elling   129
Checksums
●   Block pointer contains 256 bits for the checksum
●   Checksum is in the parent, not in the block itself
●   Types
      –   none
      –   fletcher2: truncated Fletcher algorithm
      –   fletcher4: full Fletcher algorithm
      –   SHA-256
●   There are open proposals for better algorithms
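
    The algorithm used for data is selected per dataset via the checksum property, e.g.
    (dataset names hypothetical):
       # zfs set checksum=sha256 mypool/important
       # zfs set checksum=fletcher4 mypool/scratch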




June 13, 2009                      © 2009 Richard Elling           130
Checksum Use


                Use         Algorithm                       Notes
         Uberblock    SHA-256                     self-checksummed
         Metadata     fletcher4
         Labels       SHA-256
          Data         fletcher2 (default)         zfs checksum parameter
         ZIL log      fletcher2                   self-checksummed
         Gang block   SHA-256                     self-checksummed




June 13, 2009                     © 2009 Richard Elling                       131
Checksum Performance
 ●   Metadata – you won't notice
 ●   Data
        –   LZJB is barely noticeable
        –   gzip-9 can be very noticeable
 ●   Geriatric hardware ???




June 13, 2009                     © 2009 Richard Elling   132
Compression

●   Builtin
     –   lzjb, Lempel-Ziv by Jeff Bonwick
     –   gzip, levels 1-9
●   Extensible
     –   new compressors can be added
     –   backwards compatibility issues
●   Uses taskqs to take advantage of multi-processor systems
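
     For example (dataset names hypothetical):
        # zfs set compression=on mypool/home          (uses lzjb)
        # zfs set compression=gzip-9 mypool/archive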



          Cannot boot from gzip compressed root (RFE is open)



June 13, 2009                    © 2009 Richard Elling             133
Encryption
 ●   Placeholder – details TBD
 ●   http://opensolaris.org/os/project/zfs-crypto
 ●   Complicated by:
        –   Block pointer rewrites
        –   Deduplication




June 13, 2009                        © 2009 Richard Elling            134
Impedance Matching
 ●   RAID arrays & columns
 ●   Label offsets
        –   Older Solaris starting block = 34
        –   Newer Solaris starting block = 256




June 13, 2009                      © 2009 Richard Elling   135
Quotas

●   File system quotas
     –   quota includes descendants (snapshots, clones)
     –   refquota does not include descendants
●   User and group quotas (b114)
     –   Works like refquota, descendants don't count
     –   Not inherited
     –   zfs userspace and groupspace subcommands show quotas
                ●   Users can only see their own user and group quotas, unless permission is delegated
     –   Managed via properties
                ●   [user|group]quota@[UID|username|SID name|SID number]
                ●   not visible via zfs get all
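
      A sketch of setting and inspecting a user quota (names and size hypothetical):
         # zfs set userquota@relling=10G mypool/home
         # zfs get userquota@relling mypool/home
         # zfs userspace mypool/home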


June 13, 2009                               © 2009 Richard Elling                    136
zpool.cache
●   Old way
      –   mount /
      –   read /etc/[v]fstab
      –   mount file systems
●   ZFS
      –   import pool(s)
      –   find mountable datasets and mount them
●   /etc/zfs/zpool.cache is a cache of pools to be imported at boot time
      –   No scanning of all available LUNs for pools to import
      –   cachefile property permits selecting an alternate zpool.cache
                ●   Useful for OS installers
                ●   Useful for clusters, where you don't want a booting node to
                     automatically import a pool
                ●   Not persistent (!)
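
      For example, a cluster node can keep a pool out of the default cache so it is not
      imported automatically at boot (device and pool names hypothetical):
         # zpool create -o cachefile=none mypool mirror c1t0d0 c1t1d0
         # zpool import -o cachefile=none mypool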
June 13, 2009                            © 2009 Richard Elling                    137
Mounting ZFS File Systems
 ●   By default, mountable file systems are mounted when the pool is
     imported
        –    Controlled by canmount policy (not inherited)
                ●   on – (default) file system is mountable
                ●   off – file system is not mountable
                        –   if you want children to be mountable, but not the parent
                ●   noauto – file system must be explicitly mounted (boot environment)
 ●   Can zfs set mountpoint=legacy to use /etc/[v]fstab
 ●   By default, cannot mount on top of non-empty directory
        –    Can override explicitly using zfs mount -O or legacy
              mountpoint
 ●   Mount properties are persistent, use zfs mount -o for temporary
     changes
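
     A sketch of the legacy case (names hypothetical): set the mountpoint to legacy,
     then add a line to /etc/vfstab:
        # zfs set mountpoint=legacy mypool/legacyfs
        mypool/legacyfs  -  /mnt/legacy  zfs  -  yes  -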

            Imports are done in parallel, beware of mountpoint races
June 13, 2009                              © 2009 Richard Elling                       138
recordsize

●   Dynamic
     –   Max 128 kBytes
     –   Min 512 Bytes
     –   Power of 2
●   For most workloads, don't worry about it
●   For fixed size workloads, can set to match workloads
     –   Databases
●   File systems or Zvols
●   zfs set recordsize=8k dataset




June 13, 2009                   © 2009 Richard Elling           139
Delegated Administration

●   Fine grain control
     –   users or groups of users
     –   subcommands, parameters, or sets
●   Similar to Solaris' Role Based Access Control (RBAC)
●   Enable/disable at the pool level
     –   zpool set delegation=on mypool (default)
●   Allow/unallow at the dataset level
     –   zfs allow relling snapshot mypool/relling
     –   zfs allow @backupusers snapshot,send mypool/relling
     –   zfs allow mypool/relling



June 13, 2009                       © 2009 Richard Elling      140
Delegation Inheritance
     Beware of inheritance

 ●   Local only
        –    zfs allow -l relling snapshot mypool
 ●   Descendants only
        –    zfs allow -d relling mount mypool
 ●   Local + descendants is the default when neither -l nor -d is given




            Make sure permissions are set at the correct level




June 13, 2009                       © 2009 Richard Elling        141
Delegatable Subcommands
 ●   allow                          ●   receive
 ●   clone                          ●   rename
 ●   create                         ●   rollback
 ●   destroy                        ●   send
 ●   groupquota                     ●   share
 ●   groupused                      ●   snapshot
 ●   mount                          ●   userquota
 ●   promote                        ●   userused




June 13, 2009          © 2009 Richard Elling        142
Delegatable Parameters

●   aclinherit        ●   nbmand                     ●   sharenfs
●   aclmode           ●   normalization              ●   sharesmb
●   atime             ●   quota                      ●   snapdir
●   canmount          ●   readonly                   ●   userprop
●   casesensitivity   ●   recordsize                 ●   utf8only
●   checksum          ●   refquota                   ●   version
●   compression       ●   refreservation             ●   volsize
●   copies            ●   reservation                ●   vscan
●   devices           ●   setuid                     ●   xattr
●   exec              ●   shareiscsi                 ●   zoned
●   mountpoint


June 13, 2009                © 2009 Richard Elling                  143
Browser User Interface
 ●   Solaris – WebConsole
 ●   Nexenta -
 ●   OSX -
 ●   OpenStorage -




June 13, 2009               © 2009 Richard Elling   144
Solaris WebConsole




June 13, 2009   © 2009 Richard Elling   145
Solaris WebConsole




June 13, 2009   © 2009 Richard Elling   146
Solaris Swap and Dump
 ●   Swap
        –   Solaris does not have automatic swap resizing
        –   Swap as a separate dataset
        –   Swap device is raw, with a refreservation
        –   Blocksize matched to pagesize
        –   Don't really need or want snapshots or clones
        –   Can resize while online, manually
 ●   Dump
        –   Only used during crash dump
        –   Preallocated
        –   No refreservation
        –   Checksum off
        –   Compression off (dumps are already compressed)
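
     A sketch of the manual online swap resize mentioned above (size hypothetical):
        # swap -d /dev/zvol/dsk/rpool/swap
        # zfs set volsize=4G rpool/swap
        # swap -a /dev/zvol/dsk/rpool/swap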

June 13, 2009                     © 2009 Richard Elling      147
Performance



June 13, 2009      © 2009 Richard Elling   148
General Comments
 ●   In general, performs well out of the box
 ●   Standard performance improvement techniques apply
 ●   Lots of DTrace knowledge available
 ●   Typical areas of concern:
        –   ZIL
                  ●   check with zilstat, improve with slogs
        –   COW “fragmentation”
                  ●   check iostat, improve with L2ARC
        –   Memory consumption
                  ●   check with arcstat
                  ●   set primarycache property
                  ●   can be capped
                  ●   can compete with large page aware apps
        –   Compression, or lack thereof
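
     For example, the ARC can be capped with an /etc/system entry (value hypothetical,
     takes effect at the next boot):
        * cap the ZFS ARC at 4 GBytes
        set zfs:zfs_arc_max = 0x100000000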
June 13, 2009                              © 2009 Richard Elling   149
ZIL Performance
 ●   Big performance increases demonstrated, especially with SSDs
 ●   NFS servers
        –   32kByte threshold (zfs_immediate_write_sz) also corresponds to
             NFSv3 write size
                ●   May cause more work than needed
                ●   See CR6686887
 ●   Databases
        –   May want different sync policies for logs and data
        –   Current ZIL is pool-wide and enabled for all sync writes
        –   CR6832481 proposes a separate intent log bypass property on a
             per-dataset basis
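
     A separate log device (slog) is added with, for example (device names hypothetical):
        # zpool add mypool log c4t2d0
        # zpool add mypool log mirror c4t2d0 c4t3d0      (mirrored slog variant)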




June 13, 2009                        © 2009 Richard Elling               150
vdev Cache
 ●   vdev cache occurs at the SPA level
        –   readahead
        –   10 MBytes per vdev
        –   only caches metadata (as of b70)
 ●   Stats collected as Solaris kstats



            # kstat -n vdev_cache_stats
            module: zfs                                     instance: 0
            name:   vdev_cache_stats                        class:    misc
                    crtime                                  38.83342625
                    delegations                             14030
                    hits                                    105169
                    misses                                  59452
                    snaptime                                4564628.18130739


                              Hit rate = 59%, not bad...


June 13, 2009                       © 2009 Richard Elling                      151
Intelligent Prefetching
 ●   Intelligent file-level prefetching occurs at the DMU level
 ●   Feeds the ARC
 ●   In a nutshell, prefetch hits cause more prefetching
        –   Read a block, prefetch a block
        –   If we used the prefetched block, read 2 more blocks
        –   Up to 256 blocks
 ●   Recognizes strided reads
        –   2 sequential reads of same length and a fixed distance will be
              coalesced
 ●   Fetches backwards
 ●   Seems to work pretty well, as-is, for most workloads
 ●   Easy to disable in mdb for testing on Solaris
      – echo zfs_prefetch_disable/W0t1 | mdb -kw


June 13, 2009                     © 2009 Richard Elling                      152
I/O Queues
 ●   By default, for devices which can support it, 35 iops are queued to
     each vdev
        –   Tunable with zfs_vdev_max_pending
        –   echo zfs_vdev_max_pending/W0t10 | mdb -kw
 ●   Implies that more vdevs is better
        –   Consider avoiding RAID array with a single, large LUN
 ●   ZFS I/O scheduler loses control once iops are queued
        –   CR6471212 proposes reserved slots for high-priority iops
 ●   May need to match queues for the entire data path
      – zfs_vdev_max_pending
        –   Fibre channel, SCSI, SAS, SATA driver
        –   RAID array controller
 ●   Fast disks → small queues, slow disks → larger queues

June 13, 2009                       © 2009 Richard Elling                  153
COW Penalty
 ●   COW can negatively affect workloads which have updates and
     sequential reads
        –   Initial writes will be sequential
        –   Updates (writes) will cause seeks to read data
 ●   Lots of people seem to worry a lot about this
 ●   Only affects HDDs
 ●   Very difficult to speculate about the impact on real-world apps
        –   Large sequential scans of random data hurt anyway
        –   Reads are cached in many places in the data path
 ●   Sysbench benchmark used to test on MySQL w/InnoDB engine
        –   One hour read/write test
        –   select count(*)
        –   repeat, for a week

June 13, 2009                        © 2009 Richard Elling             154
COW Penalty




                Performance seems to level at about 25% penalty

 Results compliments of Allan Packer & Neelakanth Nadgir
 http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf
June 13, 2009                            © 2009 Richard Elling                155
About Disks...
 ●    Disks still the most important performance bottleneck
        –    Modern processors are multi-core
        –    Default checksums and compression are computationally efficient
       Disk      Size   RPM    Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
      HDD       2.5”   5,400      500                   5.5               11
      HDD       3.5”   5,900     2,000                  5.1               16
      HDD       3.5”   7,200     1,500                  4.2             8 - 8.5
      HDD       2.5”   10,000     300                    3             4.2 - 4.6
      HDD       2.5”   15,000     146                    2             3.2 - 3.5
  SSD (w)       2.5”    N/A       73                     0           0.02 - 0.15
     SSD (r)    2.5”    N/A       500                    0           0.02 - 0.15




June 13, 2009                          © 2009 Richard Elling                          156
DirectIO
 ●   UFS forcedirectio option brought the early 1980s design of UFS up to
     the 1990s
 ●   ZFS designed to run on modern multiprocessors
 ●   Databases or applications which manage their data cache may
     benefit by disabling file system caching
 ●   Expect L2ARC to improve random reads


                    UFS DirectIO                             ZFS
            Unbuffered I/O                  primarycache=metadata
                                            primarycache=none
            Concurrency                     Available at inception
            Improved Async I/O code path    Available at inception
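
     For example, a database that manages its own cache might use (dataset names hypothetical):
        # zfs set primarycache=metadata mypool/oradata
        # zfs set primarycache=none mypool/scratch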




June 13, 2009                        © 2009 Richard Elling                 157
Hybrid Storage Pool

                  [Diagram: the SPA spans three device classes – a separate log device
                   (write-optimized SSD), the Main Pool (HDDs), and an L2ARC cache device
                   (read-optimized SSD)]

                             separate log device     Main Pool       L2ARC cache device
       Size (GBytes)         < 1 GByte               large           big
                Cost         write iops/$            size/$          size/$
         Performance         low-latency writes      -               low-latency reads
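
       As a sketch, the L2ARC cache device is simply added to an existing pool
       (device name hypothetical; adding a slog is shown earlier):
          # zpool add mypool cache c5t0d0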



June 13, 2009                             © 2009 Richard Elling                              158
Future Plans
 ●   Announced enhancements
     OpenSolaris Town Hall 2009.06
        –   de-duplication (see also GreenBytes ZFS+)
        –   user quotas (delivered b114)
        –   access-based enumeration
        –   snapshot reference counting
        –   dynamic LUN expansion (delivering b117?)
 ●   Others
        –   mirror to smaller disk (delivered b117)




June 13, 2009                      © 2009 Richard Elling             159
It's a wrap!



                      Thank You!
                       Questions?
                Richard.Elling@RichardElling.com



June 13, 2009              © 2009 Richard Elling            160

Weitere ähnliche Inhalte

Was ist angesagt?

Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage SystemAmdocs
 
ZFS Workshop
ZFS WorkshopZFS Workshop
ZFS WorkshopAPNIC
 
Zfs Nuts And Bolts
Zfs Nuts And BoltsZfs Nuts And Bolts
Zfs Nuts And BoltsEric Sproul
 
An Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickAn Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickeurobsdcon
 
JetStor NAS 724UXD Dual Controller Active-Active ZFS Based
JetStor NAS 724UXD Dual Controller Active-Active ZFS BasedJetStor NAS 724UXD Dual Controller Active-Active ZFS Based
JetStor NAS 724UXD Dual Controller Active-Active ZFS BasedGene Leyzarovich
 
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...NETWAYS
 
Lavigne bsdmag apr13
Lavigne bsdmag apr13Lavigne bsdmag apr13
Lavigne bsdmag apr13Dru Lavigne
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensMatthew Ahrens
 
Introduction to BTRFS and ZFS
Introduction to BTRFS and ZFSIntroduction to BTRFS and ZFS
Introduction to BTRFS and ZFSTsung-en Hsiao
 
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...Gábor Nyers
 

Was ist angesagt? (19)

Zettabyte File Storage System
Zettabyte File Storage SystemZettabyte File Storage System
Zettabyte File Storage System
 
ZFS Workshop
ZFS WorkshopZFS Workshop
ZFS Workshop
 
ZFS
ZFSZFS
ZFS
 
Zfs Nuts And Bolts
Zfs Nuts And BoltsZfs Nuts And Bolts
Zfs Nuts And Bolts
 
An Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusickAn Introduction to the Implementation of ZFS by Kirk McKusick
An Introduction to the Implementation of ZFS by Kirk McKusick
 
ZFS
ZFSZFS
ZFS
 
ZFS Talk Part 1
ZFS Talk Part 1ZFS Talk Part 1
ZFS Talk Part 1
 
Scale2014
Scale2014Scale2014
Scale2014
 
JetStor NAS 724UXD Dual Controller Active-Active ZFS Based
JetStor NAS 724UXD Dual Controller Active-Active ZFS BasedJetStor NAS 724UXD Dual Controller Active-Active ZFS Based
JetStor NAS 724UXD Dual Controller Active-Active ZFS Based
 
Flourish16
Flourish16Flourish16
Flourish16
 
Zfs intro v2
Zfs intro v2Zfs intro v2
Zfs intro v2
 
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
OSDC 2016 - Interesting things you can do with ZFS by Allan Jude&Benedict Reu...
 
Lavigne bsdmag apr13
Lavigne bsdmag apr13Lavigne bsdmag apr13
Lavigne bsdmag apr13
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
Introduction to BTRFS and ZFS
Introduction to BTRFS and ZFSIntroduction to BTRFS and ZFS
Introduction to BTRFS and ZFS
 
Fossetcon14
Fossetcon14Fossetcon14
Fossetcon14
 
110629 nexenta- andy bennett
110629   nexenta- andy bennett110629   nexenta- andy bennett
110629 nexenta- andy bennett
 
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...
Btrfs and Snapper - The Next Steps from Pure Filesystem Features to Integrati...
 
MySQL on ZFS
MySQL on ZFSMySQL on ZFS
MySQL on ZFS
 

Andere mochten auch

Writing Tools using WebKit
Writing Tools using WebKitWriting Tools using WebKit
Writing Tools using WebKitAriya Hidayat
 
Open ZFS Keynote (public)
Open ZFS Keynote (public)Open ZFS Keynote (public)
Open ZFS Keynote (public)Dustin Kirkland
 
Flash! (Modern File Systems)
Flash! (Modern File Systems)Flash! (Modern File Systems)
Flash! (Modern File Systems)David Evans
 
Ubuntu 16.04 LTS Security Features
Ubuntu 16.04 LTS Security FeaturesUbuntu 16.04 LTS Security Features
Ubuntu 16.04 LTS Security FeaturesDustin Kirkland
 

Andere mochten auch (6)

Writing Tools using WebKit
Writing Tools using WebKitWriting Tools using WebKit
Writing Tools using WebKit
 
B090827
B090827B090827
B090827
 
Open ZFS Keynote (public)
Open ZFS Keynote (public)Open ZFS Keynote (public)
Open ZFS Keynote (public)
 
Flash! (Modern File Systems)
Flash! (Modern File Systems)Flash! (Modern File Systems)
Flash! (Modern File Systems)
 
Ubuntu 16.04 LTS Security Features
Ubuntu 16.04 LTS Security FeaturesUbuntu 16.04 LTS Security Features
Ubuntu 16.04 LTS Security Features
 
Storage devices
Storage devicesStorage devices
Storage devices
 

Ähnlich wie ZFS Tutorial USENIX June 2009

제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustreTommy Lee
 
Cephfsglusterfs.talk
Cephfsglusterfs.talkCephfsglusterfs.talk
Cephfsglusterfs.talkUdo Seidel
 
Sharing experience implementing Direct NFS
Sharing experience implementing Direct NFSSharing experience implementing Direct NFS
Sharing experience implementing Direct NFSYury Velikanov
 
BayLISA - FreeNAS 10 by Jordan Hubbard
BayLISA - FreeNAS 10 by Jordan HubbardBayLISA - FreeNAS 10 by Jordan Hubbard
BayLISA - FreeNAS 10 by Jordan HubbardiXsystems
 
Newlug presentation- OpenSolaris
Newlug presentation- OpenSolarisNewlug presentation- OpenSolaris
Newlug presentation- OpenSolarisNEWLUG
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2markleeuw
 
Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrierFlytxt
 
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...IBM India Smarter Computing
 
IT Assist - ZFS on linux
IT Assist - ZFS on linuxIT Assist - ZFS on linux
IT Assist - ZFS on linuxIDG Romania
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Kernel TLV
 
Securing Your Linux System
Securing Your Linux SystemSecuring Your Linux System
Securing Your Linux SystemNovell
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databasesahl0003
 
Webinar - KQStor ZFS port on Linux
Webinar - KQStor ZFS port on LinuxWebinar - KQStor ZFS port on Linux
Webinar - KQStor ZFS port on Linuxkqinfotech
 
pnfs status
pnfs statuspnfs status
pnfs statusbergwolf
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfsRami Jebara
 
7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoopTaldor Group
 

Ähnlich wie ZFS Tutorial USENIX June 2009 (20)

OpenZFS at LinuxCon
OpenZFS at LinuxConOpenZFS at LinuxCon
OpenZFS at LinuxCon
 
제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre제3회난공불락 오픈소스 인프라세미나 - lustre
제3회난공불락 오픈소스 인프라세미나 - lustre
 
Introduction to OpenSolaris 2008.11
Introduction to OpenSolaris 2008.11Introduction to OpenSolaris 2008.11
Introduction to OpenSolaris 2008.11
 
Cephfsglusterfs.talk
Cephfsglusterfs.talkCephfsglusterfs.talk
Cephfsglusterfs.talk
 
Sharing experience implementing Direct NFS
Sharing experience implementing Direct NFSSharing experience implementing Direct NFS
Sharing experience implementing Direct NFS
 
BayLISA - FreeNAS 10 by Jordan Hubbard
BayLISA - FreeNAS 10 by Jordan HubbardBayLISA - FreeNAS 10 by Jordan Hubbard
BayLISA - FreeNAS 10 by Jordan Hubbard
 
Paravirtualized File Systems
Paravirtualized File SystemsParavirtualized File Systems
Paravirtualized File Systems
 
OpenZFS - AsiaBSDcon
OpenZFS - AsiaBSDconOpenZFS - AsiaBSDcon
OpenZFS - AsiaBSDcon
 
Newlug presentation- OpenSolaris
Newlug presentation- OpenSolarisNewlug presentation- OpenSolaris
Newlug presentation- OpenSolaris
 
New Oracle Infrastructure2
New Oracle Infrastructure2New Oracle Infrastructure2
New Oracle Infrastructure2
 
Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrier
 
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...
Problem Reporting and Analysis Linux on System z -How to survive a Linux Crit...
 
IT Assist - ZFS on linux
IT Assist - ZFS on linuxIT Assist - ZFS on linux
IT Assist - ZFS on linux
 
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
Emerging Persistent Memory Hardware and ZUFS - PM-based File Systems in User ...
 
Securing Your Linux System
Securing Your Linux SystemSecuring Your Linux System
Securing Your Linux System
 
ZFS for Databases
ZFS for DatabasesZFS for Databases
ZFS for Databases
 
Webinar - KQStor ZFS port on Linux
Webinar - KQStor ZFS port on LinuxWebinar - KQStor ZFS port on Linux
Webinar - KQStor ZFS port on Linux
 
pnfs status
pnfs statuspnfs status
pnfs status
 
Vancouver bug enterprise storage and zfs
Vancouver bug   enterprise storage and zfsVancouver bug   enterprise storage and zfs
Vancouver bug enterprise storage and zfs
 
7. emc isilon hdfs enterprise storage for hadoop
7. emc isilon hdfs   enterprise storage for hadoop7. emc isilon hdfs   enterprise storage for hadoop
7. emc isilon hdfs enterprise storage for hadoop
 

Kürzlich hochgeladen

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 

Kürzlich hochgeladen (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

ZFS Tutorial USENIX June 2009

  • 1. USENIX 2009 ZFS Tutorial Richard.Elling@RichardElling.com
  • 2. Agenda ● Overview ● Foundations ● Pooled Storage Layer ● Transactional Object Layer ● Commands – zpool – zfs ● Sharing ● Properties ● More goodies ● Performance ● Wrap June 13, 2009 © 2009 Richard Elling 2
  • 3. History ● Announced September 14, 2004 ● Integration history – SXCE b27 (November 2005) – FreeBSD (April 2007) – Mac OSX Leopard (~ June 2007) – OpenSolaris 2008.05 – Solaris 10 6/06 (June 2006) – Linux FUSE (summer 2006) – greenBytes ZFS+ (September 2008) ● More than 45 patents, contributed to the CDDL Patents Common June 13, 2009 © 2009 Richard Elling 3
  • 4. Brief List of Features ● Future-proof ● “No silent data corruption ever” ● Cutting-edge data integrity ● “Mind-boggling scalability” ● High performance ● “Breathtaking speed” ● Simplified administration ● “Near zero administration” ● Eliminates need for volume ● “Radical new architecture” managers ● “Greatly simplifies support ● Reduced costs issues” ● Compatibility with POSIX file ● “RAIDZ saves money” system & block devices ● Self-healing Marketing: 2 drink minimum June 13, 2009 © 2009 Richard Elling 4
  • 5. ZFS Design Goals ● Figure out why storage has gotten so complicated ● Blow away 20+ years of obsolete assumptions ● Gotta replace UFS ● Design an integrated system from scratch ● End the suffering June 13, 2009 © 2009 Richard Elling 5
  • 6. Limits 248 — Number of entries in any individual directory 256 — Number of attributes of a f le [1] i 256 — Number of f les in a directory [1] i 16 EiB (264 bytes) — Maximum size of a f le system i 16 EiB — Maximum size of a single file 16 EiB — Maximum size of any attribute 264 — Number of devices in any pool 264 — Number of pools in a system 264 — Number of f le systems in a pool i 264 — Number of snapshots of any f le system i 256 ZiB (278 bytes) — Maximum size of any pool [1] actually constrained to 248 for the number of f les in a ZFS f le system i i June 13, 2009 © 2009 Richard Elling 6
  • 7. Sidetrack: Understanding Builds ● Build is often referenced when speaking of feature/bug integration ● Short-hand notation: b# ● OpenSolaris and SXCE are based on NV ● ZFS development done for NV – Bi-weekly build cycle – Schedule at http://opensolaris.org/os/community/on/schedule/ ● ZFS is ported to Solaris 10 and other OSes June 13, 2009 © 2009 Richard Elling 7
  • 8. Foundations June 13, 2009 © 2009 Richard Elling 8
  • 9. Overhead View of a Pool Pool File System Configuration Information Volume File System Volume Dataset June 13, 2009 © 2009 Richard Elling 9
  • 10. Layer View raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 10
  • 11. Source Code Structure File system Device GUI Mgmt Consumer Consumer JNI User libzfs Kernel Interface ZPL ZVol /dev/zfs Layer Transactional ZIL ZAP Traversal Object Layer DMU DSL ARC Pooled Storage ZIO Layer VDEV Configuration June 13, 2009 © 2009 Richard Elling 11
  • 12. Acronyms ● ARC – Adaptive Replacement Cache ● DMU – Data Management Unit ● DSL – Dataset and Snapshot Layer ● JNI – Java Native InterfaceZPL – ZFS POSIX Layer (traditional file system interface) ● VDEV – Virtual Device layer ● ZAP – ZFS Attribute Processor ● ZIL – ZFS Intent Log ● ZIO – ZFS I/O layer ● Zvol – ZFS volume (raw/cooked block device interface) June 13, 2009 © 2009 Richard Elling 12
  • 13. nvlists ● name=value pairs ● libnvpair(3LIB) ● Allows ZFS capabilities to change without changing the physical on- disk format ● Data stored is XDR encoded ● A good thing, used often June 13, 2009 © 2009 Richard Elling 13
  • 14. Versioning ● Features can be added and identified by nvlist entries ● Change in pool or dataset versions do not change physical on-disk format (!) – does change nvlist parameters ● Older-versions can be used – might see warning messages, but harmless ● Available versions and features can be easily viewed – zpool upgrade -v – zfs upgrade -v ● Online references – zpool: www.opensolaris.org/os/community/zfs/version/N – zfs: www.opensolaris.org/os/community/zfs/version/zpl/N Don't confuse zpool and zfs versions June 13, 2009 © 2009 Richard Elling 14
  • 15. zpool versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS version 2 Ditto blocks (replicated metadata) 3 Hot spares and double parity RAID-Z 4 zpool history 5 Compression using the gzip algorithm 6 bootfs pool property 7 Separate intent log devices 8 Delegated administration 9 refquota and refreservation properties 10 Cache devices 11 Improved scrub performance 12 Snapshot properties 13 snapused property 14 passthrough-x aclinherit support 15 user and group quotas 16 COMSTAR support June 13, 2009 © 2009 Richard Elling 15
  • 16. zfs versions VER DESCRIPTION --- -------------------------------------------------------- 1 Initial ZFS filesystem version 2 Enhanced directory entries 3 Case insensitive and File system unique identifier (FUID) 4 user and group quotas June 13, 2009 © 2009 Richard Elling 16
  • 17. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 17
  • 18. COW Notes ● COW works on blocks, not files ● ZFS reserves 32 MBytes or 1/64 of pool size – COWs need some free space to remove files – need space for ZIL ● For fixed-record size workloads “fragmentation” and “poor performance” can occur if the recordsize is not matched ● Spatial distribution is good fodder for performance speculation – affects HDDs – moot for SSDs June 13, 2009 © 2009 Richard Elling 18
  • 19. Pooled Storage Layer raw swap dump iSCSI ?? ZFS NFS CIFS ?? ZFS Volume Emulator (Zvol) ZFS POSIX Layer (ZPL) pNFS Lustre ?? Transactional Object Layer Pooled Storage Layer Block Device Driver HDD SSD iSCSI ?? June 13, 2009 © 2009 Richard Elling 19
  • 20. vdevs – Virtual Devices Logical vdevs root vdev top-level vdev top-level vdev children[0] children[1] mirror mirror vdev vdev vdev vdev type=disk type=disk type=disk type=disk children[0] children[1] children[0] children[1] Physical or leaf vdevs June 13, 2009 © 2009 Richard Elling 20
  • 21. vdev Labels ● vdev labels != disk labels ● 4 labels written to every physical vdev ● Label size = 256kBytes ● Two-stage update process – write label0 & label2 – check for errors – write label1 & label3 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block June 13, 2009 © 2009 Richard Elling 21
  • 22. vdev Label Contents 0 256k 512k 4M N-512k N-256k N Boot label0 label1 label2 label3 Block Boot Name=Value Blank Header Pairs 128-slot Uberblock Array 0 8k 16k 128k 256k June 13, 2009 © 2009 Richard Elling 22
  • 23. Observing Labels # zdb -l /dev/rdsk/c0t0d0s0 -------------------------------------------- LABEL 0 -------------------------------------------- version=14 name='rpool' state=0 txg=13152 pool_guid=17111649328928073943 hostid=8781271 hostname='' top_guid=11960061581853893368 guid=11960061581853893368 vdev_tree type='disk' id=0 guid=11960061581853893368 path='/dev/dsk/c0t0d0s0' devid='id1,sd@SATA_____ST3500320AS_________________9QM3FWFT/a' phys_path='/pci@0,0/pci1458,b002@11/disk@0,0:a' whole_disk=0 metaslab_array=24 metaslab_shift=30 ashift=9 asize=157945167872 is_log=0 June 13, 2009 © 2009 Richard Elling 23
  • 24. Uberblocks ● 1 kByte ● Stored in 128-entry circular queue ● Only one uberblock is active at any time – highest transaction group number – correct SHA-256 checksum ● Stored in machine's native format – A magic number is used to determine endian format when imported ● Contains pointer to MOS June 13, 2009 © 2009 Richard Elling 24
  • 25. MOS – Meta Object Set ● Only one MOS per pool ● Contains object directory pointers – root_dataset – references all top-level datasets in the pool – config – nvlist describing the pool configuration – sync_bplist – list of block pointers which need to be freed during the next transaction June 13, 2009 © 2009 Richard Elling 25
  • 26. Block Pointers ● blkptr_t structure ● 128 bytes ● contents: – 3x data virtual address (DVA) – endianess – level of indirection – DMU object type – checksum function – compression function – physical size – logical size – birth txg – fill count – checksum (256 bits) June 13, 2009 © 2009 Richard Elling 26
  • 27. DVA – Data Virtual Address ● Contains – vdev id – offset in sectors – grid (future) – allocated size – gang block indicator ● Physical block address = (offset << 9) + 4 MBytes June 13, 2009 © 2009 Richard Elling 27
  • 28. Gang Blocks ● Gang blocks contain block pointers ● Used when space requested is not available in a contiguous block ● 512 bytes ● self checksummed ● contains 3 block pointers June 13, 2009 © 2009 Richard Elling 28
  • 29. To fsck or not to fsck ● fsck was created to fix known inconsistencies in file system metadata – UFS is not transactional – metadata inconsistencies must be reconciled – does NOT repair data – how could it? ● ZFS doesn't need fsck, as-is – all on-disk changes are transactional – COW means previously existing, consistent metadata is not overwritten – ZFS can repair itself ● metadata is at least dual-redundant ● data can also be redundant ● Reality check – this does not mean that ZFS is not susceptible to corruption – nor is any other file system June 13, 2009 © 2009 Richard Elling 29
  • 30. VDEV June 13, 2009 © 2009 Richard Elling 30
  • 31. Dynamic Striping ● RAID-0 – SNIA definition: fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern ● Dynamic Stripe – Data is dynamically mapped to member disks – No fixed-length sequences – Allocate up to ~1 MByte/vdev before changing vdev – vdevs can be different size – Good combination of the concatenation feature with RAID-0 performance June 13, 2009 © 2009 Richard Elling 31
  • 32. Dynamic Striping RAID-0 Column size = 128 kBytes, stripe width = 384 kBytes ZFS Dynamic Stripe recordsize = 128 kBytes Total write size = 2816 kBytes June 13, 2009 © 2009 Richard Elling 32
  • 33. Mirroring ● Straightforward: put N copies of the data on N vdevs ● Unlike RAID-1 – No 1:1 mapping at the block level – vdev labels are still at beginning and end – vdevs can be of different size ● effective space is that of smallest vdev ● Arbitration: ZFS does not blindly trust either side of mirror – Most recent, correct view of data wins – Checksums validate data June 13, 2009 © 2009 Richard Elling 33
  • 34. Mirroring June 13, 2009 © 2009 Richard Elling 34
  • 35. Dynamic vdev Replacement ● zpool replace poolname vdev [vdev] ● Today, replacing vdev must be same size or larger (as measured by blocks) ● Replacing all vdevs in a top-level vdev with larger vdevs results in (automatic?) top-level vdev resizing 15G 10G 10G 15G 10G 20G 15G 20G 10G 15G 20G 20G 20G 20G 10G 10G Mirror 10G Mirror 15G Mirror 15G Mirror 20G Mirror June 13, 2009 © 2009 Richard Elling 35
  • 36. RAIDZ ● RAID-5 – Parity check data is distributed across the RAID array's disks – Must read/modify/write when data is smaller than stripe width ● RAIDZ – Dynamic data placement – Parity added as needed – Writes are full-stripe writes – No read/modify/write (write hole) ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If checksum fails, read parity Space used is dependent on how used June 13, 2009 © 2009 Richard Elling 36
• 37. RAID-5 vs RAIDZ
RAID-5 (fixed-width stripes, rotating parity):
DiskA DiskB DiskC DiskD DiskE
D0:0  D0:1  D0:2  D0:3  P0
P1    D1:0  D1:1  D1:2  D1:3
D2:3  P2    D2:0  D2:1  D2:2
D3:2  D3:3  P3    D3:0  D3:1
RAIDZ (variable-width stripes):
DiskA DiskB DiskC DiskD DiskE
P0    D0:0  D0:1  D0:2  D0:3
P1    D1:0  D1:1  P2:0  D2:0
D2:1  D2:2  D2:3  P2:1  D2:4
D2:5
June 13, 2009 © 2009 Richard Elling 37
• 38. RAID-5 Write Hole ● Occurs when data to be written is smaller than stripe size ● Must read the remaining columns to recalculate the parity, or read/modify/write the parity ● Read/modify/write is risky for consistency – Multiple disks – Reading independently – Writing independently – System failure before all writes are complete to media could result in data loss ● Effects can be hidden from the host by a RAID array with nonvolatile write cache, but the extra I/O cannot be hidden from the disks June 13, 2009 © 2009 Richard Elling 38
• 39. RAIDZ2 ● RAIDZ2 = double parity RAIDZ – Can recover data if any 2 leaf vdevs fail ● Sorta like RAID-6 – Parity 1: XOR – Parity 2: another Reed-Solomon syndrome ● More computationally expensive than RAIDZ ● Arbitration: ZFS does not blindly trust any device – Does not rely on disk reporting read error – Checksums validate data – If data not valid, read parity – If data still not valid, read other parity Space used depends on how the pool is used June 13, 2009 © 2009 Richard Elling 39
• 40. Evaluating Data Retention ● MTTDL = Mean Time To Data Loss ● Note: MTBF is not constant in the real world, but keeps math simple ● MTTDL[1] is a simple MTTDL model ● No parity (single vdev, striping, RAID-0) – MTTDL[1] = MTBF / N ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2) June 13, 2009 © 2009 Richard Elling 40
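As a back-of-the-envelope sketch, the single-parity formula can be evaluated in shell with bc; the MTBF, N, and MTTR values below are illustrative only:
  MTBF=1000000; N=8; MTTR=24                      # hours, disks, hours
  echo "$MTBF^2 / ($N * ($N - 1) * $MTTR)" | bc   # MTTDL[1] for single parity, in hours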
• 41. Another MTTDL Model ● MTTDL[1] model doesn't take unrecoverable reads into account ● But unrecoverable reads (UER) are becoming the dominant failure mode – UER specified as errors per bits read – More bits = higher probability of loss per vdev ● MTTDL[2] model considers UER June 13, 2009 © 2009 Richard Elling 41
• 42. Why Worry about UER? ● Richard's study – 3,684 hosts with 12,204 LUNs – 11.5% of all LUNs reported read errors ● Bairavasundaram et al., FAST '08 www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf – 1.53M LUNs over 41 months – RAID reconstruction discovers 8% of checksum mismatches – 4% of disks studied developed checksum errors over 17 months June 13, 2009 © 2009 Richard Elling 42
  • 43. Why Worry about UER? ● RAID array study June 13, 2009 © 2009 Richard Elling 43
• 44. MTTDL[2] Model ● Probability that a reconstruction will fail – Precon_fail = (N-1) * size / UER ● Model doesn't work for non-parity schemes (single vdev, striping, RAID-0) ● Single Parity (mirror, RAIDZ, RAID-1, RAID-5) – MTTDL[2] = MTBF / (N * Precon_fail) ● Double Parity (3-way mirror, RAIDZ2, RAID-6) – MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail) June 13, 2009 © 2009 Richard Elling 44
  • 45. Practical View of MTTDL[1] June 13, 2009 © 2009 Richard Elling 45
  • 46. MTTDL Models: Mirror June 13, 2009 © 2009 Richard Elling 46
  • 47. MTTDL Models: RAIDZ2 June 13, 2009 © 2009 Richard Elling 47
• 48. Ditto Blocks ● Recall that each blkptr_t contains 3 DVAs ● Allows up to 3 physical copies of the data
ZFS copies parameter   Data copies   Metadata copies
default                1             2
copies=2               2             3
copies=3               3             3
June 13, 2009 © 2009 Richard Elling 48
• 49. Copies ● Dataset property used to indicate how many copies (aka ditto blocks) of data are desired – Write all copies – Read any copy – Recover corrupted read from a copy ● By default – data copies=1 – metadata copies = data copies + 1, up to a maximum of 3 ● Not a replacement for mirroring ● Easier to describe in pictures... June 13, 2009 © 2009 Richard Elling 49
  • 50. Copies in Pictures June 13, 2009 © 2009 Richard Elling 50
  • 51. Copies in Pictures June 13, 2009 © 2009 Richard Elling 51
  • 52. ZIO – ZFS I/O Layer June 13, 2009 © 2009 Richard Elling 52
  • 53. ZIO Framework ● All physical disk I/O goes through ZIO Framework ● Translates DVAs into Logical Block Address (LBA) on leaf vdevs – Keeps free space maps (spacemap) – If contiguous space is not available: ● Allocate smaller blocks (the gang) ● Allocate gang block, pointing to the gang ● Implemented as multi-stage pipeline – Allows extensions to be added fairly easily ● Handles I/O errors June 13, 2009 © 2009 Richard Elling 53
  • 54. SpaceMap from Space June 13, 2009 © 2009 Richard Elling 54
• 55. ZIO Write Pipeline (diagram) – pipeline columns: ZIO state, compression, crypto, checksum, DVA, vdev I/O; stages in order: open → compress (if savings > 12.5%) → encrypt → generate checksum → allocate DVA → vdev I/O (start / done / assess) → done; gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 55
• 56. ZIO Read Pipeline (diagram) – pipeline columns: ZIO state, compression, crypto, checksum, DVA, vdev I/O; stages in order: open → vdev I/O (start / done / assess) → verify checksum → decrypt → decompress → done; gang activity elided, for clarity June 13, 2009 © 2009 Richard Elling 56
• 57. VDEV – Virtual Device Subsystem ● Where mirrors, RAIDZ, and RAIDZ2 are implemented – Surprisingly few lines of code needed to implement RAID ● Leaf vdev (physical device) I/O management – Number of outstanding iops – Read-ahead cache ● Priority scheduling
Name          Priority
NOW           0
SYNC_READ     0
SYNC_WRITE    0
FREE          0
CACHE_FILL    0
LOG_WRITE     0
ASYNC_READ    4
ASYNC_WRITE   4
RESILVER      10
SCRUB         20
June 13, 2009 © 2009 Richard Elling 57
  • 58. ARC – Adaptive Replacement Cache June 13, 2009 © 2009 Richard Elling 58
• 59. Object Cache ● UFS uses page cache managed by the virtual memory system ● ZFS does not use the page cache, except for mmap'ed files ● ZFS uses an Adaptive Replacement Cache (ARC) ● ARC used by DMU to cache DVA data objects ● Only one ARC per system, but caching policy can be changed on a per-dataset basis ● Seems to work much better than page cache ever did for UFS June 13, 2009 © 2009 Richard Elling 59
• 60. Traditional Cache ● Works well when data being accessed was recently added ● Doesn't work so well when frequently accessed data is evicted (diagram: misses insert at the MRU end and the oldest entry is evicted from the LRU end; dynamic caches can change size by either not evicting or aggressively evicting) June 13, 2009 © 2009 Richard Elling 60
• 61. ARC – Adaptive Replacement Cache (diagram: two caches – a Recent cache fed by misses at its MRU end and a Frequent cache fed by hits at its MRU end; each evicts its oldest entry from the LRU end, single-use entries from Recent and multiply-accessed entries from Frequent; evictions and dynamic resizing must choose the best cache to evict from, i.e. shrink) June 13, 2009 © 2009 Richard Elling 61
• 62. ZFS ARC – Adaptive Replacement Cache with Locked Pages (diagram: same Recent/Frequent structure as the ARC, but locked pages cannot be evicted; a hit promotes to the Frequent cache if it occurs within 62 ms; the ZFS ARC handles mixed-size pages) June 13, 2009 © 2009 Richard Elling 62
• 63. ARC Directory ● Each ARC directory entry contains arc_buf_hdr structs – Info about the entry – Pointer to the entry ● Directory entries are ~200 bytes each ● ZFS block size is dynamic, 512 bytes – 128 kBytes ● Disks are large ● Suppose we use a Seagate LP 2 TByte disk for the L2ARC – Disk has 3,907,029,168 512 byte sectors, guaranteed – Workload uses 8 kByte fixed record size – RAM needed for arc_buf_hdr entries ● Need = 3,907,029,168 / 16 * 200 ≈ 45 GBytes ● Don't underestimate the RAM needed for large L2ARCs June 13, 2009 © 2009 Richard Elling 63
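The same sizing estimate as a shell sketch (the disk size, record size, and ~200 byte header size are taken from the slide and are approximations):
  sectors=3907029168          # 512-byte sectors on the 2 TByte L2ARC device
  recordsize_sectors=16       # 8 kByte records
  hdr_bytes=200               # approximate arc_buf_hdr size
  echo $(( sectors / recordsize_sectors * hdr_bytes / 1024 / 1024 / 1024 )) GBytes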
• 64. L2ARC – Level 2 ARC ● ARC evictions are sent to the cache vdev ● ARC directory remains in memory ● Works well when the cache vdev is optimized for fast reads – lower latency than pool disks – inexpensive way to “increase memory” ● Content considered volatile, no ZFS data protection allowed ● Monitor usage with zpool iostat (diagram: evicted data flows from the ARC to the cache vdevs) June 13, 2009 © 2009 Richard Elling 64
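A minimal sketch of adding a read-optimized SSD as a cache device and watching it fill (pool and device names are hypothetical):
  zpool add tank cache c4t0d0      # the device becomes an L2ARC cache vdev
  zpool iostat -v tank 5           # cache devices are reported in their own section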
  • 65. ARC Tips ● In general, it seems to work well for most workloads ● ARC size will vary, based on usage ● Internals tracked by kstats in Solaris – Use memory_throttle_count to observe pressure to evict ● Can limit at boot time – Solaris – set zfs:zfs_arc_max in /etc/system ● Performance – Prior to b107, L2ARC fill rate was limited to 8 MBytes/s L2ARC keeps its directory in kernel memory June 13, 2009 © 2009 Richard Elling 65
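For example, to cap the ARC at roughly 4 GBytes on Solaris (the value is illustrative; takes effect after a reboot):
  # append to /etc/system, then reboot
  set zfs:zfs_arc_max = 0x100000000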
  • 66. Transactional Object Layer June 13, 2009 © 2009 Richard Elling 66
• 67. Source Code Structure (diagram) – consumers (file system consumer, device consumer, GUI management via JNI) sit on libzfs in user space; the kernel interface layer (ZPL, ZVol, /dev/zfs) sits on the transactional object layer (ZIL, ZAP, traversal, DMU, DSL), which sits on the pooled storage layer (ARC, ZIO, VDEV, configuration) June 13, 2009 © 2009 Richard Elling 67
  • 68. ZAP – ZFS Attribute Processor ● Module sits on top of DMU ● Important component for managing everything ● Operates on ZAP objects – Contain name=value pairs ● FatZAP – Flexible architecture for storing large numbers of attributes ● MicroZAP – Lightweight version of fatzap – Uses 1 block – All name=value pairs must fit in block – Names <= 50 chars (including NULL terminator) – Values are type uint64_t June 13, 2009 © 2009 Richard Elling 68
• 69. DMU – Data Management Unit ● Datasets issue transactions to the DMU ● Transaction-based object model ● Transactions are – Atomic – Grouped (txg = transaction group) ● Responsible for on-disk data ● ZFS Attribute Processor (ZAP) ● Dataset and Snapshot Layer (DSL) ● ZFS Intent Log (ZIL) June 13, 2009 © 2009 Richard Elling 69
• 70. Transaction Engine ● Manages physical I/O ● Transactions grouped into transaction group (txg) – txg updates – All-or-nothing – Commit interval ● Older versions: 5 seconds (zfs_txg_timeout) ● Now: 30 seconds max, dynamically scaled based on the time required to commit the txg ● Delay committing data to physical storage – Improves performance – A bad thing for sync workloads – hence the ZFS Intent Log (ZIL) 30 second delay could impact failure detection time June 13, 2009 © 2009 Richard Elling 70
• 71. ZIL – ZFS Intent Log ● DMU is transactional, and likes to group I/O into transactions for later commits, but still needs to handle the “write it now” desire of sync writers – NFS – Databases ● If I/O < 32 kBytes – write it (now) to ZIL (allocated from pool) – write it later as part of the txg commit ● If I/O > 32 kBytes, write it to pool now – Should be faster for large, sequential writes ● Never read, except at import (e.g. reboot), when transactions may need to be rolled forward June 13, 2009 © 2009 Richard Elling 71
  • 72. Separate Logs (slogs) ● ZIL competes with pool for iops – Applications will wait for sync writes to be on nonvolatile media – Very noticeable on HDD JBODs ● Put ZIL on separate vdev, outside of pool – ZIL writes tend to be sequential – No competition with pool for iops – Downside: slog device required to be operational at import ● 10x or more performance improvements possible – Better if using write-optimized SSD or non-volatile write cache on RAID array ● Use zilstat to observe ZIL activity June 13, 2009 © 2009 Richard Elling 72
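A sketch of adding a slog, singly or mirrored (device names hypothetical; mirroring is prudent since the slog must be present at import):
  zpool add tank log c5t0d0                  # single log device
  zpool add tank log mirror c5t0d0 c6t0d0    # or, instead, a mirrored log device
  zpool iostat -v tank 5                     # the log device is reported separately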
  • 73. DSL – Dataset and Snapshot Layer June 13, 2009 © 2009 Richard Elling 73
• 74. Copy on Write 1. Initial block tree 2. COW some data 3. COW metadata 4. Update Uberblocks & free June 13, 2009 © 2009 Richard Elling 74
  • 75. zfs snapshot ● Create a read-only, point-in-time window into the dataset (file system or Zvol) ● Computationally free, because of COW architecture ● Very handy feature – Patching/upgrades – Basis for Time Slider June 13, 2009 © 2009 Richard Elling 75
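For example (dataset names hypothetical):
  zfs snapshot tank/home/relling@before-patch   # snapshot one dataset
  zfs snapshot -r tank/home@nightly             # recursive: snapshot all descendants too
  zfs list -t snapshot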
• 76. Snapshot (diagram: a snapshot tree root and the current tree root share the unchanged blocks) ● Create a snapshot by not freeing COWed blocks ● Snapshot creation is fast and easy ● Number of snapshots determined by use – no hardwired limit ● Recursive snapshots also possible June 13, 2009 © 2009 Richard Elling 76
  • 77. Clones ● Snapshots are read-only ● Clones are read-write based upon a snapshot ● Child depends on parent – Cannot destroy parent without destroying all children – Can promote children to be parents ● Good ideas – OS upgrades – Change control – Replication ● zones ● virtual disks June 13, 2009 © 2009 Richard Elling 77
• 78. zfs clone ● Create a read-write file system from a read-only snapshot ● Used extensively for OpenSolaris upgrades (diagram: the OS rev1 file system is snapshotted, the snapshot is cloned, the clone is upgraded to OS rev2, and the boot manager selects between the two) Origin snapshot cannot be destroyed, if clone exists June 13, 2009 © 2009 Richard Elling 78
• 79. zfs promote (diagram: rpool/ROOT/b104 is snapshotted as rpool/ROOT/b104@today and cloned to rpool/ROOT/b105; after zfs promote, rpool/ROOT/b105 becomes the parent and the snapshot becomes rpool/ROOT/b105@today) June 13, 2009 © 2009 Richard Elling 79
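The same flow as commands (the rpool/ROOT names follow the diagram but are illustrative):
  zfs snapshot rpool/ROOT/b104@today
  zfs clone rpool/ROOT/b104@today rpool/ROOT/b105   # read-write clone of the snapshot
  # ... upgrade and boot the clone ...
  zfs promote rpool/ROOT/b105                       # b105 becomes the parent; b104 can now be destroyed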
• 80. zfs rollback (diagram: rpool/ROOT/b104 is rolled back to the snapshot rpool/ROOT/b104@today, discarding changes made since that snapshot) June 13, 2009 © 2009 Richard Elling 80
  • 81. Commands June 13, 2009 © 2009 Richard Elling 81
• 82. zpool(1m) (diagram of the ZFS stack: raw, swap, dump, and iSCSI consumers sit on the ZFS Volume Emulator (Zvol); ZFS, NFS, CIFS, pNFS, and Lustre consumers sit on the ZFS POSIX Layer (ZPL); both sit on the transactional object layer and the pooled storage layer, which drive block device drivers for HDD, SSD, iSCSI, and other devices) June 13, 2009 © 2009 Richard Elling 82
• 83. Dataset & Snapshot Layer ● Object – Allocated storage – dnode describes the collection of blocks ● Object Set – Group of related objects ● Dataset – Snapmap: snapshot relationships – Space usage ● Dataset directory – Childmap: dataset relationships – Properties (diagram: a dataset directory, with its childmap and properties, points to a dataset, with its snapmap, which points to an object set containing the objects) June 13, 2009 © 2009 Richard Elling 83
• 84. zpool create ● zpool create poolname vdev-configuration – vdev-configuration examples ● mirror c0t0d0 c3t6d0 ● mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6 ● mirror disk1s0 disk2s0 cache disk4s0 log disk5 ● raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0 ● Solaris – Additional checks to see if disk/slice overlaps or is currently in use – Whole disks are given EFI labels ● Can set initial pool or dataset properties ● By default, creates a file system with the same name – poolname pool → /poolname file system People get confused by a file system with the same name as the pool June 13, 2009 © 2009 Richard Elling 84
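A concrete sketch (device names hypothetical; -O sets a file system property on the pool's top dataset at creation time, assuming a ZFS version that supports it):
  zpool create -O compression=on tank mirror c0t0d0 c1t0d0
  zpool status tank
  zfs list tank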
  • 85. zpool destroy ● Destroy the pool and all datasets therein ● zpool destroy poolname ● Can (try to) force with “-f” ● There is no “are you sure?” prompt – if you weren't sure, you would not have typed “destroy” zpool destroy is destructive... really! Use with caution! June 13, 2009 © 2009 Richard Elling 85
  • 86. zpool add ● Adds a device to the pool as a top-level vdev ● zpool add poolname vdev-configuration ● vdev-configuration can be any combination also used for zpool create ● Complains if the added vdev-configuration would cause a different data protection scheme than is already in use – use “-f” to override ● Good idea: try with “-n” flag first – will show final configuration without actually performing the add Do not add a device which is in use as a quorum device June 13, 2009 © 2009 Richard Elling 86
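For example, a dry run before committing (names hypothetical):
  zpool add -n tank mirror c2t0d0 c3t0d0    # show the resulting configuration only
  zpool add tank mirror c2t0d0 c3t0d0       # actually add the new top-level vdev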
  • 87. zpool remove ● Remove a top-level vdev from the pool ● zpool remove poolname vdev ● Today, you can only remove the following vdevs: – cache – hot spare ● An RFE is open to allow removal of other top-level vdevs Don't confuse “remove” with “detach” June 13, 2009 © 2009 Richard Elling 87
• 88. zpool attach ● Attach a vdev as a mirror to an existing vdev ● zpool attach poolname existing-vdev vdev ● Attaching vdev must be the same size or larger than the existing vdev ● Note: today this is not available for RAIDZ or RAIDZ2 vdevs
vdev configuration     attach?
simple vdev → mirror   ok
mirror                 ok
log → mirrored log     ok
RAIDZ                  no
RAIDZ2                 no
“Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 88
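For example, converting a simple vdev into a two-way mirror (hypothetical devices):
  zpool attach tank c0t0d0 c1t0d0    # c1t0d0 is resilvered as a mirror of c0t0d0
  zpool status tank                  # watch resilver progress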
  • 89. zpool detach ● Detach a vdev from a mirror ● zpool detach poolname vdev ● A resilvering vdev will wait until resilvering is complete June 13, 2009 © 2009 Richard Elling 89
• 90. zpool replace ● Replaces an existing vdev with a new vdev ● zpool replace poolname existing-vdev vdev ● Effectively, a shorthand for “zpool attach” followed by “zpool detach” ● Attaching vdev must be the same size or larger than the existing vdev ● Works for any top-level vdev-configuration, including RAIDZ and RAIDZ2
vdev configuration   replace?
simple vdev          ok
mirror               ok
log                  ok
RAIDZ                ok
RAIDZ2               ok
“Same size” literally means the same number of blocks. Beware that many “same size” disks have different number of available blocks. June 13, 2009 © 2009 Richard Elling 90
  • 91. zpool import ● Import a pool and mount all mountable datasets ● Import a specific pool – zpool import poolname – zpool import GUID ● Scan LUNs for pools which may be imported – zpool import ● Can set options, such as alternate root directory or other properties Beware of zpool.cache interactions Beware of artifacts, especially partial artifacts June 13, 2009 © 2009 Richard Elling 91
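Typical usage (pool name hypothetical):
  zpool import              # scan attached LUNs for importable pools
  zpool import tank         # import by name
  zpool import -R /a tank   # import under an alternate root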
  • 92. zpool export ● Unmount datasets and export the pool ● zpool export poolname ● Removes pool entry from zpool.cache June 13, 2009 © 2009 Richard Elling 92
  • 93. zpool upgrade ● Display current versions – zpool upgrade ● View available upgrade versions, with features, but don't actually upgrade – zpool upgrade -v ● Upgrade pool to latest version – zpool upgrade poolname ● Upgrade pool to specific version – zpool upgrade -V version poolname Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 93
  • 94. zpool history ● Show history of changes made to the pool # zpool history rpool History for 'rpool': 2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0 2009-03-04.07:29:47 zfs set canmount=noauto rpool 2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool 2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT 2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap 2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump 2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106 2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106 2009-03-04.07:29:51 zfs set canmount=on rpool 2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export 2009-03-04.07:29:51 zfs create rpool/export/home 2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943 2009-03-04.00:21:42 zpool export rpool 2009-03-04.08:47:08 zpool set bootfs=rpool rpool 2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool 2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108 2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108 ... June 13, 2009 © 2009 Richard Elling 94
  • 95. zpool status ● Shows the status of the current pools, including their configuration ● Important troubleshooting step # zpool status … pool: stuff state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions. scrub: none requested config: NAME STATE READ WRITE CKSUM stuff ONLINE 0 0 0 mirror ONLINE 0 0 0 c0t2d0s0 ONLINE 0 0 0 c0t0d0s7 ONLINE 0 0 0 errors: No known data errors Understanding status output error messages can be tricky June 13, 2009 © 2009 Richard Elling 95
  • 96. zpool clear ● Clears device errors ● Clears device error counters ● Improves sysadmin sanity and reduces sweating June 13, 2009 © 2009 Richard Elling 96
  • 97. zpool iostat ● Show pool physical I/O activity, in an iostat-like manner ● Solaris: fsstat will show I/O activity looking into a ZFS file system ● Especially useful for showing slog activity # zpool iostat -v capacity operations bandwidth pool used avail read write read write ------------ ----- ----- ----- ----- ----- ----- rpool 16.5G 131G 0 0 1.16K 2.80K c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K ------------ ----- ----- ----- ----- ----- ----- stuff 135G 14.4G 0 5 2.09K 27.3K mirror 135G 14.4G 0 5 2.09K 27.3K c0t2d0s0 - - 0 3 1.25K 27.5K c0t0d0s7 - - 0 2 1.27K 27.5K ------------ ----- ----- ----- ----- ----- ----- Unlike iostat, does not show latency June 13, 2009 © 2009 Richard Elling 97
  • 98. zpool scrub ● Manually starts scrub – zpool scrub poolname ● Scrubbing performed in background ● Use zpool status to track scrub progress ● Stop scrub – zpool scrub -s poolname Estimated scrub completion time improves over time June 13, 2009 © 2009 Richard Elling 98
• 99. zfs(1m) ● Manages file systems (ZPL) and Zvols ● Can proxy to other, related commands – iSCSI, NFS, CIFS (diagram: raw, swap, dump, and iSCSI consumers sit on the ZFS Volume Emulator (Zvol); ZFS, NFS, CIFS, pNFS, and Lustre consumers sit on the ZFS POSIX Layer (ZPL); both sit on the transactional object layer and pooled storage layer) June 13, 2009 © 2009 Richard Elling 99
  • 100. zfs create, destroy ● By default, a file system with the same name as the pool is created by zpool create ● Name format is: pool/name[/name ...] ● File system – zfs create fs-name – zfs destroy fs-name ● Zvol – zfs create -V size vol-name – zfs destroy vol-name ● Parameters can be set at create time June 13, 2009 © 2009 Richard Elling 100
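For example (names and sizes illustrative):
  zfs create -o compression=on tank/home    # file system, parameter set at create time
  zfs create -V 10g tank/vol1               # 10 GByte Zvol
  zfs destroy tank/vol1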
  • 101. zfs mount, unmount ● Note: mount point is a file system parameter – zfs get mountpoint fs-name ● Rarely used subcommand (!) ● Display mounted file systems – zfs mount ● Mount a file system – zfs mount fs-name – zfs mount -a ● Unmount – zfs unmount fs-name – zfs unmount -a June 13, 2009 © 2009 Richard Elling 101
  • 102. zfs list ● List mounted datasets ● Old versions: listed everything ● New versions: do not list snapshots ● Examples – zfs list – zfs list -t snapshot – zfs list -H -o name June 13, 2009 © 2009 Richard Elling 102
• 103. zfs send, receive ● Send – send a snapshot to stdout – data is decompressed ● Receive – receive a snapshot from stdin – receiving file system parameters apply (compression, etc.) ● Can incrementally send snapshots in time order ● Handy way to replicate dataset snapshots ● NOT a replacement for traditional backup solutions – All-or-nothing design per snapshot – In general, does not send files (!) – Today, no per-file management Send streams from b35 (or older) no longer supported after b89 June 13, 2009 © 2009 Richard Elling 103
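A sketch of replicating snapshots to a second pool, full then incremental (names hypothetical):
  zfs snapshot tank/home@monday
  zfs send tank/home@monday | zfs receive backup/home                        # full stream
  zfs snapshot tank/home@tuesday
  zfs send -i tank/home@monday tank/home@tuesday | zfs receive backup/home   # incremental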
• 104. zfs rename ● Renames a file system, volume, or snapshot – zfs rename export/home/relling export/home/richard June 13, 2009 © 2009 Richard Elling 104
• 105. zfs upgrade ● Display current versions – zfs upgrade ● View available upgrade versions, with features, but don't actually upgrade – zfs upgrade -v ● Upgrade dataset to latest version – zfs upgrade dataset ● Upgrade dataset to specific version – zfs upgrade -V version dataset Once you upgrade, there is no downgrade June 13, 2009 © 2009 Richard Elling 105
  • 106. Sharing June 13, 2009 © 2009 Richard Elling 106
  • 107. Sharing ● zfs share dataset ● Type of sharing set by parameters – shareiscsi = [on | off] – sharenfs = [on | off | options] – sharesmb = [on | off | options] ● Shortcut to manage sharing – Uses external services (nfsd, COMSTAR, etc) – Importing pool will also share ● May vary by OS June 13, 2009 © 2009 Richard Elling 107
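A minimal sketch (dataset and share names illustrative; option syntax varies by OS):
  zfs set sharenfs=on tank/home             # export over NFS
  zfs set sharesmb=name=home tank/home      # export over CIFS with a share name
  zfs share -a                              # share everything marked shareable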
• 108. NFS ● ZFS file systems work as expected – use ACLs based on NFSv4 ACLs ● Parallel NFS, aka pNFS, aka NFSv4.1 – Still a work-in-progress – http://opensolaris.org/os/project/nfsv41/ – zfs create -t pnfsdata mypnfsdata (diagram: a pNFS client talks to a pNFS metadata server and to pNFS data servers, each holding a pnfsdata dataset in its pool) June 13, 2009 © 2009 Richard Elling 108
• 109. CIFS ● UID mapping ● casesensitivity parameter – Good idea, set when file system is created – zfs create -o casesensitivity=insensitive mypool/Shared ● Shadow Copies for Shared Folders (VSS) supported – CIFS clients cannot create shadow copies remotely (yet) CIFS features vary by OS, Samba, etc. June 13, 2009 © 2009 Richard Elling 109
  • 110. iSCSI ● SCSI over IP ● Block-level protocol ● Uses Zvols as storage ● Solaris has 2 iSCSI target implementations – shareiscsi enables old, klunky iSCSI target – To use COMSTAR, enable using itadm(1m) – b116 adds COMSTAR support (zpool version 16) June 13, 2009 © 2009 Richard Elling 110
  • 111. Properties June 13, 2009 © 2009 Richard Elling 111
  • 112. Properties ● Properties are stored in an nvlist ● By default, are inherited ● Some properties are common to all datasets, but a specific dataset type may have additional properties ● Easily set or retrieved via scripts ● In general, properties affect future file system activity zpool get doesn't script as nicely as zfs get June 13, 2009 © 2009 Richard Elling 112
  • 113. User-defined Properties ● Names – Must include colon ':' – Can contain lower case alphanumerics or “+” “.” “_” – Max length = 256 characters – By convention, module:property ● com.sun:auto-snapshot ● Values – Max length = 1024 characters ● Examples – com.sun:auto-snapshot=true – com.richardelling:important_files=true June 13, 2009 © 2009 Richard Elling 113
  • 114. set & get properties ● Set – zfs set compression=on export/home/relling ● Get – zfs get compression export/home/relling ● Reset to inherited value – zfs inherit compression export/home/relling ● Clear user-defined parameter – zfs inherit com.sun:auto-snapshot export/home/relling June 13, 2009 © 2009 Richard Elling 114
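User-defined properties work the same way; the module:property name below is made up for illustration:
  zfs set com.example:backup-policy=daily export/home/relling
  zfs get com.example:backup-policy export/home/relling
  zfs inherit com.example:backup-policy export/home/relling    # clear it again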
• 115. Pool Properties
Property        Change?    Brief Description
altroot                    Alternate root directory (a la chroot)
autoreplace                vdev replacement policy
available       readonly   Available storage space
bootfs                     Default bootable dataset for root pool
cachefile                  Cache file to use other than /etc/zfs/zpool.cache
capacity        readonly   Percent of pool space used
delegation                 Master pool delegation switch
failmode                   Catastrophic pool failure policy
guid            readonly   Unique identifier
health          readonly   Current health of the pool
listsnapshots              zfs list policy
size            readonly   Total size of pool
used            readonly   Amount of space used
version         readonly   Current on-disk version
June 13, 2009 © 2009 Richard Elling 115
• 116. Common Dataset Properties
Property         Change?    Brief Description
available        readonly   Space available to dataset & children
checksum                    Checksum algorithm
compression                 Compression algorithm
compressratio    readonly   Compression ratio – logical size : referenced physical
copies                      Number of copies of user data
creation         readonly   Dataset creation time
origin           readonly   For clones, origin snapshot
primarycache                ARC caching policy
readonly                    Is dataset in readonly mode?
referenced       readonly   Size of data accessible by this dataset
refreservation              Minimum space guaranteed to a dataset, not including descendants (snapshots & clones)
reservation                 Minimum space guaranteed to dataset, including descendants
June 13, 2009 © 2009 Richard Elling 116
• 117. Common Dataset Properties
Property               Change?    Brief Description
secondarycache                    L2ARC caching policy
type                   readonly   Type of dataset (filesystem, snapshot, volume)
used                   readonly   Sum of usedby* (see below)
usedbychildren         readonly   Space used by descendants
usedbydataset          readonly   Space used by dataset
usedbyrefreservation   readonly   Space used by a refreservation for this dataset
usedbysnapshots        readonly   Space used by all snapshots of this dataset
zoned                  readonly   Is dataset added to non-global zone (Solaris)
June 13, 2009 © 2009 Richard Elling 117
• 118. File System Dataset Properties
Property          Change?         Brief Description
aclinherit                        ACL inheritance policy, when files or directories are created
aclmode                           ACL modification policy, when chmod is used
atime                             Disable access time metadata updates
canmount                          Mount policy
casesensitivity   creation        Filename matching algorithm
devices                           Device opening policy for dataset
exec                              File execution policy for dataset
mounted           readonly        Is file system currently mounted?
nbmand            export/import   File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization     creation        Unicode normalization of file names for matching
June 13, 2009 © 2009 Richard Elling 118
• 119. File System Dataset Properties
Property     Change?   Brief Description
quota                  Max space dataset and descendants can consume
recordsize             Suggested maximum block size for files
refquota               Max space dataset can consume, not including descendants
setuid                 setuid mode policy
sharenfs               NFS sharing options
sharesmb               CIFS sharing options
snapdir                Controls whether .zfs directory is hidden
utf8only               UTF-8 character file name policy
vscan                  Virus scan enabled
xattr                  Extended attributes policy
June 13, 2009 © 2009 Richard Elling 119
  • 120. More Goodies... June 13, 2009 © 2009 Richard Elling 120
  • 121. Dataset Space Accounting ● used = usedbydataset + usedbychildren + usedbysnapshots + usedbyrefreservation ● Lazy updates, may not be correct until txg commits ● ls and du will show size of allocated files which includes all copies of a file ● Shorthand report available $ zfs list -o space NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD rpool 126G 18.3G 0 35.5K 0 18.3G rpool/ROOT 126G 15.3G 0 18K 0 15.3G rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0 rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0 rpool/dump 126G 1.00G 0 1.00G 0 0 rpool/export 126G 37K 0 19K 0 18K rpool/export/home 126G 18K 0 18K 0 0 rpool/swap 128G 2G 0 193M 1.81G 0 June 13, 2009 © 2009 Richard Elling 121
  • 122. zfs vs zpool Space Accounting ● zfs list != zpool list ● zfs list shows space used by the dataset plus space for internal accounting ● zpool list shows physical space available to the pool ● For simple pools and mirrors, they are nearly the same ● For RAIDZ or RAIDZ2, zpool list will show space available for parity Users will be confused about reported space available June 13, 2009 © 2009 Richard Elling 122
  • 123. Testing ● ztest ● fstest June 13, 2009 © 2009 Richard Elling 123
  • 124. Accessing Snapshots ● By default, snapshots are accessible in .zfs directory ● Visibility of .zfs directory is tunable via snapdir property – Don't really want find to find the .zfs directory ● Windows CIFS clients can see snapshots as Shadow Copies for Shared Folders (VSS) # zfs snapshot rpool/export/home/relling@20090415 # ls -a /export/home/relling … .Xsession .xsession-errors # ls /export/home/relling/.zfs shares snapshot # ls /export/home/relling/.zfs/snapshot 20090415 # ls /export/home/relling/.zfs/snapshot/20090415 Desktop Documents Downloads Public June 13, 2009 © 2009 Richard Elling 124
  • 125. Resilver & Scrub ● Can be read iops bound ● Resilver can also be bandwidth bound to the resilvering device ● Both work at lower I/O scheduling priority than normal work, but that may not matter for read iops bound devices June 13, 2009 © 2009 Richard Elling 125
• 126. Time-based Resilvering ● Block pointers contain birth txg number ● Resilvering begins with oldest blocks first ● Interrupted resilver will still result in a valid file system view (diagram: a block tree whose blocks are labeled with birth txg 27, 55, 68, and 73) June 13, 2009 © 2009 Richard Elling 126
• 127. Time Slider – Automatic Snapshots ● Underpinnings for Solaris feature similar to OSX's Time Machine ● SMF service for managing snapshots ● SMF properties used to specify policies – Frequency – Number to keep ● Creates cron jobs ● GUI tool makes it easy to select individual file systems
Service Name             Interval (default)   Keep (default)
auto-snapshot:frequent   15 minutes           4
auto-snapshot:hourly     1 hour               24
auto-snapshot:daily      1 day                31
auto-snapshot:weekly     7 days               4
auto-snapshot:monthly    1 month              12
June 13, 2009 © 2009 Richard Elling 127
  • 128. Nautilus ● File system views which can go back in time June 13, 2009 © 2009 Richard Elling 128
  • 129. ACL – Access Control List ● Based on NFSv4 ACLs ● Similar to Windows NT ACLs ● Works well with CIFS services ● Supports ACL inheritance ● Change using chmod ● View using ls June 13, 2009 © 2009 Richard Elling 129
• 130. Checksums ● Block pointer (blkptr_t) reserves 256 bits for the checksum ● Checksum is in the parent, not in the block itself ● Types – none – fletcher2: truncated Fletcher algorithm – fletcher4: full Fletcher algorithm – SHA-256 ● There are open proposals for better algorithms June 13, 2009 © 2009 Richard Elling 130
• 131. Checksum Use
Use          Algorithm             Notes
Uberblock    SHA-256               self-checksummed
Metadata     fletcher4
Labels       SHA-256
Data         fletcher2 (default)   zfs checksum parameter
ZIL log      fletcher2             self-checksummed
Gang block   SHA-256               self-checksummed
June 13, 2009 © 2009 Richard Elling 131
  • 132. Checksum Performance ● Metadata – you won't notice ● Data – LZJB is barely noticeable – gzip-9 can be very noticeable ● Geriatric hardware ??? June 13, 2009 © 2009 Richard Elling 132
  • 133. Compression ● Builtin – lzjb, Lempel-Ziv by Jeff Bonwick – gzip, levels 1-9 ● Extensible – new compressors can be added – backwards compatibility issues ● Uses taskqs to take advantage of multi-processor systems Cannot boot from gzip compressed root (RFE is open) June 13, 2009 © 2009 Richard Elling 133
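For example (dataset names hypothetical; only newly written blocks are compressed):
  zfs set compression=on tank/home         # lzjb
  zfs set compression=gzip-9 tank/archive  # maximum gzip effort
  zfs get compressratio tank/archive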
  • 134. Encryption ● Placeholder – details TBD ● http://opensolaris.org/os/project/zfs-crypto ● Complicated by: – Block pointer rewrites – Deduplication June 13, 2009 © 2009 Richard Elling 134
  • 135. Impedance Matching ● RAID arrays & columns ● Label offsets – Older Solaris starting block = 34 – Newer Solaris starting block = 256 June 13, 2009 © 2009 Richard Elling 135
  • 136. Quotas ● File system quotas – quota includes descendants (snapshots, clones) – refquota does not include descendants ● User and group quotas (b114) – Works like refquota, descendants don't count – Not inherited – zfs userspace and groupspace subcommands show quotas ● Users can only see their own and group quota, but can delegate – Managed via properties ● [user|group]quota@[UID|username|SID name|SID number] ● not visible via zfs get all June 13, 2009 © 2009 Richard Elling 136
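A sketch of the three quota flavors (names and sizes illustrative; user quotas require b114 or later):
  zfs set quota=100g tank/home               # dataset plus descendants
  zfs set refquota=80g tank/home             # dataset only
  zfs set userquota@relling=10g tank/home    # per-user quota
  zfs userspace tank/home                    # report per-user usage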
• 137. zpool.cache ● Old way – mount / – read /etc/[v]fstab – mount file systems ● ZFS – import pool(s) – find mountable datasets and mount them ● /etc/zfs/zpool.cache is a cache of pools to be imported at boot time – No scanning of all available LUNs for pools to import – cachefile property permits selecting an alternate zpool.cache ● Useful for OS installers ● Useful for clusters, where you don't want a booting node to automatically import a pool ● Not persistent (!) June 13, 2009 © 2009 Richard Elling 137
  • 138. Mounting ZFS File Systems ● By default, mountable file systems are mounted when the pool is imported – Controlled by canmount policy (not inherited) ● on – (default) file system is mountable ● off – file system is not mountable – if you want children to be mountable, but not the parent ● noauto – file system must be explicitly mounted (boot environment) ● Can zfs set mountpoint=legacy to use /etc/[v]fstab ● By default, cannot mount on top of non-empty directory – Can override explicitly using zfs mount -O or legacy mountpoint ● Mount properties are persistent, use zfs mount -o for temporary changes Imports are done in parallel, beware of mountpoint races June 13, 2009 © 2009 Richard Elling 138
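A small sketch of the mount-policy knobs (dataset names hypothetical):
  zfs set mountpoint=/export/projects tank/projects
  zfs set canmount=off tank/projects        # children still mountable, parent is not
  zfs mount -o ro tank/projects/docs        # temporary option, not persistent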
  • 139. recordsize ● Dynamic – Max 128 kBytes – Min 512 Bytes – Power of 2 ● For most workloads, don't worry about it ● For fixed size workloads, can set to match workloads – Databases ● File systems or Zvols ● zfs set recordsize=8k dataset June 13, 2009 © 2009 Richard Elling 139
  • 140. Delegated Administration ● Fine grain control – users or groups of users – subcommands, parameters, or sets ● Similar to Solaris' Role Based Access Control (RBAC) ● Enable/disable at the pool level – zpool set delegation=on mypool (default) ● Allow/unallow at the dataset level – zfs allow relling snapshot mypool/relling – zfs allow @backupusers snapshot,send mypool/relling – zfs allow mypool/relling June 13, 2009 © 2009 Richard Elling 140
  • 141. Delegation Inheritance Beware of inheritance ● Local – zfs allow -l relling snapshot mypool ● Local + descendants – zfs allow -d relling mount mypool Make sure permissions are set at the correct level June 13, 2009 © 2009 Richard Elling 141
  • 142. Delegatable Subcommands ● allow ● receive ● clone ● rename ● create ● rollback ● destroy ● send ● groupquota ● share ● groupused ● snapshot ● mount ● userquota ● promote ● userused June 13, 2009 © 2009 Richard Elling 142
  • 143. Delegatable Parameters ● aclinherit ● nbmand ● sharenfs ● aclmode ● normalization ● sharesmb ● atime ● quota ● snapdir ● canmount ● readonly ● userprop ● casesensitivity ● recordsize ● utf8only ● checksum ● refquota ● version ● compression ● refreservation ● volsize ● copies ● reservation ● vscan ● devices ● setuid ● xattr ● exec ● shareiscsi ● zoned ● mountpoint June 13, 2009 © 2009 Richard Elling 143
• 144. Browser User Interface ● Solaris – WebConsole ● Nexenta ● OSX ● OpenStorage June 13, 2009 © 2009 Richard Elling 144
  • 145. Solaris WebConsole June 13, 2009 © 2009 Richard Elling 145
  • 146. Solaris WebConsole June 13, 2009 © 2009 Richard Elling 146
  • 147. Solaris Swap and Dump ● Swap – Solaris does not have automatic swap resizing – Swap as a separate dataset – Swap device is raw, with a refreservation – Blocksize matched to pagesize – Don't really need or want snapshots or clones – Can resize while online, manually ● Dump – Only used during crash dump – Preallocated – No refreservation – Checksum off – Compression off (dumps are already compressed) June 13, 2009 © 2009 Richard Elling 147
  • 148. Performance June 13, 2009 © 2009 Richard Elling 148
  • 149. General Comments ● In general, performs well out of the box ● Standard performance improvement techniques apply ● Lots of DTrace knowledge available ● Typical areas of concern: – ZIL ● check with zilstat, improve with slogs – COW “fragmentation” ● check iostat, improve with L2ARC – Memory consumption ● check with arcstat ● set primarycache property ● can be capped ● can compete with large page aware apps – Compression, or lack thereof June 13, 2009 © 2009 Richard Elling 149
  • 150. ZIL Performance ● Big performance increases demonstrated, especially with SSDs ● NFS servers – 32kByte threshold (zfs_immediate_write_sz) also corresponds to NFSv3 write size ● May cause more work than needed ● See CR6686887 ● Databases – May want different sync policies for logs and data – Current ZIL is pool-wide and enabled for all sync writes – CR6832481 proposes a separate intent log bypass property on a per-dataset basis June 13, 2009 © 2009 Richard Elling 150
  • 151. vdev Cache ● vdev cache occurs at the SPA level – readahead – 10 MBytes per vdev – only caches metadata (as of b70) ● Stats collected as Solaris kstats # kstat -n vdev_cache_stats module: zfs instance: 0 name: vdev_cache_stats class: misc crtime 38.83342625 delegations 14030 hits 105169 misses 59452 snaptime 4564628.18130739 Hit rate = 59%, not bad... June 13, 2009 © 2009 Richard Elling 151
  • 152. Intelligent Prefetching ● Intelligent file-level prefetching occurs at the DMU level ● Feeds the ARC ● In a nutshell, prefetch hits cause more prefetching – Read a block, prefetch a block – If we used the prefetched block, read 2 more blocks – Up to 256 blocks ● Recognizes strided reads – 2 sequential reads of same length and a fixed distance will be coalesced ● Fetches backwards ● Seems to work pretty well, as-is, for most workloads ● Easy to disable in mdb for testing on Solaris – echo zfs_prefetch_disable/W0t1 | mdb -kw June 13, 2009 © 2009 Richard Elling 152
  • 153. I/O Queues ● By default, for devices which can support it, 35 iops are queued to each vdev – Tunable with zfs_vdev_max_pending – echo zfs_vdev_max_pending/W0t10 | mdb -kw ● Implies that more vdevs is better – Consider avoiding RAID array with a single, large LUN ● ZFS I/O scheduler loses control once iops are queued – CR6471212 proposes reserved slots for high-priority iops ● May need to match queues for the entire data path – zfs_vdev_max_pending – Fibre channel, SCSI, SAS, SATA driver – RAID array controller ● Fast disks → small queues, slow disks → larger queues June 13, 2009 © 2009 Richard Elling 153
  • 154. COW Penalty ● COW can negatively affect workloads which have updates and sequential reads – Initial writes will be sequential – Updates (writes) will cause seeks to read data ● Lots of people seem to worry a lot about this ● Only affects HDDs ● Very difficult to speculate about the impact on real-world apps – Large sequential scans of random data hurt anyway – Reads are cached in many places in the data path ● Sysbench benchmark used to test on MySQL w/InnoDB engine – One hour read/write test – select count(*) – repeat, for a week June 13, 2009 © 2009 Richard Elling 154
  • 155. COW Penalty Performance seems to level at about 25% penalty Results compliments of Allan Packer & Neelakanth Nadgir http://blogs.sun.com/realneel/resource/MySQL_Conference_2009_ZFS_MySQL.pdf June 13, 2009 © 2009 Richard Elling 155
• 156. About Disks... ● Disks still the most important performance bottleneck – Modern processors are multi-core – Default checksums and compression are computationally efficient
Disk      Size   RPM      Max Size (GBytes)   Average Rotational Latency (ms)   Average Seek (ms)
HDD       2.5”   5,400    500                 5.5                               11
HDD       3.5”   5,900    2,000               5.1                               16
HDD       3.5”   7,200    1,500               4.2                               8 - 8.5
HDD       2.5”   10,000   300                 3                                 4.2 - 4.6
HDD       2.5”   15,000   146                 2                                 3.2 - 3.5
SSD (w)   2.5”   N/A      73                  0                                 0.02 - 0.15
SSD (r)   2.5”   N/A      500                 0                                 0.02 - 0.15
June 13, 2009 © 2009 Richard Elling 156
• 157. DirectIO ● UFS forcedirectio option brought the early 1980s design of UFS up to the 1990s ● ZFS designed to run on modern multiprocessors ● Databases or applications which manage their data cache may benefit by disabling file system caching ● Expect L2ARC to improve random reads
                      UFS DirectIO   ZFS
Unbuffered I/O        yes            primarycache=metadata or primarycache=none
Concurrency           improved       available at inception
Async I/O code path   improved       available at inception
June 13, 2009 © 2009 Richard Elling 157
• 158. Hybrid Storage Pool (SPA)
                separate log device            Main Pool   L2ARC cache device
Device          Write-optimized device (SSD)   HDD         Read-optimized device (SSD)
Size (GBytes)   < 1 GByte                      large       big
Cost            write iops/$                   size/$      size/$
Performance     low-latency writes             -           low-latency reads
June 13, 2009 © 2009 Richard Elling 158
• 159. Future Plans ● Enhancements announced at the OpenSolaris 2009.06 Town Hall – de-duplication (see also GreenBytes ZFS+) – user quotas (delivered b114) – access-based enumeration – snapshot reference counting – dynamic LUN expansion (delivering b117?) ● Others – mirror to smaller disk (delivered b117) June 13, 2009 © 2009 Richard Elling 159
• 160. It's a wrap! Thank You! Questions? Richard.Elling@RichardElling.com June 13, 2009 © 2009 Richard Elling 160