This document provides an overview of the ZFS file system. It discusses ZFS's design goals of simplifying storage and replacing outdated assumptions. It also covers key aspects of ZFS like its layered architecture, use of copy-on-write, lack of need for filesystem checking, virtual devices (vdevs) including mirroring and striping of storage, and dynamic block allocation.
3. ZFS History
• Announced September 14, 2004
• Integration history
✦ SXCE b27 (November 2005)
✦ FreeBSD (April 2007)
✦ Mac OS X Leopard
✤ Preview shown, but removed from Snow Leopard
✤ Disappointed community re-formed as the zfs-macos Google group (Oct 2009)
✦ OpenSolaris 2008.05
✦ Solaris 10 6/06 (June 2006)
✦ Linux FUSE (summer 2006)
✦ greenBytes ZFS+ (September 2008)
✦ Linux native port funded by the US DOE (2010)
• More than 45 patents, contributed to the CDDL Patents Common
4. ZFS Design Goals
• Figure out why storage has gotten so complicated
• Blow away 20+ years of obsolete assumptions
• Gotta replace UFS
• Design an integrated system from scratch
End the suffering
5. Limits
• 2^48 — Number of entries in any individual directory
• 2^56 — Number of attributes of a file [*]
• 2^56 — Number of files in a directory [*]
• 16 EiB (2^64 bytes) — Maximum size of a file system
• 16 EiB — Maximum size of a single file
• 16 EiB — Maximum size of any attribute
• 2^64 — Number of devices in any pool
• 2^64 — Number of pools in a system
• 2^64 — Number of file systems in a pool
• 2^64 — Number of snapshots of any file system
• 256 ZiB (2^78 bytes) — Maximum size of any pool
[*] actually constrained to 2^48 for the number of files in a ZFS file system
6. Understanding Builds
• Build is often referenced when speaking of feature/bug integration
• Short-hand notation: b###
• Distributions derived from Solaris NV (Nevada)
✦ NexentaStor
✦ Nexenta Core Platform
✦ SmartOS
✦ Solaris 11 (nee OpenSolaris)
✦ OpenIndiana
✦ StormOS
✦ BelleniX
✦ SchilliX
✦ MilaX
• OpenSolaris builds
✦ Binary builds died at b134
✦ Source releases continued through b147
• illumos stepping up to fill void left by OpenSolaris’ demise
7. Community Links
• Community links
✦ nexenta.org
✦ nexentastor.org
✦ freebsd.org
✦ zfsonlinux.org
✦ zfs-fuse.net
✦ groups.google.com/group/zfs-macos
• ZFS Community
✦ hub.opensolaris.org/bin/view/Community+Group+zfs/
• IRC channels at irc.freenode.net
✦ #zfs
14. NexentaStor Rosetta Stone
NexentaStor OpenSolaris/ZFS
Volume Storage pool
ZVol Volume
Folder File system
15. nvlists
• name=value pairs
• libnvpair(3LIB)
• Allows ZFS capabilities to change without changing the
physical on-disk format
• Data stored is XDR encoded
• A good thing, used often
16. Versioning
• Features can be added and identified by nvlist entries
• Changes in pool or dataset versions do not change the physical on-disk format (!)
✦ they do change nvlist parameters
• Older versions can be used
✦ might see warning messages, but harmless
• Available versions and features can be easily viewed
✦ zpool upgrade -v
✦ zfs upgrade -v
• Online references (broken?)
✦ zpool: hub.opensolaris.org/bin/view/Community+Group+zfs/N
✦ zfs: hub.opensolaris.org/bin/view/Community+Group+zfs/N-1
Don't confuse zpool and zfs versions
17. zpool Versions
VER DESCRIPTION
--- ------------------------------------------------
1 Initial ZFS version
2 Ditto blocks (replicated metadata)
3 Hot spares and double parity RAID-Z
4 zpool history
5 Compression using the gzip algorithm
6 bootfs pool property
7 Separate intent log devices
8 Delegated administration
9 refquota and refreservation properties
10 Cache devices
11 Improved scrub performance
12 Snapshot properties
13 snapused property
14 passthrough-x aclinherit support
Continued...
18. More zpool Versions
VER DESCRIPTION
--- ------------------------------------------------
15 user/group space accounting
16 stmf property support
17 Triple-parity RAID-Z
18 snapshot user holds
19 Log device removal
20 Compression using zle (zero-length encoding)
21 Deduplication
22 Received properties
23 Slim ZIL
24 System attributes
25 Improved scrub stats
26 Improved snapshot deletion performance
27 Improved snapshot creation performance
28 Multiple vdev replacements
For Solaris 10, version 21 is “reserved”
19. zfs Versions
VER DESCRIPTION
----------------------------------------------
1 Initial ZFS filesystem version
2 Enhanced directory entries
3 Case insensitive and file system unique identifier (FUID)
4 userquota, groupquota properties
5 System attributes
20. Copy on Write
[Diagram, four panels: 1. Initial block tree; 2. COW some data; 3. COW metadata; 4. Update uberblocks & free]
21. COW Notes
• COW works on blocks, not files
• ZFS reserves 32 MBytes or 1/64 of pool size
✦ COW needs some free space to remove files
✦ need space for ZIL
• For fixed-record-size workloads, “fragmentation” and “poor performance” can occur if the recordsize is not matched
• Spatial distribution is good fodder for performance speculation
✦ affects HDDs
✦ moot for SSDs
22. To fsck or not to fsck
• fsck was created to fix known inconsistencies in file system
metadata
✦ UFS is not transactional
✦ metadata inconsistencies must be reconciled
✦ does NOT repair data – how could it?
• ZFS doesn't need fsck, as-is
✦ all on-disk changes are transactional
✦ COW means previously existing, consistent metadata is not
overwritten
✦ ZFS can repair itself
✤ metadata is at least dual-redundant
✤ data can also be redundant
• Reality check – this does not mean that ZFS is not susceptible to
corruption
✦ nor is any other file system
27. Uberblocks
• Sized based on minimum device block size
• Stored in 128-entry circular queue
• Only one uberblock is active at any time
✦ highest transaction group number
✦ correct SHA-256 checksum
• Stored in machine's native format
✦ A magic number is used to determine endian format when
imported
• Contains pointer to Meta Object Set (MOS)
Device Block Size   Uberblock Size   Queue Entries
512 bytes, 1 KB     1 KB             128
2 KB                2 KB             64
4 KB                4 KB             32
28. About Sizes
• Sizes are dynamic
• LSIZE = logical size
• PSIZE = physical size after compression
• ASIZE = allocated size including:
✦ physical size
✦ raidz parity
✦ gang blocks
Old notions of size reporting confuse people
30. Dynamic Striping
• RAID-0
✦ SNIA definition: fixed-length sequences of virtual disk data
addresses are mapped to sequences of member disk addresses
in a regular rotating pattern
• Dynamic Stripe
✦ Data is dynamically mapped to member disks
✦ No fixed-length sequences
✦ Allocate up to ~1 MByte/vdev before changing vdev
✦ vdevs can be different size
✦ Good combination of the concatenation feature with RAID-0
performance
32. Mirroring
• Straightforward: put N copies of the data on N vdevs
• Unlike RAID-1
✦ No 1:1 mapping at the block level
✦ vdev labels are still at beginning and end
✦ vdevs can be of different size
✤ effective space is that of smallest vdev
• Arbitration: ZFS does not blindly trust either side of mirror
✦ Most recent, correct view of data wins
✦ Checksums validate data
33. Dynamic vdev Replacement
• zpool replace poolname vdev [vdev]
• Today, replacing vdev must be same size or larger
✦ NexentaStor 2 ‒ as measured by blocks
✦ NexentaStor 3 ‒ as measured by metaslabs
• Replacing all vdevs in a top-level vdev with larger vdevs results
in top-level vdev resizing
• Expansion policy controlled by:
✦ NexentaStor 2 ‒ resize on import
✦ NexentaStor 3 ‒ zpool autoexpand property
[Diagram: replacing the vdevs of a 10G mirror step by step with 15G and 20G vdevs; the mirror size follows the smallest vdev: 10G mirror → 10G mirror → 15G mirror → 15G mirror → 20G mirror]
34. RAIDZ
• RAID-5
✦ Parity check data is distributed across the RAID array's disks
✦ Must read/modify/write when data is smaller than stripe width
• RAIDZ
✦ Dynamic data placement
✦ Parity added as needed
✦ Writes are full-stripe writes
✦ No read/modify/write (write hole)
• Arbitration: ZFS does not blindly trust any device
✦ Does not rely on disk reporting read error
✦ Checksums validate data
✦ If checksum fails, read parity
Space used is dependent on how used
36. RAIDZ and Block Size
If block size >> N * sector size, space consumption is like RAID-5
If block size = sector size, space consumption is like mirroring
[Diagram: RAIDZ allocation across five disks (DiskA–DiskE, 512-byte sectors) for blocks of PSIZE 2 KB (ASIZE 2.5 KB), 1 KB (ASIZE 1.5 KB), 3 KB (ASIZE 4 KB + gap), and 512 bytes (ASIZE 1 KB); parity sectors (P) are added per block, and odd remainders leave gaps]
Sector size can impact space savings
37. RAID-5 Write Hole
• Occurs when data to be written is smaller than stripe size
• Must read unallocated columns to recalculate the parity or the
parity must be read/modify/write
• Read/modify/write is risky for consistency
✦ Multiple disks
✦ Reading independently
✦ Writing independently
✦ System failure before all writes are complete to media could
result in data loss
• Effects can be hidden from host using RAID array with
nonvolatile write cache, but extra I/O cannot be hidden from
disks
38. RAIDZ2 and RAIDZ3
• RAIDZ2 = double parity RAIDZ
• RAIDZ3 = triple parity RAIDZ
• Sorta like RAID-6
✦ Parity 1: XOR
✦ Parity 2: another Reed-Solomon syndrome
✦ Parity 3: yet another Reed-Solomon syndrome
• Arbitration: ZFS does not blindly trust any device
✦ Does not rely on disk reporting read error
✦ Checksums validate data
✦ If data not valid, read parity
✦ If data still not valid, read other parity
Space used is dependent on how used
39. Evaluating Data Retention
• MTTDL = Mean Time To Data Loss
• Note: MTBF is not constant in the real world, but keeps math
simple
• MTTDL[1] is a simple MTTDL model
• No parity (single vdev, striping, RAID-0)
✦ MTTDL[1] = MTBF / N
• Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
✦ MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR)
• Double Parity (3-way mirror, RAIDZ2, RAID-6)
✦ MTTDL[1] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)
• Triple Parity (4-way mirror, RAIDZ3)
✦ MTTDL[1] = MTBF^4 / (N * (N-1) * (N-2) * (N-3) * MTTR^3)
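A worked example (illustrative MTBF and MTTR values, not from the tutorial): a 2-way mirror (N = 2) with MTBF = 10^6 hours and MTTR = 48 hours gives
MTTDL[1] = (10^6)^2 / (2 * 1 * 48) ≈ 1.04 * 10^10 hours ≈ 1.2 million years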
40. Another MTTDL Model
• MTTDL[1] model doesn't take into account unrecoverable reads
• But unrecoverable reads (UER) are becoming the dominant
failure mode
✦ UER specified as errors per bits read
✦ More bits = higher probability of loss per vdev
• MTTDL[2] model considers UER
41. Why Worry about UER?
• Richard's study
✦ 3,684 hosts with 12,204 LUNs
✦ 11.5% of all LUNs reported read errors
• Bairavasundaram et al., FAST08
www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
✦ 1.53M LUNs over 41 months
✦ RAID reconstruction discovers 8% of checksum mismatches
✦ “For some drive models as many as 4% of drives develop
checksum mismatches during the 17 months examined”
43. Why Worry about UER?
• RAID array study
[Chart: unrecoverable reads vs. disks that disappeared (“disk pull”)]
“Disk pull” tests aren’t very useful
44. MTTDL[2] Model
• Probability that a reconstruction will fail
✦ Precon_fail = (N-1) * size / UER
• Model doesn't work for non-parity schemes
✦ single vdev, striping, RAID-0
• Single Parity (mirror, RAIDZ, RAID-1, RAID-5)
✦ MTTDL[2] = MTBF / (N * Precon_fail)
• Double Parity (3-way mirror, RAIDZ2, RAID-6)
✦ MTTDL[2] = MTBF^2 / (N * (N-1) * MTTR * Precon_fail)
• Triple Parity (4-way mirror, RAIDZ3)
✦ MTTDL[2] = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2 * Precon_fail)
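A worked example (illustrative values, not from the tutorial): a 2-way mirror of 2 TB disks (N = 2, size ≈ 1.6 * 10^13 bits) with a UER of 1 error per 10^15 bits read gives
Precon_fail = (2-1) * 1.6 * 10^13 / 10^15 = 0.016
MTTDL[2] = 10^6 / (2 * 0.016) ≈ 3.1 * 10^7 hours ≈ 3,600 years, far less than the MTTDL[1] estimate for the same configuration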
50. Dependability Use Case
• Customer has 15+ TB of read-mostly data
• 16-slot, 3.5” drive chassis
• 2 TB HDDs
• Option 1: one raidz2 set
✦ 24 TB available space
✤ 12 data
✤ 2 parity
✤ 2 hot spares, 48 hour disk replacement time
✦ MTTDL[1] = 1,790,000 years
• Option 2: two raidz2 sets
✦ 24 TB available space total (12 TB in each set)
✤ 6 data
✤ 2 parity
✤ no hot spares
✦ MTTDL[1] = 7,450,000 years
51. Ditto Blocks
• Recall that each blkptr_t contains 3 DVAs
• Dataset property used to indicate how many copies (aka ditto blocks) of data are desired
✦ Write all copies
✦ Read any copy
✦ Recover corrupted read from a copy
• Not a replacement for mirroring
✦ For single disk, can handle data loss on approximately 1/8
contiguous space
• Easier to describe in pictures...
copies parameter Data copies Metadata copies
copies=1 (default) 1 2
copies=2 2 3
copies=3 3 3
54. When Good Data Goes Bad
[Diagram, three panels: the file system does a bad read but cannot tell; if it’s a metadata block, the FS panics and the disk rebuilds; or we get back bad data]
55. Checksum Verification
ZFS verifies checksums for every read
Repairs data when possible (mirror, raidz, copies>1)
[Diagram, three panels: read bad data → read good data → repair bad data]
57. ZIO Framework
• All physical disk I/O goes through ZIO Framework
• Translates DVAs into Logical Block Addresses (LBAs) on leaf vdevs
✦ Keeps free space maps (spacemap)
✦ If contiguous space is not available:
✤ Allocate smaller blocks (the gang)
✤ Allocate gang block, pointing to the gang
• Implemented as multi-stage pipeline
✦ Allows extensions to be added fairly easily
• Handles I/O errors
58. ZIO Write Pipeline
ZIO write pipeline (stages span ZIO state, compression, checksum, DVA, and vdev I/O):
open → compress if savings > 12.5% → checksum generate → DVA allocate → vdev I/O start → vdev I/O done → vdev I/O assess → done
Gang and deduplication activity elided, for clarity
59. ZIO Read Pipeline
ZIO read pipeline (stages span ZIO state, compression, checksum, DVA, and vdev I/O):
open → vdev I/O start → vdev I/O done → vdev I/O assess → checksum verify → decompress → done
Gang and deduplication activity elided, for clarity
60. VDEV – Virtual Device Subsystem
• Where mirrors, RAIDZ, and RAIDZ2 are implemented
✦ Surprisingly few lines of code needed to implement RAID
• Leaf vdev (physical device) I/O management
✦ Number of outstanding iops
✦ Read-ahead cache
• Priority scheduling:
Name         Priority
NOW          0
SYNC_READ    0
SYNC_WRITE   0
FREE         0
CACHE_FILL   0
LOG_WRITE    0
ASYNC_READ   4
ASYNC_WRITE  4
RESILVER     10
SCRUB        20
62. Object Cache
• UFS uses page cache managed by the virtual memory system
• ZFS does not use the page cache, except for mmap'ed files
• ZFS uses an Adaptive Replacement Cache (ARC)
• ARC used by DMU to cache DVA data objects
• Only one ARC per system, but caching policy can be changed
on a per-dataset basis
• Seems to work much better than page cache ever did for UFS
63. Traditional Cache
• Works well when data being accessed was recently added
• Doesn't work so well when frequently accessed data is evicted
• Misses cause an insert at the MRU end; the oldest entry is evicted from the LRU end
• Dynamic caches can change size by either not evicting or aggressively evicting
64. ARC – Adaptive Replacement Cache
• Recent Cache: a miss causes an insert at the MRU end; the oldest single-use entry is evicted from the LRU end
• Frequent Cache: a hit moves the entry toward the MFU end; the oldest multiply-accessed entry is evicted from the LFU end
• Evictions and dynamic resizing need to choose the best cache to evict from (shrink)
65. ARC with Locked Pages
• Same structure as the ARC, but locked pages cannot be evicted
• An entry moves to the Frequent Cache if a hit occurs within 62 ms
• ZFS ARC handles mixed-size pages
66. L2ARC – Level 2 ARC
• Data soon to be evicted from the ARC is added to a queue to be sent to the cache vdev
✦ Another thread sends the queue to the cache vdev
✦ Data is copied to the cache vdev with a throttle to limit bandwidth consumption
✦ Under heavy memory pressure, not all evictions will arrive in the cache vdev
• ARC directory remains in memory
• Good idea – optimize cache vdev for fast reads
✦ lower latency than pool disks
✦ inexpensive way to “increase memory”
• Content considered volatile, no RAID needed
• Monitor usage with zpool iostat and ARC kstats
67. ARC Directory
• Each ARC directory entry contains arc_buf_hdr structs
✦ Info about the entry
✦ Pointer to the entry
• Directory entries are ~200 bytes each
• ZFS block size is dynamic, sector size to 128 kBytes
• Disks are large
• Suppose we use a Seagate LP 2 TByte disk for the L2ARC
✦ Disk has 3,907,029,168 512 byte sectors, guaranteed
✦ Workload uses 8 kByte fixed record size
✦ RAM needed for arc_buf_hdr entries
✤ Need = (3,907,029,168 - 9,232) * 200 / 16 = ~48 GBytes
• Don't underestimate the RAM needed for large L2ARCs
68. ARC Tips
• In general, it seems to work well for most workloads
• ARC size will vary, based on usage
✦ Default target max is 7/8 of physical memory or (memory - 1
GByte)
✦ Target min is 64 MB
✦ Metadata capped at 1/4 of max ARC size
• Dynamic size can be reduced when:
✦ page scanner is running
✤ freemem < lotsfree + needfree + desfree
✦ swapfs does not have enough space so that anonymous
reservations can succeed
✤ availrmem < swapfs_minfree + swapfs_reserve + desfree
✦ [x86 only] kernel heap space more than 75% full
• Can limit at boot time
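For example, the ARC can be capped via /etc/system on Solaris (the 4 GByte value here is illustrative):
set zfs:zfs_arc_max = 0x100000000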
69. Observing ARC
• ARC statistics stored in kstats
• kstat -n arcstats
• Interesting statistics:
✦ size = current ARC size
✦ p = size of MFU cache
✦ c = target ARC size
✦ c_max = maximum target ARC size
✦ c_min = minimum target ARC size
✦ l2_hdr_size = space used in ARC by L2ARC
✦ l2_size = size of data in L2ARC
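Individual statistics can also be queried in script-friendly form, for example:
# kstat -p zfs:0:arcstats:size
# kstat -p zfs:0:arcstats:c_max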
71. More ARC Tips
• Performance
✦ Prior to b107, L2ARC fill rate was limited to 8 MB/sec
✦ After b107, cold L2ARC fill rate increases to 16 MB/sec
• Internals tracked by kstats in Solaris
✦ Use memory_throttle_count to observe pressure to
evict
• Dedup Table (DDT) also uses ARC
✦ lots of dedup objects need lots of RAM
✦ field reports that L2ARC can help with dedup
L2ARC keeps its directory in kernel memory
74. Transaction Engine
• Manages physical I/O
• Transactions grouped into transaction group (txg)
✦ txg updates
✦ All-or-nothing
✦ Commit interval
✤ Older versions: 5 seconds
✤ Less old versions: 30 seconds
✤ b143 and later: 5 seconds
• Delay committing data to physical storage
✦ Improves performance
✦ A bad thing for sync workload performance – hence the ZFS
Intent Log (ZIL)
30 second delay can impact failure detection time
75. ZIL – ZFS Intent Log
• DMU is transactional, and likes to group I/O into transactions
for later commits, but still needs to handle “write it now”
desire of sync writers
✦ NFS
✦ Databases
• ZIL recordsize inflation can occur for some workloads
✦ May cause larger than expected actual I/O for sync workloads
✦ Oracle redo logs
✦ No slog: can tune zfs_immediate_write_sz,
zvol_immediate_write_sz
✦ With slog: use logbias property instead
• Never read, except at import (e.g. reboot), when transactions may need to be rolled forward
76. Separate Logs (slogs)
• ZIL competes with pool for IOPS
✦ Applications wait for sync writes to be on nonvolatile media
✦ Very noticeable on HDD JBODs
• Put ZIL on separate vdev, outside of pool
✦ ZIL writes tend to be sequential
✦ No competition with pool for IOPS
✦ Downside: slog device required to be operational at import
✦ NexentaStor 3 allows slog device removal
✦ Size of separate log < size of RAM (duh)
• 10x or more performance improvements possible
✦ Nonvolatile RAM card
✦ Write-optimized SSD
✦ Nonvolatile write cache on RAID array
77. zilstat
• http://www.richardelling.com/Home/scripts-and-programs-1/
zilstat
• Integrated into NexentaStor 3.0.3
✦ nmc: show performance zil
78. Synchronous Write Destination
Without separate log:
Sync I/O size > zfs_immediate_write_sz?   ZIL Destination
no                                        ZIL log
yes                                       bypass to pool
With separate log:
logbias?            ZIL Destination
latency (default)   log device
throughput          bypass to pool
Default zfs_immediate_write_sz = 32 kBytes
79. ZIL Synchronicity Project
• All-or-nothing policies don’t work well, in general
• ZIL Synchronicity project proposed by Robert Milkowski
✦ http://milek.blogspot.com
• Adds new sync property to datasets
• Arrived in b140
sync Parameter       Behaviour
standard (default)   Policy follows previous design: immediate write size and separate logs
always               All writes become synchronous (slow)
disabled             Synchronous write requests are ignored
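For example, forcing synchronous semantics on a dataset (dataset name illustrative):
# zfs set sync=always tank/db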
80. Disabling the ZIL
• Preferred method: change dataset sync property
• Rule 0: Don’t disable the ZIL
• If you love your data, do not disable the ZIL
• You can find references to this as a way to speed up ZFS
✦ NFS workloads
✦ “tar -x” benchmarks
• Golden Rule: Don’t disable the ZIL
• Can set via mdb, but need to remount the file system
• Friends don’t let friends disable the ZIL
• Older Solaris - can set in /etc/system
• NexentaStor has checkbox for disabling ZIL
• Nostradamus wrote, “disabling the ZIL will lead to the
apocalypse”
82. Dataset & Snapshot Layer
• Object
✦ Allocated storage
✦ dnode describes collection of blocks
• Object Set
✦ Group of related objects
• Dataset
✦ Object set
✦ Snapmap: snapshot relationships
✦ Space usage
• Dataset directory
✦ Childmap: dataset relationships
✦ Properties
84. zfs snapshot
• Create a read-only, point-in-time window into the dataset (file
system or Zvol)
• Computationally free, because of COW architecture
• Very handy feature
✦ Patching/upgrades
• Basis for time-related snapshot interfaces
✦ Solaris Time Slider
✦ NexentaStor Delorean Plugin
✦ NexentaStor Virtual Machine Data Center
85. Snapshot
• Create a snapshot by not freeing COWed blocks
• Snapshot creation is fast and easy
• Number of snapshots determined by use – no hardwired limit
• Recursive snapshots also possible
[Diagram: snapshot tree and current tree share unchanged blocks; each has its own root]
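For example (dataset and snapshot names illustrative; -r snapshots a dataset and all of its descendants):
# zfs snapshot tank/home@friday
# zfs snapshot -r tank@friday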
87. Clones
• Snapshots are read-only
• Clones are read-write based upon a snapshot
• Child depends on parent
✦ Cannot destroy parent without destroying all children
✦ Can promote children to be parents
• Good ideas
✦ OS upgrades
✦ Change control
✦ Replication
✤ zones
✤ virtual disks
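For example, creating a clone and later promoting it to be the parent (names illustrative):
# zfs clone tank/ws@snap tank/ws-dev
# zfs promote tank/ws-dev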
88. zfs clone
• Create a read-write file system from a read-only snapshot
• Solaris boot environment administration
[Diagram: install OS rev1 and checkpoint it as snapshot rootfs-nmu-001; clone the snapshot and patch/upgrade the clone to rootfs-nmu-002; the grub boot manager selects which boot environment to boot]
Origin snapshot cannot be destroyed, if clone exists
90. What is Deduplication?
• A $2.1 Billion feature
• 2009 buzzword of the year
• Technique for improving storage space efficiency
✦ Trades big I/Os for small I/Os
✦ Does not eliminate I/O
• Implementation styles
✦ offline or post processing
✤ data written to nonvolatile storage
✤ process comes along later and dedupes data
✤ example: tape archive dedup
✦ inline
✤ data is deduped as it is being allocated to nonvolatile storage
✤ example: ZFS
91. Dedup how-to
• Given a bunch of data
• Find data that is duplicated
• Build a lookup table of references to data
• Replace duplicate data with a pointer to the entry in the
lookup table
• Granularity
✦ file
✦ block
✦ byte
92. Dedup in ZFS
• Leverage block-level checksums
✦ Identify blocks which might be duplicates
✦ Variable block size is ok
• Synchronous implementation
✦ Data is deduped as it is being written
• Scalable design
✦ No reference count limits
• Works with existing features
✦ compression
✦ copies
✦ scrub
✦ resilver
• Implemented in ZIO pipeline
93. Deduplication Table (DDT)
• Internal implementation
✦ Adelson-Velskii, Landis (AVL) tree
✦ Typical table entry ~270 bytes
✤ checksum
✤ logical size
✤ physical size
✤ references
✦ Table entry size increases as the number of references
increases
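The DDT can be inspected with zdb, for example as a histogram (pool name illustrative):
# zdb -DD zwimming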
95. Reference Counts
• Problem: loss of the referenced data affects all referrers
• Solution: make additional copies of referred data based upon a
threshold count of referrers
✦ leverage copies (ditto blocks)
✦ pool-level threshold for automatically adding ditto copies
✤ set via dedupditto pool property
# zpool set dedupditto=50 zwimming
✤ add 2nd copy when dedupditto references (50) reached
✤ add 3rd copy when dedupditto^2 references (2500) reached
96. Verification
write() → compress → checksum → DDT entry lookup:
✦ no DDT match → write new entry
✦ DDT match, verify? = no → add reference
✦ DDT match, verify? = yes → read data:
✤ data matches → add reference
✤ data does not match → write new entry
97. Enabling Dedup
• Set dedup property for each dataset to be deduped
• Remember: properties are inherited
• Remember: only applies to newly written data
dedup setting   checksum   verify?
on              SHA256     no
sha256          SHA256     no
on,verify       SHA256     yes
sha256,verify   SHA256     yes
Fletcher is considered too weak, without verify
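For example (dataset name illustrative; the property is inherited by descendants):
# zfs set dedup=on tank/vm
# zfs set dedup=sha256,verify tank/vm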
98. Dedup Accounting
• ...and you thought compression accounting was hard...
• Remember: dedup works at pool level
✦ dataset-level accounting doesn’t see other datasets
✦ pool-level accounting is always correct
zfs list
NAME USED AVAIL REFER MOUNTPOINT
bar 7.56G 449G 22K /bar
bar/ws 7.56G 449G 7.56G /bar/ws
dozer 7.60G 455G 22K /dozer
dozer/ws 7.56G 455G 7.56G /dozer/ws
tank 4.31G 456G 22K /tank
tank/ws 4.27G 456G 4.27G /tank/ws
zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
bar 464G 7.56G 456G 1% 1.00x ONLINE -
dozer 464G 1.43G 463G 0% 5.92x ONLINE -
tank 464G 957M 463G 0% 5.39x ONLINE -
Data courtesy of the ZFS team
101. Over-the-wire Dedup
• Dedup is also possible over the send/receive pipe
✦ Blocks with the same checksum are considered duplicates (no verify option)
✦ First copy sent as usual
✦ Subsequent copies sent by reference
• Independent of dedup status of originating pool
✦ Receiving pool knows about blocks which have already arrived
• Can be a win for dedupable data, especially over slow wires
• Remember: send/receive version rules still apply
# zfs send -DR zwimming/stuff
102. Dedup Performance
• Dedup can save space and bandwidth
• Dedup increases latency
✦ Caching data improves latency
✦ More memory → more data cached
✦ Cache performance hierarchy
✤ RAM: fastest
✤ L2ARC on SSD: slower
✤ Pool HDD: dreadfully slow
• ARC is currently not deduped
• Difficult to predict
✦ Dependent variable: number of blocks
✦ Estimate 270 bytes per unique block
✦ Example:
✤ 50M blocks * 270 bytes/block = 13.5 GBytes
103. Deduplication Use Cases
Data type Dedupe Compression
Home directories ✔✔ ✔✔
Internet content ✔ ✔
Media and video ✔✔ ✔
Life sciences ✘ ✔✔
Oil and Gas (seismic) ✘ ✔✔
Virtual machines ✔✔ ✘
Archive ✔✔✔✔ ✔
106. zpool create
• zpool create poolname vdev-configuration
• nmc: setup volume create
✦ vdev-configuration examples
✤ mirror c0t0d0 c3t6d0
✤ mirror c0t0d0 c3t6d0 mirror c4t0d0 c0t1d6
✤ mirror disk1s0 disk2s0 cache disk4s0 log disk5
✤ raidz c0d0s1 c0d1s1 c1d2s0 spare c1d3s0
• Solaris
✦ Additional checks for disk/slice overlaps or in use
✦ Whole disks are given EFI labels
• Can set initial pool or dataset properties
• By default, creates a file system with the same name
✦ poolname pool → /poolname file system
People get confused by a file system with same name as the pool
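A complete example, creating a mirrored pool with a separate log device (pool and device names illustrative):
# zpool create tank mirror c0t0d0 c3t6d0 log c4t0d0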
107. zpool destroy
• Destroy the pool and all datasets therein
• zpool destroy poolname
✦ Can (try to) force with “-f”
✦ There is no “are you sure?” prompt – if you weren't sure, you
would not have typed “destroy”
• nmc: destroy volume volumename
✦ nmc prompts for confirmation, by default
zpool destroy is destructive... really! Use with caution!
108. zpool add
• Adds a device to the pool as a top-level vdev
• Does NOT add columns to a raidz set
• Does NOT attach a mirror – use zpool attach instead
• zpool add poolname vdev-configuration
✦ vdev-configuration can be any combination also used for zpool
create
✦ Complains if the added vdev-configuration would cause a
different data protection scheme than is already in use
✤ use “-f” to override
✦ Good idea: try with “-n” flag first
✤ will show final configuration without actually performing the add
• nmc: setup volume volumename grow
Do not add a device which is in use as a cluster quorum device
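For example, a dry run before growing a pool with another mirror (names illustrative):
# zpool add -n tank mirror c5t0d0 c5t1d0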
109. zpool remove
• Remove a top-level vdev from the pool
• zpool remove poolname vdev
• nmc: setup volume volumename remove-lun
• Today, you can only remove the following vdevs:
✦ cache
✦ hot spare
✦ separate log (b124, NexentaStor 3.0)
Don't confuse “remove” with “detach”
110. zpool attach
• Attach a vdev as a mirror to an existing vdev
• zpool attach poolname existing-vdev vdev
• nmc: setup volume volumename attach-lun
• Attaching vdev must be the same size or larger than the
existing vdev
vdev Configurations
ok simple vdev → mirror
ok mirror
ok log → mirrored log
no RAIDZ
no RAIDZ2
no RAIDZ3
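For example, converting a simple vdev into a mirror (names illustrative):
# zpool attach tank c0t0d0 c0t1d0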
111. zpool detach
• Detach a vdev from a mirror
• zpool detach poolname vdev
• nmc: setup volume volumename detach-lun
• A resilvering vdev will wait until resilvering is complete
112. zpool replace
• Replaces an existing vdev with a new vdev
• zpool replace poolname existing-vdev vdev
• nmc: setup volume volumename replace-lun
• Effectively, a shorthand for “zpool attach” followed by “zpool
detach”
• Attaching vdev must be the same size or larger than the
existing vdev
• Works for any top-level vdev-configuration, including RAIDZ
“Same size” literally means the same number of blocks until b117.
Many “same size” disks have different number of available blocks.
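For example (names illustrative):
# zpool replace tank c0t2d0 c4t2d0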
113. zpool import
• Import a pool and mount all mountable datasets
• Import a specific pool
✦ zpool import poolname
✦ zpool import GUID
✦ nmc: setup volume import
• Scan LUNs for pools which may be imported
✦ zpool import
• Can set options, such as alternate root directory or other
properties
✦ alternate root directory important for rpool or syspool
Beware of zpool.cache interactions
Beware of artifacts, especially partial artifacts
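For example, importing with an alternate root directory (pool name illustrative):
# zpool import -R /a rpool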
114. zpool export
• Unmount datasets and export the pool
• zpool export poolname
• nmc: setup volume volumename export
• Removes pool entry from zpool.cache
✦ useful when unimported pools remain in zpool.cache
115. zpool upgrade
• Display current versions
✦ zpool upgrade
• View available upgrade versions, with features, but don't
actually upgrade
✦ zpool upgrade -v
• Upgrade pool to latest version
✦ zpool upgrade poolname
✦ nmc: setup volume volumename version-
upgrade
• Upgrade pool to specific version
✦ zpool upgrade -V version poolname
Once you upgrade, there is no downgrade
Beware of grub and rollback issues
117. zpool history
• Show history of changes made to the pool
• nmc and Solaris use same command
# zpool history rpool
History for 'rpool':
2009-03-04.07:29:46 zpool create -f -o failmode=continue -R /a -m legacy -o
cachefile=/tmp/root/etc/zfs/zpool.cache rpool c0t0d0s0
2009-03-04.07:29:47 zfs set canmount=noauto rpool
2009-03-04.07:29:47 zfs set mountpoint=/rpool rpool
2009-03-04.07:29:47 zfs create -o mountpoint=legacy rpool/ROOT
2009-03-04.07:29:48 zfs create -b 4096 -V 2048m rpool/swap
2009-03-04.07:29:48 zfs create -b 131072 -V 1024m rpool/dump
2009-03-04.07:29:49 zfs create -o canmount=noauto rpool/ROOT/snv_106
2009-03-04.07:29:50 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.07:29:50 zfs set mountpoint=/ rpool/ROOT/snv_106
2009-03-04.07:29:51 zfs set canmount=on rpool
2009-03-04.07:29:51 zfs create -o mountpoint=/export rpool/export
2009-03-04.07:29:51 zfs create rpool/export/home
2009-03-04.00:21:42 zpool import -f -R /a 17111649328928073943
2009-03-04.00:21:42 zpool export rpool
2009-03-04.08:47:08 zpool set bootfs=rpool rpool
2009-03-04.08:47:08 zpool set bootfs=rpool/ROOT/snv_106 rpool
2009-03-04.08:47:12 zfs snapshot rpool/ROOT/snv_106@snv_b108
2009-03-04.08:47:12 zfs clone rpool/ROOT/snv_106@snv_b108 rpool/ROOT/snv_b108
...
118. zpool status
• Shows the status of the current pools, including their
configuration
• Important troubleshooting step
• nmc and Solaris use same command
# zpool status
…
pool: zwimming
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
zwimming ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t2d0s0 ONLINE 0 0 0
c0t0d0s7 ONLINE 0 0 0
errors: No known data errors
Understanding status output error messages can be tricky
120. zpool iostat
• Show pool physical I/O activity, in an iostat-like manner
• Solaris: fsstat will show I/O activity looking into a ZFS file
system
• Especially useful for showing slog activity
• nmc and Solaris use same command
# zpool iostat -v
capacity operations bandwidth
pool used avail read write read write
------------ ----- ----- ----- ----- ----- -----
rpool 16.5G 131G 0 0 1.16K 2.80K
c0t0d0s0 16.5G 131G 0 0 1.16K 2.80K
------------ ----- ----- ----- ----- ----- -----
zwimming 135G 14.4G 0 5 2.09K 27.3K
mirror 135G 14.4G 0 5 2.09K 27.3K
c0t2d0s0 - - 0 3 1.25K 27.5K
c0t0d0s7 - - 0 2 1.27K 27.5K
------------ ----- ----- ----- ----- ----- -----
Unlike iostat, does not show latency
121. zpool scrub
• Manually starts scrub
✦ zpool scrub poolname
• Scrubbing performed in background
• Use zpool status to track scrub progress
• Stop scrub
✦ zpool scrub -s poolname
• How often to scrub?
✦ Depends on level of paranoia
✦ Once per month seems reasonable
✦ After a repair or recovery procedure
• NexentaStor auto-scrub features easily manages scrubs and
schedules
Estimated scrub completion time improves over time
125. zfs create, destroy
• By default, a file system with the same name as the pool is
created by zpool create
• Dataset name format is: pool/name[/name ...]
• File system / folder
✦ zfs create dataset-name
✦ nmc: create folder
✦ zfs destroy dataset-name
✦ nmc: destroy folder
• Zvol
✦ zfs create -V size dataset-name
✦ nmc: create zvol
✦ zfs destroy dataset-name
✦ nmc: destroy zvol
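Examples (dataset names and the Zvol size are illustrative):
# zfs create tank/home/relling
# zfs create -V 10g tank/vol1
# zfs destroy tank/vol1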
126. zfs mount, unmount
• Note: mount point is a file system parameter
✦ zfs get mountpoint fs-name
• Rarely used subcommand (!)
• Display mounted file systems
✦ zfs mount
• Mount a file system
✦ zfs mount fs-name
✦ zfs mount -a
• Unmount (not umount)
✦ zfs unmount fs-name
✦ zfs unmount -a
127. zfs list
• List mounted datasets
• NexentaStor 2: listed everything
• NexentaStor 3: does not list snapshots
✦ See zpool listsnapshots property
• Examples
✦ zfs list
✦ zfs list -t snapshot
✦ zfs list -H -o name
128. Replication Services
[Diagram: recovery point objective vs. system I/O performance – traditional backup (NDMP, Auto-Tier, rsync) offers recovery points measured in days; Auto-Sync (ZFS send/receive) in hours; Auto-CDP (AVS/SNDR mirror replication) and application-level replication in seconds; tighter recovery points require faster system I/O]
129. zfs send, receive
• Send
✦ send a snapshot to stdout
✦ data is decompressed
• Receive
✦ receive a snapshot from stdin
✦ receiving file system parameters apply (compression, etc.)
• Can incrementally send snapshots in time order
• Handy way to replicate dataset snapshots
• NexentaStor
✦ simplifies management
✦ manages snapshots and send/receive to remote systems
• Only method for replicating dataset properties, except quotas
• NOT a replacement for traditional backup solutions
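For example, a full send followed by an incremental send to a remote host (names illustrative):
# zfs send tank/ws@mon | ssh host zfs receive backup/ws
# zfs send -i @mon tank/ws@tue | ssh host zfs receive backup/ws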
131. zfs upgrade
• Display current versions
✦ zfs upgrade
• View available upgrade versions, with features, but don't
actually upgrade
✦ zfs upgrade -v
• Upgrade dataset to latest version
✦ zfs upgrade dataset
• Upgrade dataset to specific version
✦ zfs upgrade -V version dataset
• NexentaStor: not needed until 3.0
You can upgrade, there is no downgrade
Beware of grub and rollback issues
135. Properties
• Properties are stored in an nvlist
• By default, are inherited
• Some properties are common to all datasets, but a specific
dataset type may have additional properties
• Easily set or retrieved via scripts
• In general, properties affect future file system activity
zpool get doesn't script as nicely as zfs get
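For example, retrieving a single property value in script-friendly form (dataset name illustrative):
# zfs get -H -o value compression tank/ws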
136. Getting Properties
• zpool get all poolname
• nmc: show volume volumename property
propertyname
• zpool get propertyname poolname
• zfs get all dataset-name
• nmc: show folder foldername property
• nmc: show zvol zvolname property
140. Pool Properties
Property     Change?   Brief Description
altroot                Alternate root directory (ala chroot)
autoexpand             Policy for expanding when vdev size changes
autoreplace            vdev replacement policy
available    readonly  Available storage space
bootfs                 Default bootable dataset for root pool
cachefile              Cache file to use other than /etc/zfs/zpool.cache
capacity     readonly  Percent of pool space used
delegation             Master pool delegation switch
failmode               Catastrophic pool failure policy
141. More Pool Properties
Property Change? Brief Description
guid readonly Unique identifier
health readonly Current health of the pool
listsnapshots zfs list policy
size readonly Total size of pool
used readonly Amount of space used
version readonly Current on-disk version
142. Common Dataset Properties
Property       Change?   Brief Description
available      readonly  Space available to dataset & children
checksum                 Checksum algorithm
compression              Compression algorithm
compressratio  readonly  Compression ratio – logical size : referenced physical size
copies                   Number of copies of user data
creation       readonly  Dataset creation time
dedup                    Deduplication policy
logbias                  Separate log write policy
mlslabel                 Multilayer security label
origin         readonly  For clones, origin snapshot
143. More Common Dataset Properties
Property        Change?   Brief Description
primarycache              ARC caching policy
readonly                  Is dataset in readonly mode?
referenced      readonly  Size of data accessible by this dataset
refreservation            Minimum space guaranteed to a dataset, excluding descendants (snapshots & clones)
reservation               Minimum space guaranteed to dataset, including descendants
secondarycache            L2ARC caching policy
sync                      Synchronous write policy
type            readonly  Type of dataset (filesystem, snapshot, volume)
144. More Common Dataset Properties
Property              Change?   Brief Description
used                  readonly  Sum of usedby* (see below)
usedbychildren        readonly  Space used by descendants
usedbydataset         readonly  Space used by dataset
usedbyrefreservation  readonly  Space used by a refreservation for this dataset
usedbysnapshots       readonly  Space used by all snapshots of this dataset
zoned                 readonly  Is dataset added to non-global zone (Solaris)
145. Volume Dataset Properties
Property      Change?   Brief Description
shareiscsi              iSCSI service (not COMSTAR)
volblocksize  creation  Fixed block size
volsize                 Implicit quota
zoned         readonly  Set if dataset delegated to non-global zone (Solaris)
146. File System Properties
Property         Change?   Brief Description
aclinherit                 ACL inheritance policy, when files or directories are created
aclmode                    ACL modification policy, when chmod is used
atime                      Disable access time metadata updates
canmount                   Mount policy
casesensitivity  creation  Filename matching algorithm (CIFS client feature)
devices                    Device opening policy for dataset
exec                       File execution policy for dataset
mounted          readonly  Is file system currently mounted?
147. More File System Properties
Property       Change?        Brief Description
nbmand         export/import  File system should be mounted with non-blocking mandatory locks (CIFS client feature)
normalization  creation       Unicode normalization of file names for matching
quota                         Max space dataset and descendants can consume
recordsize                    Suggested maximum block size for files
refquota                      Max space dataset can consume, not including descendants
setuid                        setuid mode policy
sharenfs                      NFS sharing options
sharesmb                      File system shared with CIFS
148. File System Properties
Property Change? Brief Description
snapdir Controls whether .zfs directory is hidden
utf8only creation UTF-8 character file name policy
vscan Virus scan enabled
xattr Extended attributes policy
149. Forking Properties
Pool Properties
Release            Property    Brief Description
illumos            comment     Human-readable comment field
Dataset Properties
Release            Property    Brief Description
Solaris 11         encryption  Dataset encryption
Delphix/illumos    clones      Clone descendants
Delphix/illumos    refratio    Compression ratio for references
Solaris 11         share       Combines sharenfs & sharesmb
Solaris 11         shadow      Shadow copy
NexentaOS/illumos  worm        WORM feature
Delphix/illumos    written     Amount of data written since last snapshot
151. Dataset Space Accounting
• used = usedbydataset + usedbychildren + usedbysnapshots +
usedbyrefreservation
• Lazy updates, may not be correct until txg commits
• ls and du will show size of allocated files which includes all
copies of a file
• Shorthand report available
$ zfs list -o space
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
rpool 126G 18.3G 0 35.5K 0 18.3G
rpool/ROOT 126G 15.3G 0 18K 0 15.3G
rpool/ROOT/snv_106 126G 86.1M 0 86.1M 0 0
rpool/ROOT/snv_b108 126G 15.2G 5.89G 9.28G 0 0
rpool/dump 126G 1.00G 0 1.00G 0 0
rpool/export 126G 37K 0 19K 0 18K
rpool/export/home 126G 18K 0 18K 0 0
rpool/swap 128G 2G 0 193M 1.81G 0
152. Pool Space Accounting
• Pool space accounting changed in b128, along with
deduplication
• Compression, deduplication, and raidz complicate pool
accounting (the numbers are correct, the interpretation is
suspect)
• Capacity planning for remaining free space can be challenging
$ zpool list zwimming
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
zwimming 100G 43.9G 56.1G 43% 1.00x ONLINE -
153. zfs vs zpool Space Accounting
• zfs list != zpool list
• zfs list shows space used by the dataset plus space for
internal accounting
• zpool list shows physical space available to the pool
• For simple pools and mirrors, they are nearly the same
• For RAIDZ, RAIDZ2, or RAIDZ3, zpool list will show space
available for parity
Users will be confused about reported space available
155. Accessing Snapshots
• By default, snapshots are accessible in .zfs directory
• Visibility of .zfs directory is tunable via snapdir property
✦ Don't really want find to find the .zfs directory
• Windows CIFS clients can see snapshots as Shadow Copies
for Shared Folders (VSS)
# zfs snapshot rpool/export/home/relling@20090415
# ls -a /export/home/relling
…
.Xsession
.xsession-errors
# ls /export/home/relling/.zfs
shares snapshot
# ls /export/home/relling/.zfs/snapshot
20090415
# ls /export/home/relling/.zfs/snapshot/20090415
Desktop Documents Downloads Public
156. Time Slider - Automatic Snapshots
• Solaris feature similar to OSX's Time Machine
• SMF service for managing snapshots
• SMF properties used to specify policies: frequency (interval)
and number to keep
• Creates cron jobs
• GUI tool makes it easy to select individual file systems
• Tip: take additional snapshots for important milestones to
avoid automatic snapshot deletion
Service Name Interval (default) Keep (default)
auto-snapshot:frequent 15 minutes 4
auto-snapshot:hourly 1 hour 24
auto-snapshot:daily 1 day 31
auto-snapshot:weekly 7 days 4
auto-snapshot:monthly 1 month 12
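A minimal sketch of managing these SMF services from the shell (service names as in the table):
# svcadm enable auto-snapshot:daily
# svcs -a | grep auto-snapshot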
157. Nautilus
• File system views which can go back in time
158. Resilver & Scrub
• Can be read IOPS bound
• Resilver can also be bandwidth bound to the resilvering device
• Both work at lower I/O scheduling priority than normal
work, but that may not matter for read IOPS bound devices
• Dueling RFEs:
✦ Resilver should go faster
✦ Resilver should go slower
✤ Integrated in b140
159. Time-based Resilvering
• Block pointers contain birth txg number
• Resilvering begins with oldest blocks first
• Interrupted resilver will still result in a valid file system view
[Diagram: block tree with blocks labeled by birth txg 27, 68, and 73]
160. ACL – Access Control List
• Based on NFSv4 ACLs
• Similar to Windows NT ACLs
• Works well with CIFS services
• Supports ACL inheritance
• Change using chmod
• View using ls
• Some changes in b146 to make behaviour more consistent
161. Checksums for Data
• DVA contains 256 bits for checksum
• Checksum is in the parent, not in the block itself
• Types
✦ none
✦ fletcher2: truncated 2nd order Fletcher-like algorithm
✦ fletcher4: 4th order Fletcher-like algorithm
✦ SHA-256
• There are open proposals for better algorithms
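For example, selecting a stronger algorithm for a dataset (dataset name illustrative; applies only to newly written data):
# zfs set checksum=sha256 tank/important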