2. Agenda
● Introduction/motivation
● ext4 – the new member of the extfs family
● Facts, specs
● Migration
● BTRFS – the newcomer ... the hope
● Facts, specs
● Migration
● Summary
OSDC 2011 2
3. Linux file systems
● More than 50 file systems shipped with the Linux kernel
● Local
● Remote
● Cluster
● ...
● A few are standard choices for the root file system
● ext2, ext3
● XFS
4. Linux file systems – challenges
● ReiserFS being sunset
● Limitations of ext3
● Changes in recent Enterprise distributions
5. Linux file systems – new players
● New version of the ext family -> ext4
● Marked as stable
● Shipped with Enterprise distributions
● New approach with BTRFS
● Still experimental
● Default in some projects, e.g. MeeGo
6. 4th extended file system
● Shipped since 2.6.19
● Stable since 2.6.28
● To overcome limits of ext3
● Size
● Performance
7. Ext4 - history
● Successor of ext3
● Started as a set of patches for ext3
● Later forked
● First called ext3dev (sometimes ext4dev)
● No impact on ext3 stability
● Fewer dependencies on ext3 code
● Source code easier to maintain
8. Ext4 - facts
● Max volume size: 1 EByte = 1024 PByte
● Max file size: 16 TByte
● Max length of file name: 255 Bytes
● Support of extended attributes
● No encryption
● No real compression support
● Partially 64-bit
9. Ext4 – starting from known
● Known tools
● mkfs
● fsck
● tune2fs
● e2label
10. Ext4 – global structure I
● Entry point -> superblock
● Block size
● Number of blocks and inodes
● Number of free blocks and inodes
● Disk divided in block groups
● backup of superblock
● Block group description (inode/block bitmaps)
11. Ext4 – global structure II
● Similar to ext3
● Inherits some ext3 limitations
● Number of inodes per block group
● 2nd type of block groups => flexible
● Flexible placement of bitmaps
● Bigger inodes to store additional information
● 256 Bytes
● Nanosecond time stamps
12. Ext4 – from blocks to extents
● Common addressing for modern file systems
● Contiguous area of blocks
● Less management information needed
● Fewer meta data operations
● Less “fragmentation”
● Requires change of on-disk format
13. Ext4 – extent I
● 15 bit for extent size
● Block size of 4 KByte => 128 MByte
● 1 bit for extent initialization information
struct ext4_extent {
    __le32 ee_block;    /* first logical block extent covers */
    __le16 ee_len;      /* number of blocks covered by extent */
    __le16 ee_start_hi; /* high 16 bits of physical block */
    __le32 ee_start_lo; /* low 32 bits of physical block */
};
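The 15 + 1 bit split of ee_len can be sketched as follows. This is a minimal illustration, assuming the threshold convention used by the Linux kernel's extent code (values above 32768 mark an extent as not yet initialized) and assuming ee_len has already been converted from little-endian to host byte order. The helper names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same threshold as the kernel's ext4 extent code. */
#define EXT_INIT_MAX_LEN (1u << 15) /* 32768 blocks */

/* True if the extent is preallocated but not yet written. */
static bool extent_is_unwritten(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks the extent really covers. */
static uint32_t extent_actual_len(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN ? ee_len : ee_len - EXT_INIT_MAX_LEN;
}

/* Largest initialized extent in bytes for a given block size. */
static uint64_t extent_max_bytes(uint32_t block_size)
{
    return (uint64_t)EXT_INIT_MAX_LEN * block_size;
}
```

With a block size of 4096, extent_max_bytes() yields 128 MByte, matching the figure on the slide.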
14. Ext4 – extent II
● 32 bit for block addresses inside file
● Block size of 4 KByte => 16 TByte
● 48 (!) bit for block addresses of file system
● Block size of 4 KByte => 1 EByte
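The two limits follow directly from the field widths. A small sketch, with hypothetical helper names and host byte order assumed, that joins the split physical block number and computes the addressable range:

```c
#include <stdint.h>

/* Join the split 48-bit physical start block (host byte order assumed). */
static uint64_t extent_start_block(uint16_t ee_start_hi, uint32_t ee_start_lo)
{
    return ((uint64_t)ee_start_hi << 32) | ee_start_lo;
}

/*
 * Bytes addressable with addr_bits bits of block number.
 * The caller must make sure the result fits into 64 bits.
 */
static uint64_t addressable_bytes(unsigned addr_bits, uint32_t block_size)
{
    return ((uint64_t)1 << addr_bits) * block_size;
}
```

addressable_bytes(32, 4096) gives 2^44 bytes (16 TByte per file), and addressable_bytes(48, 4096) gives 2^60 bytes (1 EByte per file system).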
15. Ext4 – extent III
● 60 Byte for extent information
● 12 Byte for extent header
● 12 Byte for extent structure
– Up to 4 extents per inode
– max. 512 MByte direct addressable (ext3: 48 KByte)
– Different schema for bigger files
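The 60 byte budget can be checked with a short sketch. The two structs below mirror the kernel's ext4_extent_header and ext4_extent layouts using plain fixed-width types. The struct and function names are made up for this example, and the on-disk little-endian encoding is ignored.

```c
#include <stdint.h>

struct extent_header {      /* mirrors ext4_extent_header */
    uint16_t eh_magic;      /* magic number identifying an extent node */
    uint16_t eh_entries;    /* number of valid entries */
    uint16_t eh_max;        /* capacity in entries */
    uint16_t eh_depth;      /* 0 = this node holds extents (leaf) */
    uint32_t eh_generation; /* generation of the tree */
};

struct extent {             /* mirrors ext4_extent */
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

_Static_assert(sizeof(struct extent_header) == 12, "header is 12 bytes");
_Static_assert(sizeof(struct extent) == 12, "extent is 12 bytes");

enum { INODE_EXTENT_AREA = 60 }; /* bytes reserved in the inode */

/* Extents that fit next to the header inside the inode. */
static unsigned extents_per_inode(void)
{
    return (unsigned)((INODE_EXTENT_AREA - sizeof(struct extent_header))
                      / sizeof(struct extent));
}

/* Bytes reachable without an extent tree (32768 blocks per extent). */
static uint64_t inline_addressable_bytes(uint32_t block_size)
{
    return (uint64_t)extents_per_inode() * 32768u * block_size;
}
```

For 4 KByte blocks this gives 4 extents and 512 MByte, the values listed above.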
16. Ext4 – extent tree I
● For files > 512 MByte
● B+ tree
● Extent structure only at leaf nodes
● New element: extent index
● Same header structure as data extents
● Points to data block
● Data block contains either extent indexes or extent structures
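How much such a tree can hold is a matter of simple arithmetic. The sketch below mirrors the kernel's ext4_extent_idx layout and estimates the number of extents reachable at a given tree depth. It assumes every node is completely filled. The helper names are hypothetical.

```c
#include <stdint.h>

struct extent_idx {      /* mirrors ext4_extent_idx */
    uint32_t ei_block;   /* index covers logical blocks from here on */
    uint32_t ei_leaf_lo; /* low 32 bits of the child block address */
    uint16_t ei_leaf_hi; /* high 16 bits of the child block address */
    uint16_t ei_unused;
};

_Static_assert(sizeof(struct extent_idx) == 12, "index entry is 12 bytes");

enum { HEADER_SIZE = 12, ENTRY_SIZE = 12, INODE_AREA = 60 };

/* Index or extent entries that fit into one tree block. */
static unsigned entries_per_block(unsigned block_size)
{
    return (block_size - HEADER_SIZE) / ENTRY_SIZE;
}

/* Upper bound of extents for a tree with `depth` levels below the inode. */
static unsigned long max_extents(unsigned block_size, unsigned depth)
{
    unsigned long n = (INODE_AREA - HEADER_SIZE) / ENTRY_SIZE; /* 4 root entries */

    for (unsigned i = 0; i < depth; i++)
        n *= entries_per_block(block_size);
    return n;
}
```

With 4 KByte blocks one tree block holds 340 entries, so a single level below the inode already allows up to 1360 extents.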
18. Ext4 – from extents to blocks
● In the end: block allocation
● New features
● Multi-block allocation
● Delayed allocation
● Persistent allocation
19. Ext4 – multi-block allocation
● Ext3: only one block
● 12800 calls for 50 MByte file
● Ext4: multiple blocks per call
● Less overhead
● Contiguous physical location of data
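The number of allocator calls can be reproduced with a few lines. This is only an arithmetic sketch. The batch size passed for ext4 is a hypothetical best case, since the real multi-block allocator decides per request how many blocks it hands out.

```c
#include <stdint.h>

/* Blocks needed to store file_size bytes (rounded up). */
static uint64_t blocks_needed(uint64_t file_size, uint32_t block_size)
{
    return (file_size + block_size - 1) / block_size;
}

/* Allocator calls if one call hands out at most blocks_per_call blocks. */
static uint64_t allocator_calls(uint64_t file_size, uint32_t block_size,
                                uint32_t blocks_per_call)
{
    uint64_t blocks = blocks_needed(file_size, block_size);

    return (blocks + blocks_per_call - 1) / blocks_per_call;
}
```

allocator_calls(50 MByte, 4096, 1) returns the 12800 calls quoted for ext3. If a single call could return a full 32768 block extent, the same file would need one call.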
20. Ext4 – delayed allocation
● Ext3
● Instant block allocation
● Fragmentation due to buffers and caches
● Ext4
● Delayed block allocation
● Use cache information for placement
● Risk of data loss in early versions => improved since 2.6.30
21. Ext4 – “clever” allocation
● Support of system call fallocate()
● Application reserves blocks ahead
● File system ensures disk space availability
● Allocation information in extent structure
● Remember the 16th bit
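From the application side, reserving space ahead of time might look like the sketch below. It uses posix_fallocate(), the portable POSIX interface, instead of the Linux-specific fallocate() call named above. On file systems without native support the C library may fall back to writing the blocks itself. The function name and the path template are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve `size` bytes for a scratch file and return the resulting
 * file size, or -1 on error. The path template is only an example.
 */
static long long preallocate_scratch_file(off_t size)
{
    char path[] = "/tmp/prealloc-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path); /* the file disappears once fd is closed */

    long long result = -1;
    struct stat st;

    /* posix_fallocate() returns an error number, not -1 with errno. */
    if (posix_fallocate(fd, 0, size) == 0 && fstat(fd, &st) == 0)
        result = (long long)st.st_size;

    close(fd);
    return result;
}
```

After a successful call the file system has guaranteed the space, so later writes into that range should not fail for lack of disk space.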
22. Ext4 – consistent status
● New journaling => JBD2
● Transactions have checksums
● 64 bit ready
● Deactivation possible
23. Ext4 – repair
● Improved fsck()
● No check of unused blocks
– Information stored in block group header
– Information secured via checksums
– (De)activation possible at any time
● First run as slow as in ext3
24. Ext4 – other news
● Nanosecond precision time stamps
● Year 2038 problem ("Unix millennium bug") shifted to 2514
● More subdirectories
● Up to 65000
● More than 65000 ... with limitation
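The year 2514 in the time stamp bullet can be reproduced with a rough calculation. The sketch assumes that the seconds counter grows from 32 to 34 bits and is read as an unsigned count starting in 1970. Both points are assumptions made for this example, and an average Gregorian year is used, so the result is an approximation.

```c
#include <stdint.h>

/* Average Gregorian year: 365.2425 days of 86400 seconds each. */
#define SECONDS_PER_YEAR 31556952ull

/* Year in which an unsigned `bits`-bit second counter from 1970 overflows. */
static unsigned overflow_year(unsigned bits)
{
    uint64_t seconds = (uint64_t)1 << bits;

    return 1970u + (unsigned)(seconds / SECONDS_PER_YEAR);
}
```

overflow_year(31) gives 2038, the limit of a signed 32 bit counter, and overflow_year(34) gives 2514.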
25. Ext4 – general migration paths
● mkfs() and backup/restore
● Clean new file system structure
● Only way for file systems other than ext2/3
● Extended outage
● Conversion via tune2fs
● Partial only
● Only possible for ext family
● Faster/easier
26. Ext4 – background for migration
● 2 kinds of changes compared to ext3
● Change of on-disk format:
– Extents
– Only enabled for new files via tune2fs
– Additional tasks needed
● On-disk format not affected:
– Block allocation
– Immediately enabled via tune2fs
27. Ext4 – migration via tune2fs
● Results in mix of ext3 and ext4 structure
● Access via ext3 driver impossible
● fsck() needed
Parameter     Description
extent        Extent-based block allocation
flex_bg       Flexible placement of meta data
uninit_bg     Flag uninitialized blocks for faster fsck
dir_nlink     Unlimited number of subdirectories
extra_isize   Time stamps with nanoseconds
28. Ext4 – migration hints
● fsck() recommended
● /boot – booting from ext4 possible?
● Rescue media enabled for ext4?
29. Ext4 – summary
● Good successor of ext3
● Manages larger amounts of data
● Faster
● Performance
● Recovery
● Safer
● Sufficient migration options from ext2/3
30. Better/b-tree file system
● Shipped since 2.6.29
● Still experimental
● Replace ext3/4
● New storage management approach
31. BTRFS - history
● Basic idea
● Shown 2007
● Usage of B trees for standard structures
● Not new ... see XFS, ReiserFS
● Chris Mason
● Worked on ReiserFS for SUSE
● Moved to Oracle -> started BTRFS development
32. BTRFS - facts
● Max file/volume size: 16 EByte
● Max length of file name: 255 Bytes
● Support of
● Extended attributes
● Encryption
● Compression
● Snapshot
● Copy-on-Write
33. BTRFS – global structure
● Entry point -> superblock
● More than one file system per volume
● Extents
● Put together in block groups
● No mix of data and meta data
34. BTRFS – internals: the trees
● Consists of B+ trees
● Root tree
● File system tree
● Extent allocation tree
● Checksum tree
● Log tree
● Chunk & device tree
● Data relocation tree
35. BTRFS – internals: structures
● 3 structures
● Key
– index of the tree structure
● Block header
– ID of file system
– Reference of insert time
– Level position
● Item
– Different types: inodes, extents, directories
36. BTRFS – internals: the key
● Index of the tree structure
● Size: 136 bit
● First 64 bit: unique object ID
● Next 8 bit: type/item
● Last 64 bit: item dependent
● e.g. Hash of directory name
● e.g. Number of elements in directory
● e.g. object ID of upper layer directory
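The 136 bit key maps naturally onto a packed struct. The following sketch uses made-up names and the GCC/Clang packed attribute. It shows the layout and the ordering used to sort keys inside a tree: first by object ID, then by type, then by the item dependent value.

```c
#include <stdint.h>

struct btrfs_key_sketch {
    uint64_t objectid; /* first 64 bit: unique object ID */
    uint8_t  type;     /* next 8 bit: item type */
    uint64_t offset;   /* last 64 bit: meaning depends on the type */
} __attribute__((packed));

_Static_assert(sizeof(struct btrfs_key_sketch) * 8 == 136, "key is 136 bit");

/* Returns <0, 0 or >0 like strcmp(): objectid first, then type, then offset. */
static int key_compare(const struct btrfs_key_sketch *a,
                       const struct btrfs_key_sketch *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

Because the object ID is compared first, all items that belong to one object end up next to each other in the tree.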
37. BTRFS – internals: the item
● More than one item per object ID possible
Item          Value
INODE_ITEM    1
XATTR_ITEM    24
DIR_ITEM      84
DIR_INDEX     96
EXTENT_DATA   108
EXTENT_CSUM   128
ROOT_ITEM     132
EXTENT_ITEM   168
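In code, the type byte of a key can be represented by an enum with exactly the values from the table. The enum and the lookup helper are illustrative names for this example, not the identifiers of the BTRFS sources.

```c
/* Item types from the table above; the value is stored in the key's type byte. */
enum item_type {
    INODE_ITEM  = 1,
    XATTR_ITEM  = 24,
    DIR_ITEM    = 84,
    DIR_INDEX   = 96,
    EXTENT_DATA = 108,
    EXTENT_CSUM = 128,
    ROOT_ITEM   = 132,
    EXTENT_ITEM = 168
};

/* Human readable name for a type byte, e.g. for a debugging dump. */
static const char *item_type_name(unsigned type)
{
    switch (type) {
    case INODE_ITEM:  return "INODE_ITEM";
    case XATTR_ITEM:  return "XATTR_ITEM";
    case DIR_ITEM:    return "DIR_ITEM";
    case DIR_INDEX:   return "DIR_INDEX";
    case EXTENT_DATA: return "EXTENT_DATA";
    case EXTENT_CSUM: return "EXTENT_CSUM";
    case ROOT_ITEM:   return "ROOT_ITEM";
    case EXTENT_ITEM: return "EXTENT_ITEM";
    default:          return "UNKNOWN";
    }
}
```

Since keys are sorted by type after the object ID, the inode item (1) of an object always comes before its directory items (84, 96) and its extent data (108).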
38. BTRFS – more about trees
● Highest layer
● Root tree
● Referenced in superblock
● Other trees => object ID in root tree
● Some trees unique
● Extent allocation
● Data relocation
● Possibly multiple trees
● File system
39. BTRFS – file system tree
● Visible part
● Contains:
● Inode items
● Reference items
● No data of files
● See extents
● Exception: small files
40. BTRFS – extent allocation tree
● Space management
● Backward reference
● file system object
● Possibly multiple per extent
● Maybe move to extent data reference object
41. BTRFS – other trees
● Log tree
● Collects fsync() calls
● Journal of this kind of COW calls
● Checksum tree
● CRC32 checksums of data and meta data
● Chunk tree
● Manage devices: device item and chunk map item
● Device tree
● Counterpart of chunk tree
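For the checksum tree, a bit-by-bit CRC gives an idea of what is computed per block. The sketch assumes the CRC32C (Castagnoli) variant, which BTRFS uses by default, with the standard initial value and final inversion. It is a plain textbook version, not the kernel's optimized code, and the on-disk seed handling may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (reflected polynomial 0x82F63B78), standard init and final XOR. */
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            /* Shift right; XOR in the polynomial if the dropped bit was set. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The usual check value for this variant is crc32c("123456789", 9) == 0xE3069283. A production implementation would typically use lookup tables or the CPU's CRC instructions.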
42. BTRFS – device management
● Integrated volume manager
● Pool concept
● RAID-0 and RAID-1
● For data and meta data
● Not necessarily identical
● Chunk tree
● Abstracts from disk blocks
44. BTRFS – what else
● Transparent compression via zlib
● Support of POSIX ACLs
● Online grow/shrink
● Online add/removal of disks
● No fsck() tool (yet)
● Management tool evolution (btrfsctl -> btrfs)
45. BTRFS – migration I
● Via tool btrfs-convert
● du/df not fully BTRFS-aware
● In place from ext3/4
● Via libext2fs
● BTRFS meta data location flexible
● Old ext3/4 organized in snapshot
● Roll-back possible to date/time of conversion
47. BTRFS summary
● Still experimental
● Meets standard file system requirements
● Bridges existing gaps
● e.g. snapshots
● e.g. easy migration from ext3/4
● New approach to storage management
● e.g. integrated volume manager
48. Summary
● Improvement moving to ext4
● Safe switching to ext4
● In place migration from ext3 possible
● Future is BTRFS
● In place migration from ext3/4 to BTRFS possible