


  1. XFS and Other Journaling File Systems SANTA CLARA UNIVERSITY COEN 396 Network Storage Systems [Winter 2002]  Lilish Saki  Gordon Lui
  2. OVERVIEW  Journaling file systems and their relevance to NSS.  Journaling concept.  XFS design and specifications.  Other journaling file system designs.  JFS.  ReiserFS.  Ext3.  Summary – Comparison.  Conclusion.
  3. Journaling file systems and their relevance to NSS  With a traditional file system such as UFS, whenever the system restarts after an unexpected shutdown (e.g. a system crash), it runs a file system integrity check such as fsck.  This check ensures that all internal data structures are correct and that the file system is consistent.  On small systems, the time this check takes is not a big problem.
  4. Journaling file systems and their relevance to NSS (contd.)  However, on large servers with file systems of hundreds of gigabytes, sometimes terabytes, as typically found in storage networking environments, this check can take several hours to run.  This unavailability of data can be very expensive when end users or applications are waiting for the data in order to get work done.  To overcome this problem, journaling (or journaled) file systems were introduced.
  5. Journaling file systems and their relevance to NSS (contd.)  The file system maintains a journal file, or files, that track the status of write operations in the file system.  A system with this kind of file system can come back up in a matter of seconds after an unexpected shutdown.  Compared to a non-journaled file system, system availability improves greatly and the cost of downtime falls.  Some examples: XFS, ReiserFS, Ext3fs, JFS.
  6. What is Journaling?  The journaling concept is similar to that of database systems, in which the system keeps records of its internal status.  One major difference between database and file system journaling is that databases log user and control data, while file systems tend to log metadata only. Metadata are the control structures inside a file system: i-nodes, free block allocation maps, i-node maps, etc.  Before the file system driver makes any changes to the metadata, the journaled file system records each pending write I/O operation in a separate system journal file, describing what it is about to do. Then it goes ahead and modifies the metadata.
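The ordering described on this slide (journal first, metadata second) can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class name `JournaledFS`, the key strings, and the in-memory list standing in for the journal file are all hypothetical and do not come from any of the file systems discussed.

```python
class JournaledFS:
    """Toy model: metadata is a dict; the journal is an append-only list."""

    def __init__(self):
        self.journal = []    # stands in for the separate on-disk journal file
        self.metadata = {}   # stands in for i-nodes, allocation maps, etc.

    def update_metadata(self, key, value):
        # 1. Describe the intended change in the journal first.
        self.journal.append(("set", key, value))
        # 2. Only then modify the metadata itself.
        self.metadata[key] = value


fs = JournaledFS()
fs.update_metadata("inode:7:size", 4096)
fs.update_metadata("freemap:block:120", "allocated")
print(fs.journal)
```

Because every change is described in the journal before it is applied, the journal always holds enough information to redo a change that was interrupted.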
  7. Journaling in action.  [Figure: the process of writing the journal and writing the data. (1) Write held in host cache. (2) Journal written to storage. (3) Write flushed from cache to the file system.]
  8. Journaling in action (contd.)  When the file system is mounted, the file system driver checks whether the file system is OK. If for some reason it isn't, the metadata needs to be fixed, but instead of performing an exhaustive metadata scan (as fsck does), the driver looks at the journal.  [Figure: recovery from the journal. (1) Read from journal in storage. (2) Verify journal against the data structures in storage.]
  9. Journaling in action (contd.)  Since the journal contains a chronological log of all recent metadata changes, the driver simply inspects those portions of the metadata that have been recently modified.  This process is much faster than running a complete analysis of the file system data structures.  Thus the system can come up in a few seconds, and availability improves greatly compared to non-journaled systems.  Understanding journaling file systems: in addition to storing data (your stuff) and metadata (the data about the stuff), they also have a journal, which you could call meta-metadata (the data about the data about the stuff).
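A minimal sketch of journal-based recovery, assuming a journal of simple `("set", key, value)` records (a hypothetical format chosen for illustration). It shows why recovery cost depends on the length of the journal rather than on the size of the file system: only the keys named in the journal are touched.

```python
def recover(metadata, journal):
    """Replay journal records in order; return the keys that were touched."""
    touched = []
    for op, key, value in journal:
        if op == "set":
            metadata[key] = value   # redoing a change twice is harmless
            touched.append(key)
    return touched


# Metadata as found after a crash: the second change never reached disk.
metadata = {"inode:7:size": 4096, "inode:9:size": 0}
journal = [("set", "inode:7:size", 4096), ("set", "inode:9:size", 512)]

touched = recover(metadata, journal)
print(touched, metadata)
```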
  10. Overview of XFS  XFS, a 64-bit journaled file system, was introduced in 1994 by Silicon Graphics, Inc. (SGI) for its System V-based version of Unix.  It was introduced in response to growing demand for large disk capacity and bandwidth. The requirements also included fast crash recovery, support for large file systems, and directories with large numbers of files.  XFS is also available for Linux as open source, licensed under the GPL.
  11. Features of XFS  Highly scalable 64-bit file system.  18,000 petabyte maximum file system size (1 PB = 10^6 GB).  9,000 petabyte maximum file size.  Asynchronously journaled (no fsck).  Designed around transactions and a log.  Restarts after a crash in seconds.  B+ tree (balanced tree) design for directory entries, the metadata free list, and the extent list within a file.  Filenames are converted to a four-byte hash value used to index the directory.  Directory searching is extremely fast.
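The two size limits on this slide can be checked with a short calculation. The assumption here, which the slide does not state, is that the file system limit corresponds to a full 64-bit byte address space and the file limit to a signed 64-bit offset; the rounded figures of 18,000 PB and 9,000 PB are consistent with that reading.

```python
PB = 10**15  # decimal petabyte, matching the slide's 1 PB = 10^6 GB

max_fs_bytes = 2**64     # assumption: full 64-bit byte address space
max_file_bytes = 2**63   # assumption: signed 64-bit file offset

print(max_fs_bytes // PB)    # 18446
print(max_file_bytes // PB)  # 9223
```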
  12. Features of XFS (contd.)  Extent based.  Extents are sets of contiguous logical blocks.  An extent descriptor has three components: beginning block, extent size, and offset within the file.  Reduces the amount of disk space required to track free disk blocks.  Extent size from 512 bytes to 1 GB.  Support for sparse files.  Sparse file support follows from the extent addressing technique.
  13. Features of XFS (contd.)  Instead of allocating free blocks just to fill the gaps in a sparse file, the file system sets up a new extent with the corresponding "offset within the file" field.  Dynamic allocation of disk blocks to i-nodes.  Free space usage becomes efficient.  Parallelism is achieved through partitioned regions called allocation groups (AGs).  Each AG manages its own free space and i-nodes.
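The relationship between extent descriptors and sparse files can be sketched as follows. This is an illustrative model, not the XFS on-disk format: the `Extent` field names and the linear search are assumptions made for brevity (the slides say XFS keeps the extent list in a B+ tree).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Extent:
    file_offset: int   # first logical block of the file covered by this extent
    start_block: int   # first disk block of the contiguous run
    length: int        # number of contiguous blocks


def lookup(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Map a logical file block to a disk block; None means a hole."""
    for ext in extents:
        if ext.file_offset <= logical_block < ext.file_offset + ext.length:
            return ext.start_block + (logical_block - ext.file_offset)
    return None


# A sparse file: blocks 0-3 are written, 4-999 are a hole, 1000-1001 are written.
sparse_file = [Extent(0, 5000, 4), Extent(1000, 7000, 2)]

print(lookup(sparse_file, 2))     # 5002
print(lookup(sparse_file, 500))   # None
print(lookup(sparse_file, 1001))  # 7001
```

The hole costs no disk blocks at all: the second extent simply starts at a later file offset, so six blocks on disk describe a file that is 1002 blocks long.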
  14. Features of XFS (contd.)  Supports Guaranteed Rate I/O (GRIO), which allows applications to reserve bandwidth to or from the file system. XFS calculates the performance available and guarantees that the requested level of performance is met for a specified time.  This functionality is useful for full-rate, high-resolution media delivery systems such as video-on-demand or satellite systems that need to process information at a certain rate.  NFS v3 compatibility.
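The idea of reserving bandwidth can be illustrated with a simple admission check. This is a conceptual model only and not the GRIO interface: the class, its methods, and the MB/s figures are hypothetical, and the time window that a real reservation covers is left out.

```python
class BandwidthReservations:
    """Admit a reservation only if the total stays within capacity."""

    def __init__(self, capacity_mb_per_s):
        self.capacity = capacity_mb_per_s
        self.reserved = {}  # stream name -> reserved MB/s

    def reserve(self, stream, rate_mb_per_s):
        if sum(self.reserved.values()) + rate_mb_per_s > self.capacity:
            return False          # the rate cannot be guaranteed; reject
        self.reserved[stream] = rate_mb_per_s
        return True

    def release(self, stream):
        self.reserved.pop(stream, None)


disk = BandwidthReservations(capacity_mb_per_s=100)
print(disk.reserve("video-1", 60))  # True
print(disk.reserve("video-2", 60))  # False
disk.release("video-1")
print(disk.reserve("video-2", 60))  # True
```

Rejecting the second request up front, rather than letting both streams degrade, is what makes the rate a guarantee rather than a best effort.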
  15. XFS ARCHITECTURE  [Figure: XFS architecture block diagram. Components: disk drivers, volume manager, buffer cache, transaction manager, space manager, I/O manager, directory manager, system call interface.]
  16. XFS Architecture.  Although the implementation is modular, it is very large and complex.  The high-level structure is similar to a traditional file system, with the addition of a volume manager and a transaction manager.  Supports standard Unix file interfaces and is POSIX compliant.  The transaction manager is used by the other pieces of the file system to make all updates to file system metadata atomic.  The volume manager provides an abstraction between XFS and its underlying disk devices.
  17. XFS - Asynchronous log/transactions  Transaction: a collection of metadata changes.  A single logical file system operation.  After each transaction, the FS is consistent.  The XFS log has two parts.  In-core log buffers (from 2 to 8).  On-disk log (always written, never read during normal operation); it is a circular buffer (cycle/block no.).  XFS journals metadata by first writing to in-core log buffers, then asynchronously writing the log buffers to the on-disk log.
  18. XFS - Asynchronous log/transactions (contd.)  After a crash, the on-disk log is read by the recovery code, which is called by mount.  XFS metadata modifications use transactions.  Create, remove, link, unlink, allocate, truncate, and rename operations all require transactions.  Transactions are committed to in-core log buffers.  One major aspect of journaling is write-ahead logging.  Metadata is pinned in kernel memory while the transaction is being committed to the on-disk log.  Metadata is unpinned once the in-core log has been written to the on-disk log.
  19. XFS - Asynchronous log/transactions (contd.)  XFS gains two things by writing the log asynchronously.  Multiple updates can be batched into a single log write.  This increases the efficiency of log writes with respect to the underlying disk array.  The performance of metadata updates is made independent of the speed of the underlying drives.  In situations where metadata updates are very intense, the log can be stored on a separate device such as a dedicated disk.  This is useful when a file system is exported via NFS, which requires synchronous transactions.
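Batching and pinning can be sketched together in a small model. It is a simplification under stated assumptions: the `AsyncLog` class and its fields are hypothetical, a single in-core buffer stands in for the 2 to 8 buffers mentioned earlier, and "pinned" is modelled as membership in a set.

```python
class AsyncLog:
    def __init__(self):
        self.in_core = []       # in-core log buffer (committed, not yet on disk)
        self.on_disk = []       # on-disk log
        self.pinned = set()     # metadata items that must not be written back yet
        self.disk_writes = 0    # number of log writes issued to the disk

    def commit(self, txn_id, items):
        """Commit a transaction to the in-core log and pin its metadata."""
        self.in_core.append((txn_id, tuple(items)))
        self.pinned.update(items)

    def flush(self):
        """Write all buffered transactions in one log write, then unpin."""
        if not self.in_core:
            return
        self.on_disk.extend(self.in_core)
        self.disk_writes += 1
        for _, items in self.in_core:
            self.pinned.difference_update(items)
        self.in_core.clear()


log = AsyncLog()
log.commit(1, ["inode:7"])
log.commit(2, ["inode:7", "dir:3"])
log.commit(3, ["freemap"])
print(sorted(log.pinned))   # ['dir:3', 'freemap', 'inode:7']
log.flush()
print(log.disk_writes, len(log.on_disk), log.pinned)  # 1 3 set()
```

Three transactions reach the on-disk log with a single write, and no metadata item is released for write-back until the log records describing it are on disk.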
  20. JFS  IBM's Journaled File System (JFS) is a journaling file system used in IBM's enterprise servers.  It is a log-based, byte-level file system that was developed for transaction-oriented, high-performance systems.  JFS is being ported to the Linux operating system under the GNU General Public License.  Designed primarily for the high throughput and reliability requirements of servers (from single-processor to multiprocessor and clustered systems).  JFS is also applicable to client configurations where performance and reliability are desired.
  21. Features of JFS  Internal JFS (potential) limits.  All file system structure fields are 64 bits in size.  This allows JFS to support both large files and large partitions.  File system size.  The minimum file system size supported by JFS is 16 Mbytes.  The maximum file system size is a function of the file system block size and the maximum number of blocks supported by the file system metadata structures.  JFS supports a maximum file size ranging from 512 terabytes (1 TB = 10^3 GB) with a block size of 512 bytes to 4 petabytes with a block size of 4 Kbytes.
  22. Features of JFS (contd.)  File size.  The maximum file size is the largest file size that the virtual file system framework supports.  For example, if the framework only supports 32 bits, then this limits the file size.  Journaling restores a file system to a consistent state in a matter of seconds.  Uses the database concept of transaction logging.  Logging is not particularly effective in the face of media errors.  This implies that bad block relocation is a key feature of any storage manager or device residing below JFS.
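The statement that the maximum size is a function of block size and block count can be made concrete. The block count of 2^40 used here is an inference from the two figures on the slide, not a number the slide states, and the calculation reads the terabytes and petabytes as binary units.

```python
TIB = 2**40  # binary terabyte
PIB = 2**50  # binary petabyte

MAX_BLOCKS = 2**40  # assumption inferred from the two figures on the slide


def max_size_bytes(block_size):
    """Maximum size = block size x maximum number of blocks."""
    return block_size * MAX_BLOCKS


print(max_size_bytes(512) // TIB)   # 512
print(max_size_bytes(4096) // PIB)  # 4
```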
  23. Features of JFS (Contd.)  Variable Block size.  Block sizes 512, 1024, 2048 and 4096 bytes.  allowing users to optimize space utilization based on their application environment.  Dynamic disk node allocation.  Allocate/free disk I-nodes as required.  avoids the traditional approach of reserving a fixed amount of space for disk inodes at the file system creation time.  Decouples disk I-nodes from fixed locations.
  24. Features of JFS (contd.)  Performance.  Extent-based addressing structure.  Results in a compact, efficient mapping of logical offsets within files to physical addresses on disk.  B+ tree populated with extent descriptors.  B+ trees are used throughout JFS.  Reading and writing extents.  Directory entries sorted by name.  File layout.  Sparse and dense file support.  Sparse files reduce the number of blocks written to disk.  Dense file allocation covers the complete file size.
  25. JFS Architecture and design  The JFS architecture can be explained in the context of its disk layout characteristics.  Logical volumes.  Physical disk or some subset of the physical disk space such as an FDISK partition. A logical volume is also known as a disk partition.  Aggregates and file sets.  Array of disk blocks containing a specific format that includes a super block and an allocation map.  Format includes the initial file set and control structures necessary to describe it. The file set is the mountable entity.
  26. JFS Architecture and design (contd.)  Files, directories, i-nodes, and addressing structures.  A file set contains files and directories. Files and directories are represented persistently by i-nodes.  I-nodes are also used to represent other file system objects, such as the map that describes the allocation state and on-disk location of each i-node in the file set.  Directories map user-specified names to the i-nodes allocated for files and directories.  Directories form the traditional naming hierarchy.  Together, the aggregate super block, disk allocation map, file descriptor and i-node map, i-nodes, directories, and addressing structures represent the JFS control structures, or metadata.
  27. Journaling – JFS Logging.  Journaling.  Logging style: asynchronous journaling of metadata only.  Does not log file data or recover it to a consistent state. Thus, some file data may be lost or stale after recovery.  Journaling design: layout of the log.  Circular linked list of transaction "blocks".  Held in memory.  Written to disk; the location of the log is found via the super block.  Log redo.  Replays all transactions committed since the most recent sync point.  The super block is read first.
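The log-redo step can be sketched as a replay of committed transactions that follow the last sync point. The record layout (`"update"`, `"commit"`, and `"sync"` tuples) is a hypothetical format invented for this illustration and is not the JFS log format.

```python
def log_redo(log, metadata):
    """Replay committed transactions that follow the last sync point."""
    # Find the most recent sync point; everything before it is already on disk.
    start = 0
    for i, rec in enumerate(log):
        if rec[0] == "sync":
            start = i + 1
    tail = log[start:]
    committed = {rec[1] for rec in tail if rec[0] == "commit"}
    for rec in tail:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, value = rec
            metadata[key] = value
    return metadata


log = [
    ("update", 1, "inode:4", "v1"), ("commit", 1),
    ("sync",),
    ("update", 2, "inode:5", "v2"), ("commit", 2),
    ("update", 3, "inode:6", "v3"),          # never committed: ignored
]
print(log_redo(log, {"inode:4": "v1"}))
# {'inode:4': 'v1', 'inode:5': 'v2'}
```

Transaction 1 is skipped because it precedes the sync point, and transaction 3 is skipped because it has no commit record, so only transaction 2 is redone.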
  28. ReiserFS.  ReiserFS 3.6.x is designed and developed by Hans Reiser and his team of developers at Namesys.  The goal is a single shared environment, or namespace, in the file system, where applications can interact more directly, efficiently and powerfully.  Initially Namesys focused on one aspect of the file system: small file performance.  ReiserFS version 4.0 is under development, sponsored primarily by DARPA.  Due in September 2002.
  29. Features of ReiserFS  ReiserFS stores all file system objects in a single B* tree (an enhanced version of the B+ tree).  The main difference from other file systems is that every file system object is placed within this single B* tree.  There is no separate tree per directory; instead, each directory is a sub-tree of the main file system tree.  Hashing techniques are used to obtain the key field needed to organize items within the B* tree.  The tree supports:  Dynamic i-node allocation.  Compact, indexed directories.  Resizable items.  60-bit offsets.
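The idea of one tree holding every object, with each directory occupying a sub-range of it, can be sketched with a sorted key list. Everything here is an assumption made for illustration: the `(dir_id, hash)` key shape, the use of `zlib.crc32` as a stand-in hash, and the sorted list standing in for a B* tree. Hash collisions are ignored.

```python
import bisect
import zlib


def key_for(dir_id, name):
    # zlib.crc32 is a stand-in hash, not the function ReiserFS uses.
    return (dir_id, zlib.crc32(name.encode()))


class SingleTree:
    """A sorted key list standing in for the one file-system-wide tree."""

    def __init__(self):
        self.keys = []
        self.items = {}

    def insert(self, dir_id, name, obj):
        k = key_for(dir_id, name)
        if k not in self.items:
            bisect.insort(self.keys, k)
        self.items[k] = obj

    def lookup(self, dir_id, name):
        return self.items.get(key_for(dir_id, name))

    def list_dir(self, dir_id):
        # All entries of one directory form a contiguous key range (a sub-tree).
        lo = bisect.bisect_left(self.keys, (dir_id, 0))
        hi = bisect.bisect_left(self.keys, (dir_id + 1, 0))
        return [self.items[k] for k in self.keys[lo:hi]]


tree = SingleTree()
tree.insert(1, "a.txt", "file a")
tree.insert(2, "b.txt", "file b")
tree.insert(1, "c.txt", "file c")
print(tree.lookup(1, "c.txt"))     # file c
print(sorted(tree.list_dir(1)))    # ['file a', 'file c']
```

Because the directory id is the leading part of the key, entries from different directories never interleave, which is what lets a directory behave as a sub-tree without having a tree of its own.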
  30. Features of ReiserFS (contd.)  Small file performance.  Performance increases thanks to the tree structure and, as in the other file systems, dynamic i-node allocation.  ReiserFS stores small files inside the B* tree leaf nodes themselves, rather than storing the data somewhere else on the disk and pointing to it.  Large file support.  Max file system size: 16 TB (4 G blocks of 4 KB each).  Sparse file support.  Supported, but not particularly fast.  Free block management.  Bitmaps.
  31. Features of ReiserFS (contd.)  Extent support.  Not supported, but planned for version 4.  ReiserFS version 4 features.  Modular, high-performance journaling file system strengthened against attack.  Focuses on extensibility via plugins: file, directory, hash, security, node search, item search, and key assignment plugins.  Security is enhanced with mechanisms such as aggregation plugins, auditing plugins, etc.
  32. Features of ReiserFS ver. 4.0 (contd.)  Will employ "dancing trees" instead of balanced trees.  These trees merge insufficiently full nodes not with every modification to the tree, but instead:  in response to memory pressure triggering a commit,  when an insertion into an internal node presents a danger of needing to split the internal node.  Use of a repacker.  For space efficiency.
  33. ReiserFS (3.xx) Journaling  The ReiserFS journal uses a simple metadata-only, write-ahead logging scheme.  In this scheme, before any changes are written to disk, they are first committed to a log.  After a crash, committed transactions are replayed, which amounts to copying blocks from the log into the main disk area.  It is common for the same blocks to be logged over and over again.  Because such blocks need to reach the main disk area less often, the total number of writes is lower, and most of the writes go to the sequential log.  ReiserFS stores everything in a balanced tree, hence the tree frequently needs balancing.
  34. ReiserFS (3.xx) Journaling (contd.)  Tree blocks are allocated, modified and then freed in another balance later on.  With larger transactions, a block can be freed before it is ever written to the log or the main disk.  Generally, log I/O is done by a worker thread, kreiserfsd.  This allows log commits to happen in the background, without slowing down user processes.  However, the log has a fixed size, so user processes might have to wait for log space to become available before they can start a new transaction.
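Why larger transactions can save writes may be easier to see in a small model. This is a sketch under simplifying assumptions, not the ReiserFS journal code: the `Transaction` class is hypothetical, and a Python list stands in for the log.

```python
class Transaction:
    def __init__(self):
        self.blocks = {}   # block number -> latest contents in this transaction

    def log_block(self, blocknr, contents):
        # Logging the same block again just replaces the pending copy.
        self.blocks[blocknr] = contents

    def free_block(self, blocknr):
        # A block freed before commit never needs to reach the disk.
        self.blocks.pop(blocknr, None)

    def commit(self, log):
        log.extend(sorted(self.blocks.items()))
        return len(self.blocks)


log = []
txn = Transaction()
txn.log_block(10, "tree node v1")
txn.log_block(10, "tree node v2")   # same block logged again
txn.log_block(11, "temporary node")
txn.free_block(11)                  # freed in a later balance
print(txn.commit(log), log)         # 1 [(10, 'tree node v2')]
```

Three logging calls and one free result in a single block written at commit time, which is the effect the slides describe for blocks that are modified repeatedly or freed early.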
  35. EXT3  The Linux ext3 journaling file system is a set of incremental enhancements to ext2.  Max file system size: 4 TB.  Ext2 and ext3 use identical metadata, so in-place ext2 to ext3 file system upgrades are possible.  Being an add-on to ext2fs has the drawback that the advanced optimization techniques employed in the other journaling file systems are unavailable:  no balanced trees, no extents for free space, etc.
  36. EXT3 Journaling  Ext3 handles journaling very differently from ReiserFS and the other journaling file systems.  With ReiserFS, XFS, and JFS, the file system driver journals metadata but makes no provision for journaling data.  With these file systems, the metadata remains consistent.  There is a possibility, however, that unexpected system lock-ups can result in corruption of recently modified data.  Ext3 approach.  The journaling code uses a special API called the Journaling Block Device layer, or JBD.  JBD manages the journal on behalf of the ext3 file system driver.
  37. EXT3 Journaling (contd.)  JBD uses physical journaling, which means that it uses complete physical blocks as the underlying unit for implementing the journal.  Thus the ext3 journal will have a larger relative on-disk footprint than, say, an XFS journal.  Both metadata and data journaling (data=journal).  Avoids the data corruption problem.  The drawback of full data journaling is that it can be slow.
  38. EXT3 Journaling (contd.)  Journaling metadata only (data=ordered).  Ext3 officially journals only metadata, but it logically groups metadata and data blocks into a single unit called a transaction.  Data blocks are written to disk first. Once they are written, the metadata changes are written to the journal. Thus this mode provides data and metadata consistency.  Writeback mode (data=writeback).  Does no data journaling at all, providing journaling similar to that found in the XFS, JFS, and ReiserFS file systems (metadata only).  Better file system performance.
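The three modes differ mainly in what goes through the journal and in what order data and metadata reach the disk. The sketch below summarizes that ordering as described on these slides; the function name and the step strings are hypothetical, and the steps are a simplified reading rather than the exact sequence the kernel performs.

```python
def write_plan(mode):
    """Return the ordered steps for one file update under an ext3 data mode."""
    if mode == "journal":
        return ["journal: data + metadata", "disk: data", "disk: metadata"]
    if mode == "ordered":
        return ["disk: data", "journal: metadata", "disk: metadata"]
    if mode == "writeback":
        # Data may reach disk before or after the metadata is journaled.
        return ["journal: metadata", "disk: metadata", "disk: data (any time)"]
    raise ValueError(f"unknown data mode: {mode}")


for mode in ("journal", "ordered", "writeback"):
    print(f"data={mode}: " + " -> ".join(write_plan(mode)))
```

In ordered mode the data write comes first, so journaled metadata never points at blocks that have not been written; writeback mode drops that ordering in exchange for performance.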
  39. Summary – Comparison.  Columns: free block management; extents for free space; B-trees for directories; extents for file block addressing; dynamic i-node allocation; sparse file support.  XFS: B+ tree indexed by offset and size; YES; YES; YES; YES; YES.  JFS: tree + binary buddy; NO; YES; YES; YES; YES.  ReiserFS: bitmap; not supported; as sub-tree of main FS tree; within file system tree; YES; YES.  Ext3: supports none of these, since it is layered over ext2fs, but it does provide journaling support; remaining entries: NO; NA.
  40. Conclusion.  With the ever-increasing demand for storage, journaling file systems are becoming very important.  Each file system discussed has its advantages and disadvantages.  XFS and JFS have proven records on high-end servers.  Their ports to open source Linux will eventually benefit the industry.  ReiserFS gives high performance for small files, and version 4 will improve security.  Ext3 has the advantage of in-place upgrades from ext2 without a backup and restore, and it offers data journaling.