How the Journaled File System cuts system restart times to the quick
Steve Best (sbest@us.ibm.com), IBM
01 Jan 2000
JFS provides fast file system restart in the event of a system crash. Using database journaling techniques, JFS can restore a file system to a consistent state in a matter of seconds or minutes, versus hours or days with non-journaled file systems. This white paper gives an overview of the JFS architecture and describes the design features, potential limits, and administrative utilities of the JFS technology available on developerWorks.
The Journaled File System (JFS) provides a log-based, byte-level file system developed for transaction-oriented, high-performance systems. Scalable and robust, its advantage over non-journaled file systems is its quick restart capability: JFS can restore a file system to a consistent state in a matter of seconds or minutes.
While tailored primarily for the high throughput and reliability requirements of servers (from single-processor systems to advanced multiprocessor and clustered systems), JFS is also applicable to client configurations where performance and reliability are desired.
Architecture and design
The JFS architecture can be explained in the context of its disk layout characteristics.
Logical volumes
The basis for all file system discussion is a logical volume of some sort. This could be a physical disk or some subset of the physical disk space, such as an FDISK partition. A logical volume is also known as a disk partition.
Aggregates and filesets
The file system creation utility, mkfs, creates an aggregate that is wholly contained within a partition. An aggregate is an array of disk blocks laid out in a specific format that includes a superblock and an allocation map. The superblock identifies the partition as a JFS aggregate, while the allocation map describes the allocation state of each data block within the aggregate. The format also includes the initial fileset and the control structures necessary to describe it. The fileset is the mountable entity.
Files, directories, inodes, and addressing structures
A fileset contains files and directories. Files and directories are represented persistently by inodes; each inode describes the attributes of the file or directory and serves as the starting point for finding the file or directory's data on disk. JFS also uses inodes to represent other file system objects, such as the map that describes the allocation state and on-disk location of each inode in the fileset.
Directories map user-specified names to the inodes allocated for files and directories, forming the traditional name hierarchy. Files contain user data, with no restrictions or formats implied in that data; JFS treats user data as an uninterpreted byte stream. Extent-based addressing structures rooted in the inode are used to map file data to disk. Together, the aggregate superblock and disk allocation map, the file descriptor and inode map, the inodes, the directories, and the addressing structures make up the JFS control structures, or meta-data.
Logs
A JFS log is maintained in each aggregate and is used to record information about operations on meta-data. The log has a format that is also set by the file system creation utility. A single log may be used simultaneously by multiple mounted filesets within the aggregate.
Design features
JFS was designed with journaling fully integrated from the start, rather than adding journaling to an existing file system. A number of features distinguish JFS from other file systems.
Journaling
JFS provides improved structural consistency and recoverability, and much faster restart times, than non-journaled file systems such as HPFS, ext2, and traditional UNIX file systems. Those file systems are subject to corruption in the event of system failure, because a logical write operation often takes multiple media I/Os to complete and may not be fully reflected on the media at any given time. They rely on restart-time utilities (that is, fsck), which examine all of the file system's meta-data (such as directories and disk addressing structures) to detect and repair structural integrity problems. This is a time-consuming and error-prone process which, in the worst case, can lose or misplace data.
In contrast, JFS uses techniques originally developed for databases to log information about operations performed on the file system meta-data as atomic transactions. In the event of a system failure, the file system is restored to a consistent state by replaying the log and applying the log records for the appropriate transactions. The recovery time associated with this log-based approach is much faster, since the replay utility need only examine the log records produced by recent file system activity rather than all of the file system's meta-data.
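The replay step can be sketched as follows. This is a minimal illustration in C, assuming a hypothetical log record layout; the actual JFS on-disk log format differs. It shows the essence of redo-only recovery: scan the log once and re-apply the after-image of every committed record, skipping records that belong to transactions whose commit was never logged.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical log record: a physical after-image of a few bytes of
     * one meta-data block, tagged with the owning transaction. */
    struct log_record {
        uint64_t tid;         /* transaction identifier */
        uint64_t block;       /* meta-data block number the change applies to */
        uint32_t offset;      /* byte offset of the change within that block */
        uint32_t length;      /* length of the after-image in bytes */
        uint8_t  after[256];  /* the new contents of those bytes */
        int      committed;   /* nonzero if the transaction's commit record was seen */
    };

    /* Redo-only replay: re-apply every committed after-image in log order.
     * Nothing is ever undone, so uncommitted work is simply ignored. */
    static void replay_log(const struct log_record *log, size_t nrecords,
                           uint8_t (*blocks)[4096])
    {
        for (size_t i = 0; i < nrecords; i++) {
            const struct log_record *r = &log[i];
            if (!r->committed)
                continue;   /* incomplete transaction: leave meta-data alone */
            memcpy(blocks[r->block] + r->offset, r->after, r->length);
        }
    }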
Several other aspects of log-based recovery are worth noting. First, JFS only logs operations on meta-data, so replaying the log only restores the consistency of structural relationships and resource allocation states within the file system. It does not log file data and does not recover that data to a consistent state. Consequently, some file data may be lost or stale after recovery, and users with a critical need for data consistency should use synchronous I/O.
Logging is not particularly effective in the face of media errors. Specifically, an I/O error while writing the log or meta-data to disk means that a time-consuming and potentially intrusive full integrity check is required after a system crash to restore the file system to a consistent state. This implies that bad-block relocation is a key feature of any storage manager or device residing below JFS.
The semantics of JFS logging are as follows: when a file system operation involving meta-data changes -- for example, unlink() -- returns a successful return code, the effects of the operation have been committed to the file system and will survive even if the system crashes. For example, once a file has been successfully removed, it remains removed and will not reappear if the system crashes and is restarted.
This style of logging introduces a synchronous write to the log disk into each inode or vfs operation that modifies meta-data. (For database mavens, this is a redo-only, physical after-image, write-ahead logging protocol using a no-steal buffer policy.) In terms of performance, this compares well with the many non-journaled file systems that rely on (multiple) careful synchronous meta-data writes for consistency. However, it is at a performance disadvantage compared with other journaling file systems, such as Veritas VxFS and Transarc Episode, which use different logging styles and write log data to disk lazily. In a server environment, where multiple concurrent operations are performed, this performance cost is reduced by group commit, which combines multiple synchronous write operations into a single write. The JFS logging style has been improved over time and now provides asynchronous logging, which increases file system performance.
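The group-commit idea can be illustrated with the following simplified POSIX-threads sketch; the helper names log_append and log_write_buffer and the LSN bookkeeping are assumptions for illustration, not the actual JFS interfaces. Each operation appends its record to a shared in-memory log buffer and then calls log_force(); whichever thread performs the flush writes the whole buffer once, and every waiter whose record is covered by that write returns without issuing its own I/O.

    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  log_flushed = PTHREAD_COND_INITIALIZER;
    static uint64_t next_lsn = 1;       /* next log sequence number to hand out */
    static uint64_t flushed_lsn = 0;    /* highest LSN known to be on disk */
    static int flush_in_progress = 0;

    extern uint64_t log_append(const void *rec, unsigned len); /* buffers rec, returns its LSN */
    extern void     log_write_buffer(void);                    /* one synchronous write of the buffer */

    /* Wait until the record with LSN my_lsn is durable.  Many concurrent
     * callers are typically satisfied by a single write (group commit). */
    void log_force(uint64_t my_lsn)
    {
        pthread_mutex_lock(&log_lock);
        while (flushed_lsn < my_lsn) {
            if (flush_in_progress) {
                /* Someone else is writing; their write may cover this record too. */
                pthread_cond_wait(&log_flushed, &log_lock);
            } else {
                flush_in_progress = 1;
                uint64_t target = next_lsn - 1;  /* everything appended so far */
                pthread_mutex_unlock(&log_lock);
                log_write_buffer();              /* the single grouped write */
                pthread_mutex_lock(&log_lock);
                flushed_lsn = target;
                flush_in_progress = 0;
                pthread_cond_broadcast(&log_flushed);
            }
        }
        pthread_mutex_unlock(&log_lock);
    }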
Extent-based addressing structures
JFS uses extent-based addressing structures, along with aggressive block allocation policies, to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. An extent is a sequence of contiguous blocks allocated to a file as a unit, described by a triple consisting of <logical offset, length, physical address>. The addressing structure is a B+tree populated with extent descriptors (the triples above), rooted in the inode and keyed by logical offset within the file.
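As an illustration, an extent descriptor and the logical-to-physical translation it supports might look like the C sketch below; the structure and field names are hypothetical, and the real JFS on-disk layout differs in detail. A full implementation would binary-search the B+tree keys and descend from the root held in the inode; the linear scan of a single leaf here just shows the mapping itself.

    #include <stdint.h>
    #include <stddef.h>

    /* One extent: "length" contiguous disk blocks holding the file blocks
     * starting at "logical", beginning at disk block "physical". */
    struct extent {
        uint64_t logical;    /* first file (logical) block covered */
        uint32_t length;     /* number of contiguous blocks */
        uint64_t physical;   /* starting disk block address */
    };

    /* Translate a logical file block number to a disk block number by
     * scanning the extent descriptors of one leaf node. */
    static int64_t map_block(const struct extent *leaf, size_t n, uint64_t logical)
    {
        for (size_t i = 0; i < n; i++) {
            if (logical >= leaf[i].logical &&
                logical <  leaf[i].logical + leaf[i].length)
                return (int64_t)(leaf[i].physical + (logical - leaf[i].logical));
        }
        return -1;   /* no extent covers this block: a hole in a sparse file */
    }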
Variable block size
JFS supports block sizes of 512, 1024, 2048, and 4096 bytes on a per-file-system basis, allowing users to optimize space utilization based on their application environment. Smaller block sizes reduce the amount of internal fragmentation within files and directories and are more space efficient. However, small blocks can increase path length, since block allocation activities may occur more often than with a larger block size. The default block size is 4096 bytes, since performance, rather than space utilization, is generally the primary consideration for server systems.
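A quick back-of-the-envelope calculation illustrates the trade-off: for a given file, smaller blocks waste less space in the final, partially filled block, at the cost of allocating more blocks. The small program below just works through the arithmetic for an arbitrary 5000-byte file.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long block_sizes[] = { 512, 1024, 2048, 4096 };
        const unsigned long file_bytes = 5000;   /* an arbitrary small file */

        for (int i = 0; i < 4; i++) {
            unsigned long bs = block_sizes[i];
            unsigned long blocks = (file_bytes + bs - 1) / bs;  /* round up */
            unsigned long wasted = blocks * bs - file_bytes;    /* internal fragmentation */
            printf("block size %4lu: %2lu blocks allocated, %4lu bytes wasted\n",
                   bs, blocks, wasted);
        }
        return 0;
    }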
Dynamic disk inode allocation
JFS dynamically allocates space for disk inodes as required, freeing the space when it is no longer needed. This avoids the traditional approach of reserving a fixed amount of space for disk inodes at file system creation time, so users no longer need to estimate the maximum number of files and directories that a file system will contain. Additionally, it decouples disk inodes from fixed disk locations.
Directory organization
JFS provides two different directory organizations. The first is used for small directories and stores the directory contents within the directory's inode. This eliminates the need for separate directory block I/O as well as the need to allocate separate storage. Up to 8 entries may be stored in-line within the inode, excluding the self (.) and parent (..) directory entries, which are stored in separate areas of the inode.
The second organization is used for larger directories and represents each directory as a B+tree keyed on name. It provides faster directory lookup, insertion, and deletion than traditional unsorted directory organizations.
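The two forms can be sketched roughly as follows; the field names are hypothetical and the real JFS on-disk directory format differs in detail. A small directory keeps its entries in the inode itself, and once it outgrows the in-line slots the entries move into a B+tree keyed on name.

    #include <stdint.h>

    #define INLINE_SLOTS 8          /* small directories fit entirely in the inode */

    struct inline_dirent {
        uint32_t inum;              /* inode number of the entry */
        char     name[20];          /* entry name stored in-line */
    };

    struct dir_inode {
        uint32_t self_inum;         /* "." -- kept in its own field, not a slot */
        uint32_t parent_inum;       /* ".." -- likewise */
        uint8_t  nslots;            /* number of in-line entries actually used */
        struct inline_dirent slot[INLINE_SLOTS];
        /* When an insertion would need a ninth slot, the entries are moved
         * out of the inode into a B+tree keyed on name, giving O(log n)
         * lookup, insertion, and deletion for large directories. */
        uint64_t btree_root;        /* disk address of the B+tree root, if any */
    };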
Sparse and dense files
JFS supports both sparse and dense files, on a per-file-system basis.
Sparse files allow data to be written to random locations within a file without instantiating previously unwritten intervening file blocks. The reported file size is the highest byte that has been written to, but the actual allocation of any given block in the file does not occur until a write is performed on that block. For example, suppose a new file is created in a file system designated for sparse files. An application writes a block of data to block 100 of the file. JFS will report the size of this file as 100 blocks, although only 1 block of disk space has been allocated to it. If the application next reads block 50 of the file, JFS will return a block of zero-filled bytes. Suppose the application then writes a block of data to block 50 of the file. JFS will still report the size of the file as 100 blocks, while 2 blocks of disk space have now been allocated to it. Sparse files are of interest to applications that require a large logical space but only use a (small) subset of it.
For dense files, disk resources are allocated to cover the file size. In the above example, the first write (a block of data to block 100 of the file) would cause 100 blocks of disk space to be allocated to the file. A read of any block that has been implicitly written will return a block of zero-filled bytes, just as in the sparse-file case.
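The sparse-file behaviour can be observed with ordinary POSIX calls. The short demonstration below (generic POSIX code, not JFS-specific; the file name sparse.dat is arbitrary) writes one 4 KB block as the 100th block of a new file and then compares the logical size (st_size) with the space actually allocated (st_blocks, counted in 512-byte units). On a file system with sparse-file support, the allocated space stays close to a single block.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        char block[4096];
        memset(block, 'x', sizeof block);

        int fd = open("sparse.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Skip the first 99 blocks without writing them, then write block 100. */
        if (lseek(fd, 99L * sizeof block, SEEK_SET) < 0) { perror("lseek"); return 1; }
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block) { perror("write"); return 1; }

        struct stat st;
        fstat(fd, &st);
        printf("logical size : %lld bytes\n", (long long)st.st_size);         /* 409600 */
        printf("allocated    : %lld bytes\n", (long long)st.st_blocks * 512);  /* ~4096 if sparse */

        close(fd);
        return 0;
    }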
Internal JFS (potential) limits
JFS is a full 64-bit file system. All of the relevant file system structure fields are 64 bits in size. This allows JFS to support both large files and large partitions.
File system size
The minimum file system size supported by JFS is 16 megabytes. The maximum file system size is a function of the file system block size and the maximum number of blocks supported by the file system meta-data structures. JFS will support a maximum file size of 512 terabytes (with a block size of 512 bytes) to 4 petabytes (with a block size of 4 kilobytes).
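Both endpoints quoted above correspond to the same maximum block count, which works out to 2^40 addressable blocks; that figure is derived here from the stated limits rather than quoted from the JFS specification. The arithmetic:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint64_t max_blocks = 1ULL << 40;   /* ~1.1 x 10^12 blocks */

        /* 2^40 blocks x 512 bytes  = 2^49 bytes = 512 TB */
        printf("512-byte blocks : %llu TB\n",
               (unsigned long long)((max_blocks * 512) >> 40));
        /* 2^40 blocks x 4096 bytes = 2^52 bytes = 4 PB */
        printf("4096-byte blocks: %llu PB\n",
               (unsigned long long)((max_blocks * 4096) >> 50));
        return 0;
    }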
File size
The maximum file size is the largest file size that the virtual file system framework supports. For example, if the framework only supports 32 bits, then this limits the file size.
Removable media
JFS does not support diskettes as an underlying file system device.
Standard administrative utilities
JFS provides standard administrative utilities for creating and maintaining file systems.
Create a file system
This utility provides the JFS-specific portion of the mkfs command and initializes a JFS file system on a specified drive. It operates at a low level and assumes that any creation or initialization of the volume on which the file system resides is handled at a higher level by another utility.
Check/repair a file system
This utility provides the JFS-specific portion of the fsck command. It checks the file system for consistency and repairs any problems found. It also replays the log, applying committed changes to the file system meta-data. If the file system is declared clean as a result of the log replay, no further action is taken. If the file system is not deemed clean -- meaning that the log was not replayed completely and correctly for some reason, or that the file system could not be restored to a consistent state by replaying the log alone -- then a full pass over the file system is performed.
When performing a full integrity check, the check/repair utility's primary goal is to achieve a reliable file system state in order to prevent future file system corruption or failures; its secondary goal is to preserve data in the face of corruption. This means the utility may discard data in the interest of achieving file system consistency. Specifically, data is discarded when the utility cannot, without making assumptions, obtain the information needed to restore a structurally inconsistent file or directory to a consistent state. When an inconsistent file or directory is encountered, the entire file or directory is discarded, with no attempt to save any portion of it. Any files or subdirectories orphaned by the deletion of a corrupted directory are placed in the lost+found directory at the root of the file system.
One important consideration for a file system check/repair utility is the amount of virtual memory it requires. Traditionally, the amount of virtual memory required by such utilities depends on the size of the file system, since the bulk of the virtual memory is used to track the allocation state of the individual blocks in the file system. As file systems grow larger, the number of blocks increases, and so does the amount of virtual memory needed to track those blocks.
The design of the JFS check/repair utility differs in that its virtual memory requirements are dictated by the number of files and directories (rather than the number of blocks) in the file system. The requirement is on the order of 32 bytes per file or directory, or approximately 32 megabytes for a file system containing one million files and directories, regardless of file system size. Like all other file systems, the JFS utility needs to track block allocation states, but it avoids using virtual memory for this by using a small reserved work area located within the actual file system.
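The quoted figure is simple arithmetic: roughly 32 bytes of bookkeeping per file or directory, so one million files and directories need on the order of 32 MB of virtual memory regardless of how large the file system is.

    #include <stdio.h>

    int main(void)
    {
        unsigned long entries = 1000000UL;      /* files plus directories */
        unsigned long bytes_per_entry = 32UL;   /* per-entry bookkeeping */

        printf("approx. %lu MB of virtual memory\n",
               entries * bytes_per_entry / (1000UL * 1000UL));  /* ~32 MB */
        return 0;
    }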
Conclusion
JFS is a key technology for Internet file servers because it provides fast file system restart times in the event of a system crash. Using database journaling techniques, JFS can restore a file system to a consistent state in a matter of seconds or minutes, whereas recovery on a non-journaled file system can take hours or days. Most file server users cannot tolerate the downtime associated with non-journaled file systems. Only by shifting to journaling technology can these file systems avoid the time-consuming process of examining all of a file system's meta-data to verify the file system or restore it to a consistent state.
Resources
* You can read the original English version of this article on the developerWorks global site.
* JFS open source, on developerWorks
* IBM makes JFS technology available for Linux, a dW feature story
About the author
Steve Best works in IBM's Software Solutions & Strategy Division in Austin, Texas, as a member of the File System development department. Steve has worked on operating system development in the areas of file systems, internationalization, and security. He is currently working on the port of JFS to Linux. He can be reached at sbest@us.ibm.com.