We have a SAN array at work, with a 26TB volume carved out of it and attached to an Ubuntu Server via open-iscsi; that server then exports the 26TB over NFS to the other clients. Everything had been working fine until one day the clients suddenly started getting a "Stale NFS file handle" error when mounting. After countless rounds of Googling I finally found the solution, so I'm recording it here.
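For context, attaching the LUN on the Ubuntu side looks roughly like this; the portal IP and target IQN below are made-up placeholders, not the real ones:
$ sudo iscsiadm -m discovery -t sendtargets -p 192.168.0.10        # discover targets on the SAN portal (hypothetical IP)
$ sudo iscsiadm -m node -T iqn.2001-05.com.example:storage.lun0 -p 192.168.0.10 --login    # log in to the target (hypothetical IQN)
$ lsblk                                                            # the new block device shows up, ready for LVM / XFS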
At first I suspected logical corruption on the partition and set out to fsck it (the fsck command cannot be used on an XFS filesystem), which led to the following:
$ sudo xfs_check /dev/vg-15k/users
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
$ sudo xfs_repair /dev/vg-15k/users
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
$ sudo xfs_repair -L /dev/vg-15k/users
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
- scan filesystem freespace and inode maps...
sb_icount 64, counted 4884352
sb_ifree 61, counted 15726
sb_fdblocks 2683832778, counted 1409694604
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 3
- agno = 2
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 9
- agno = 8
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
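At this point a quick local mount would be a reasonable sanity check; assuming the device and mount point from the fstab shown further below, something like:
$ sudo mount /dev/vg-15k/users /data/users    # mount the repaired XFS volume locally
$ df -h /data/users                           # confirm the filesystem is usable and shows the expected size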
I thought the repair would be the end of it, but the problem remained: clients still got the Stale NFS file handle error when mounting.
After yet more searching, I eventually came across the following passage (see also the XFS FAQ on the official site):
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk.
If you have a disk with 100TB, all inodes will be stuck in the first TB.
This can lead to strange things like "disk full" when you still have plenty space free,
but there's no more place in the first TB to create a new inode. Also, performance sucks.
To come around this, use the inode64 mount options for filesystems >1TB.
Inodes will then be placed in the location where their data is, minimizing disk seeks.
In short: by default (inode32) XFS places all inodes within the first 1TB of the disk. On a 100TB disk every inode is crammed into that first terabyte, which leads to odd symptoms such as "disk full" even though plenty of space remains free, simply because no new inode can be created in the first 1TB; performance also degrades badly. To avoid this, mount filesystems larger than 1TB with the inode64 option, so inodes are placed where their data lives, minimizing disk seeks.
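One way to see the effect for yourself is to look at the inode numbers on the filesystem; with inode64 they can exceed 2^32, which is exactly what trips up the default NFS subdirectory export described next (the paths below are just examples):
$ sudo find /data/users -printf '%i\n' | sort -n | tail -1    # largest inode number on the filesystem
$ stat -c '%i %n' /data/users/somefile                        # inode number of a single (hypothetical) file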
The same article on the XFS site also covers "Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work":
The default fsid type encodes only 32-bit of the inode number for subdirectory exports. However, exporting the root of the filesystem works, or using one of the non-default fsid types (fsid=uuid in /etc/exports with recent nfs-utils) should work as well.
NFS uses the fsid to identify each filesystem it exports; normally it derives one from the filesystem's UUID or from the device number of the device the filesystem lives on.
Some filesystems have no UUID, and not every filesystem sits directly on a device, so in those cases an explicit fsid has to be specified so that NFS can identify the filesystem.
The default fsid type encodes only 32 bits of the inode number for subdirectory exports, so the workaround is either to export the root of the filesystem or to use a non-default fsid type.
The fix:
1. When mounting the XFS partition, remember to add the inode64 option; otherwise it will be mounted with the default inode32.
2. When exporting the local directory over NFS, remember to add an fsid=XX option.
Here is an example:
$ sudo mount -o remount,inode64 /dev/diskpart /mnt/xfs
$ cat /etc/fstab
……
/dev/vg-15k/users /data/users xfs defaults,noatime,nobarrier,inode64 0 0
……
$ cat /etc/exports
……
/data/usersA *(rw,async,no_root_squash,no_subtree_check,fsid=1)
/data/usersB *(rw,async,no_root_squash,no_subtree_check,fsid=2)
/data/usersC *(rw,async,no_root_squash,no_subtree_check,fsid=3)
/data/usersD *(rw,async,no_root_squash,no_subtree_check,fsid=4)
/data/usersE *(rw,async,no_root_squash,no_subtree_check,fsid=10)
/data/usersF *(rw,async,no_root_squash,no_subtree_check,fsid=100)
/data/usersG *(rw,async,no_root_squash,no_subtree_check,fsid=500)
……
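After editing /etc/exports, the new options don't take effect until the export table is reloaded, and the clients have to drop their stale mounts; roughly (server name and client mount point are placeholders):
$ sudo exportfs -ra                                       # on the server: re-read /etc/exports
$ sudo exportfs -v                                        # verify that fsid= now appears in the export options
$ sudo umount -f /mnt/usersA                              # on the client: drop the stale mount (hypothetical mount point)
$ sudo mount -t nfs nfs-server:/data/usersA /mnt/usersA   # remount against the new export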