GPFS Problems
Background:

We run a public ftp/http/rsync server for free software at ftp.acc.umu.se. Among other things we distribute Debian GNU/Linux (as ftp.se.debian.org) and GNOME (as ftp.gnome.org). The filesystem (ftpmirror) is about 66 gigabytes in size and contains about 160000 (160k) i-nodes, of which ~133k are files, ~17k are symlinks and ~9k are directories.

History:

Our old server was a Sun server (dual 60MHz supersparc) that had some performance issues. This was especially noticeable with rsync: we could only allow 8 simultaneous rsyncs without getting problems with timeouts, lost connections and an unresponsive server. The new server was made possible thanks to the donation of an old IBM SP from PDC. We decided to use GPFS for shared storage, and a cluster solution with two frontends, two disk backends and a loadbalancer for the rsync service.

Current setup:

Hardware: All nodes are 66.7 MHz Power2 wide nodes with 512 megabytes of ram, except churchill, which has 1024 megabytes of ram.

The frontend nodes tutankhamon and napoleon run http and ftp services and a loadbalancer that distributes the rsync connections across the cluster. The two frontends are balanced with a dns round-robin setup. These nodes also do rsync "behind" the loadbalancer.

The two disk backends are xiaoping and brezhnev. These nodes also do rsync. Xiaoping has one 36-gig disk and two 9-gig disks. Brezhnev has one 36-gig disk, one 18-gig disk and four 9-gig disks. The reason for the uneven disk distribution is that most of the disks in brezhnev were in use in the old server, and we used some additional storage in xiaoping to switch servers without downtime.

The additional backend node churchill does rsync, updating and backups. It is also the file system manager node for the ftpmirror filesystem.
Relevant gpfs nodeset config:

  pagepool          42M
  maxFilesToCache   16K
  comm_protocol     TCP
  maxStatCache      256K

Relevant gpfs filesystem config:

  -s balancedRandom   Stripe method
  -B 262144           Block size
  -m 2                Default number of metadata replicas
  -F 262656           Maximum number of inodes

VSD/MMFS fileset info:

  vsd.cmi   3.2.0.4   VSD Centralized Management Interface (SMIT)
  vsd.hsd   3.2.0.1   VSD Hashed Shared Disk
                      Recoverable VSD Connection Manager
                      Recoverable VSD Daemon
                      Recoverable VSD Recovery Scripts
                      VSD sysctl commands
                      VSD Device Driver
                      GPFS File Manager Commands
                      GPFS File Manager
                      GPFS File Manager
                      GPFS Server Messages - U.S. English
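As a rough sanity check of the cache settings above against the i-node counts given in the background, the arithmetic can be sketched as below. This assumes that a trailing K means a multiple of 1024 and that one cache entry corresponds to one i-node; neither assumption is confirmed here, and the helper names parse_count and cache_coverage are made up for this illustration.

```python
def parse_count(text):
    """Convert a setting such as '16K' to an integer.

    Assumption: a trailing K means a multiple of 1024.
    """
    text = text.strip().upper()
    if text.endswith("K"):
        return int(text[:-1]) * 1024
    return int(text)


def cache_coverage(inodes_in_use, max_files_to_cache, max_stat_cache):
    """Return cache sizes and the fraction of i-nodes each could hold.

    Assumption: one cache entry corresponds to one i-node.
    """
    files = parse_count(max_files_to_cache)
    stats = parse_count(max_stat_cache)
    return {
        "file_cache_entries": files,
        "stat_cache_entries": stats,
        "file_cache_fraction": min(1.0, files / inodes_in_use),
        "stat_cache_fraction": min(1.0, stats / inodes_in_use),
    }


if __name__ == "__main__":
    # 160000 i-nodes in use, maxFilesToCache 16K, maxStatCache 256K
    # (values taken from the text above).
    for key, value in cache_coverage(160000, "16K", "256K").items():
        print(key, value)
```

Under these assumptions maxStatCache nominally exceeds the roughly 160k i-nodes in use, while maxFilesToCache covers about a tenth of them.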
Nodes are running AIX 4.3.3 with recent patches applied (as of September 2001).

Problems:

1. Lack of replication control on a filesystem or disk level.

At first, we wanted to make the disk backend nodes redundant by replicating all data across both nodes. But since many of the disks were already in use in the old server, we did not have the space to replicate all data at migration time. We then tried, repeatedly, to make replication work. The main problem is that mmchfs only changes the default for new files, whereas we wanted to change the replication for the entire filesystem. After some testing, we found that metadata replication did not seem to be changeable after the creation of the filesystem. Even when we created the filesystem with metadata replication and changed the replication of all files (via find and mmchattr), the filesystem did not survive a reboot of one of the disk nodes.

What we want is filesystem-level replication. We would like to say "now the filesystem should be replicated, both data and metadata", and then have all data replicated across the nodes, protecting against node failure.

[2002-01-11] Should be fixed in IY24954.

2. Mistakes in management often led to large-scale filesystem corruption.

We had several occasions of filesystem corruption, leading to a restore from backup and big service disruptions for our users. One of the causes was a mismatch between vsd buffer sizes: we missed the step of "dsh ucfgvsd -a" when changing the vsd buffer size. We also had data corruption due to bad hard drives, but this is perhaps not something we can blame gpfs for. One would expect gpfs to realise that a drive has gone bad when there are hard LVM write errors on it, but it seems that gpfs/vsd did not notice.

3. Bad rsync performance.

rsync demands good metadata performance, since it essentially calls stat()/lstat() on all files in order to figure out which files have changed.
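The access pattern described above can be approximated without rsync. The following is a minimal sketch, not the benchmark used for the timings in this report: it walks a tree, calls lstat() on every entry and readlink() on every symlink, and reports counts and elapsed time. The function name metadata_walk is made up for this illustration.

```python
import os
import stat
import sys
import time


def metadata_walk(root):
    """lstat() every entry below root and readlink() every symlink.

    Returns (counts, elapsed_seconds). No file data is read, so only
    metadata operations are exercised. The root directory itself is
    not included in the counts.
    """
    counts = {"files": 0, "symlinks": 0, "dirs": 0, "other": 0}
    start = time.perf_counter()
    pending = [root]
    while pending:
        directory = pending.pop()
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            mode = os.lstat(path).st_mode  # does not follow symlinks
            if stat.S_ISLNK(mode):
                os.readlink(path)
                counts["symlinks"] += 1
            elif stat.S_ISDIR(mode):
                counts["dirs"] += 1
                pending.append(path)  # descend later
            elif stat.S_ISREG(mode):
                counts["files"] += 1
            else:
                counts["other"] += 1
    return counts, time.perf_counter() - start


if __name__ == "__main__":
    tree = sys.argv[1] if len(sys.argv) > 1 else "."
    counts, elapsed = metadata_walk(tree)
    print(counts, "in %.2f s" % elapsed)
```

Running it twice in a row on the same tree, with and without other load on the filesystem, gives a rough idea of how much of the metadata is served from cache.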
We realize that there might still be tuning to do, but right now a benchmark of rsync over the debian dataset (rsync -aqn) gives times from 9 to 30 minutes, while the same test on jfs shows times from 3 to 5 minutes. Vfsstats for a typical rsync-only node are shown in the file "brezhnev-vfsstats.txt"; these statistics have been gathered over a period of several days.

A few items that we find odd:

* Reading and writing data on the filesystem seems to purge the stat cache. If we only run repetitive metadata benchmarks we get quite predictable results. If we apply "standard load" on the filesystem, i.e. enable ftp/http access, the metadata cache seems to get expired.

* readlink seems to take a significant amount of time. Optimizing this syscall should improve speed dramatically.

4. No way of specifying "noatime" for the gpfs filesystem.

This is a problem we solved with some kludges, but we do not understand why it isn't a basic feature. There seems to be some functionality in mmchfs and mmlsfs that indicates that such a feature should exist, but we were unable to make it work. We made some local modifications to some gpfs system management scripts, so that our filesystem is always mounted noatime. This is essential from a performance point of view, because we want working and active caches of directories and metadata available at all nodes at all times. We do not want any node to issue a revoke just so that it can write atime, since atime is of no interest to us.

Ultimately there should be a way to specify that some nodes have read-only access to the data. This ought to enable further optimizations, since there are no concerns about those nodes holding write tokens and so on.

[Fixed by IBM.]

5. Very high load on filesystem manager node after reset.

When for some reason the nodes drop all their tokens (after a reboot of all nodes, a restart of mmfsd or similar), the file system manager node has very high load (50+) and is pretty much unusable for any other processes for 8 to 12 hours.
This is probably because it has to deal out read tokens for all metadata to all nodes, as quickly as possible. Also, the file system manager seems to get "load peaks" at regular intervals after an mmfs restart on all nodes. This is probably due to some garbage/sanity check that occurs after a token has existed for a specific amount of time, since all tokens will have essentially the same creation time after startup.

One workaround that we have considered, and tested, is to run rsync only on the two frontend nodes tutankhamon and napoleon at first, in order to reduce the number of tokens the filesystem manager has to distribute and maintain. This helps a bit: the node still lags and has high load, but it is not as bad as otherwise. One problem is that the file system manager node gets quite high load again when further nodes are enabled later.
[IBM recommends faster hardware and perhaps a software upgrade]

6. Unlink seems slow

When running rsync to update our mirror we have observed that it takes a very long time to delete (unlink) files. In some cases we have observed times of approximately 1 second per file (very noticeable, since the Debian master site has a 1200 second IO timeout on rsync connections). This is probably related to the fact that all read tokens for the file have to be revoked before the unlink syscall returns.

For our purposes, there is no need to have all tokens revoked when unlink returns, since the data on the other nodes is essentially read-only. A "sloppy" unlink that returns as quickly as possible and handles token revocation and so on as it sees fit would be enough. Essentially, we need rsync to be able to delete a lot of files quickly, and we do not require the delete to happen instantly on all nodes. It should also be possible to do some optimization when mmfs realises that a process is unlinking hundreds of files.

[2002-01-11] Actually this only seems to appear as a problem after a reset, when the filesystem manager node has a very high load. Merged with the problem numbered 5 here.

7. Unable to handle node failures with KLAPI

When running with KLAPI for VSD, a node failure was fatal (filesystem unmounted) even when the filesystem had been created with full replication.

[We do not run with KLAPI anymore]

8. Unable to reinstall nodes smoothly with LAPI

When running LAPI as transport for gpfs, a reinstalled node was unable to start gpfs ("node has new flag in cluster.nodes" or something similar). Reinstalling a node seems to have exactly the same problem as adding a node to a nodeset in LAPI configurations.

[We are not using LAPI anymore]

9. mmadddisk hangs at "Extending Allocation Map"

Sometimes when adding a disk to a filesystem with mmadddisk, the command will stop after printing "Extending Allocation Map" and stay that way for well over half an hour.
Some other mmfs commands also hang after this (like mmdf). The command mmlsdisk gives the status as "allocmap add". All write attempts also hang, but reading can continue. This might be triggered by not having a valid kerberos ticket when executing the command on the cws; that is the only odd thing I have noticed when it has failed.

If the command is interrupted, most gpfs commands just hang when trying to do something with the filesystem (even mmlsfs hangs). The only thing I have found that fixes this is a restart of mmfs on the filesystem manager node and some other node. It is possible that there exists a less invasive way of getting it to work, but it is nontrivial to find.

[IBM thinks this is not a bug]

[added 2001-11-06]
10. Spontaneous reboot of file system manager node

We have managed to find a repeatable way to reboot the file system manager of a GPFS filesystem by having activity in the filesystem. With write activity now and then over a long time (several days), the file system manager reboots and also clears the errpt log (making it harder to find out what went wrong). We have only managed to repeat this on one of the nodes, since the others run services that we do not want to interrupt unnecessarily. Apart from this case (activity on the filesystem, on the file system manager node), the nodes are stable. We do not know what causes this and we haven't found any information that seems relevant in any log files, but we may have missed some of the files.

This system contains two 36-gig disks, mmlsdisk output:

disk         driver   sector failure holds    holds
name         type       size   group metadata data  status        availability
------------ -------- ------ ------- -------- ----- ------------- ------------
gpfs36vsd    disk        512    4007 yes      yes   ready         up
gpfs37vsd    disk        512    4005 yes      yes   ready         up

[2002-01-11] I was unable to reproduce this problem with debug information.

[added 2002-02-15]
11. mmfsd segmentation fault

When a GPFS filesystem is mounted ro,noatime, mmfsd seems to crash when a token is revoked. It works when the filesystem is mounted rw,noatime. This issue was very obvious on our frontend machines, which had mmfs crashing every 5 minutes or so when running mmdefragfs on the filesystem (which causes lots of token revocations), until we returned to having the filesystems mounted rw. The mmfs logfiles at http://www.acc.umu.se/~maswan/util/ACC/crashlogs-2002-02-15.tar.Z show this clearly.

Current relevant fileset versions:

  mmfs.base.cmds     3.3.0.6    GPFS File Manager Commands
  mmfs.base.rte      3.3.0.7    GPFS File Manager
  mmfs.gpfs.rte      1.4.0.7    GPFS File Manager
  mmfs.msg.en_US     3.3.0.5    GPFS Server Messages - U.S. English
  vsd.cmi            3.2.0.4    VSD Centralized Management Interface (SMIT)
  vsd.hsd            3.2.0.2    VSD Hashed Shared Disk
  vsd.rvsd.hc        3.2.0.4    Recoverable VSD Connection Manager
  vsd.rvsd.rvsdd     3.2.0.8    Recoverable VSD Daemon
  vsd.rvsd.scripts   3.2.0.8    Recoverable VSD Recovery Scripts
  vsd.sysctl         3.2.0.4    VSD sysctl commands
  vsd.vsdd           3.2.0.12   VSD Device Driver
  bos.up             4.3.3.78   Base Operating System Uniprocessor Runtime
[IBM claims this to be fixed, we have not verified this yet]

[added 2003-04-24]
12. Even with full replication, filesystem is sometimes unmounted when a node is taken offline.

This is reported as vsd devices being unavailable, even though it should not be fatal. If the disks are nicely stopped with mmchdisk, this does not seem to occur.

[added 2003-04-24]
13. We have seen data and metadata corruption when writing to a filesystem with some disks down, even though we have full data and metadata replication.

Sometimes files being written are corrupted, and often symlinks are replaced with garbled data.