Filesystem Operations Hang

Hello,

We (the UVic HEPRC group) use CernVM4 to run HEP production jobs on different clouds using 8-core VMs, with 8 single-core jobs running in parallel. Occasionally we observe that a job stalls because a filesystem metadata operation hangs; often this is just an “ls -lR” in the HTCondor workdir. We suspect it may be related to the amount of filesystem activity, as these jobs pull in a large number of files to set up their environment instead of executing the software directly from cvmfs. When we run on 2-core VMs with 2 simultaneous jobs, we see this behaviour much less frequently. We have seen it on multiple clouds, so it does not appear to be a hypervisor issue. The rate is not entirely consistent, but it is roughly at the few-percent level for 8-core VMs.

Has anyone seen anything like this before? Any advice on debugging would be appreciated. We have tried both the latest version of CernVM4 and the one before it. We have also reproduced the behaviour by running “dd” as the job, 8 in parallel, each writing about 10 GB to disk.
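
For reference, a minimal sketch of the kind of reproducer we mean, run in the job scratch directory (filenames and block sizes are illustrative):

    # start 8 parallel writers, each producing a ~10 GB file
    for i in $(seq 1 8); do
        dd if=/dev/zero of=testfile_$i bs=1M count=10240 &
    done
    wait

While these run, metadata operations such as “ls -lR” in the same area occasionally hang, as described above.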

Thank you,

Tristan

I have seen this type of behavior when the root catalog is large. Does the repository you’re reading from have a .cvmfsdirtab file to create nested catalogs as appropriate?
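For example, a .cvmfsdirtab placed in the repository root could look something like this (the paths are only placeholders for whatever directory structure the repository actually has; each directory matching a pattern gets its own nested catalog on the next publish, which keeps the root catalog small):

    /software/*
    /software/*/*
    !/software/*/logs

If I remember the syntax correctly, entries starting with “!” exclude directories that would otherwise match.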

Thank you for your reply. This issue actually seems not to be related to CVMFS: we have reproduced it by writing large files to the local root disk of CernVM, with no CVMFS interaction at all.

I wonder if it is related to the fact that the CernVM root file system is a union file system mount. If you can narrow the large-file writes down to specific directories, you can try symlinking those directories to /mnt/.rw, which goes directly to the virtual disk. This would of course only work for cases where you don’t need to mix new content and existing content; e.g., it works for /var/lib/xyz but not for /usr/bin.
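
A minimal sketch of what I mean, using the /var/lib/xyz example (the directory name under /mnt/.rw is arbitrary):

    # move the write-heavy directory onto the virtual disk and symlink it back
    mkdir -p /mnt/.rw/var-lib-xyz
    cp -a /var/lib/xyz/. /mnt/.rw/var-lib-xyz/    # keep any existing content
    rm -rf /var/lib/xyz
    ln -s /mnt/.rw/var-lib-xyz /var/lib/xyz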

Thanks for the suggestion; this should be possible in our case, so we’ll give it a try.

It’s been three weeks since we made the change, and we haven’t seen the problem since, so it seems to have worked. Thanks very much again for your help!