Hello,
We (the UVic HEPRC group) use CernVM4 to run HEP production jobs on several clouds, using 8-core VMs with 8 single-core jobs running in parallel. Occasionally a job stalls because a filesystem metadata operation hangs. Often this is just an “ls -lR” in the HTCondor workdir.

We speculate that it is related to the amount of filesystem activity, as these jobs pull in a large number of files to set up their environment instead of executing the software directly from cvmfs. When we run on 2-core VMs with 2 simultaneous jobs, we see this behaviour much less frequently. We’ve seen it on multiple clouds, so it doesn’t appear to be an issue with the hypervisor. The rate isn’t entirely consistent, but it is roughly at the few percent level for 8-core VMs.

We’ve tried both the latest version of CernVM4 and the one before it. We have also reproduced the behaviour without the HEP workload, by running a plain “dd” as the job, 8 in parallel, each writing about 10GB to disk.

Has anyone seen anything like this before? Any advice on debugging is appreciated.
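In case it helps anyone try to reproduce this, here is a rough Python sketch of the same kind of test: several parallel writers, plus a periodic recursive listing that is timed. It is only an approximation of the “dd” and “ls -lR” combination, not the exact commands we ran. The function names, the probe interval, and the use of threads rather than separate dd processes are assumptions made for illustration.

```python
import os
import tempfile
import threading
import time


def write_file(path, total_bytes, block_size=1 << 20):
    """Write total_bytes of zeros to path in block_size chunks, then fsync."""
    block = b"\0" * block_size
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def timed_listing(root):
    """lstat every entry under root (similar in spirit to ls -lR).

    Returns (elapsed_seconds, entry_count). If the filesystem hangs,
    this call simply does not return, which is itself the symptom.
    """
    start = time.monotonic()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return time.monotonic() - start, count


def run_stress(workdir, writers=8, bytes_per_writer=10 * 1024**3,
               probe_interval=1.0):
    """Run parallel writers in workdir while timing metadata probes.

    Returns (list_of_probe_latencies, final_entry_count).
    """
    threads = [
        threading.Thread(
            target=write_file,
            args=(os.path.join(workdir, f"writer_{i}.dat"), bytes_per_writer),
        )
        for i in range(writers)
    ]
    for t in threads:
        t.start()

    latencies = []
    # Probe the directory tree for as long as any writer is still running.
    while any(t.is_alive() for t in threads):
        elapsed, _ = timed_listing(workdir)
        latencies.append(elapsed)
        time.sleep(probe_interval)

    for t in threads:
        t.join()

    # One last probe after all writers have finished.
    elapsed, count = timed_listing(workdir)
    latencies.append(elapsed)
    return latencies, count
```

Calling run_stress() on a scratch directory with the defaults writes about 80GB in total (8 writers, 10GB each), so the target disk needs that much free space. A probe latency that is far larger than the others, or a probe that never returns, would correspond to the stall described above.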
Thank you,
Tristan