We have recently added new nodes to our batch farm, each with 256 cores.
Users are reporting timeouts reading from CVMFS, mostly on those nodes.
Could client performance suffer as the number of cores increases?
This is the current setup; would it be better to increase these numbers:
More generally, the feedback from users mentions a lot of timeouts reading content from CVMFS.
What would be the best way to troubleshoot these timeouts?
At this number of cores it is quite possible that you run into scalability issues. Definitely use the latest CVMFS version (2.10.1, and 2.11 once it is released later this year).
Are the timeouts at the level of the user jobs, or are these CVMFS timeouts resulting in EIO errors? If it is the latter, it may rather be a Squid problem.
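One way to distinguish the two cases on an affected worker node is to query the client itself; a minimal sketch, using atlas.cern.ch as the example repository (adjust to your setup):

```shell
# Count of I/O (EIO) errors the cvmfs client has returned for this repository;
# a non-zero, growing value points at cvmfs-level failures rather than
# job-level timeouts:
sudo cvmfs_talk -i atlas.cern.ch nioerr

# General client statistics: cache hit rate, active proxy, timeouts, etc.
cvmfs_config stat -v atlas.cern.ch
```

If `nioerr` stays at zero while jobs still time out, the delay is more likely on the job/setup side (or simply slow, not failing, reads).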
It is really hard to tell whether it is actually a problem with CVMFS or not.
The error from ATLAS jobs is:
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase;
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh -c CentOS7']' timed out after 660 seconds
but there is no further indication of what exactly timed out.
Should there be anything in the logs on the client host (/var/log/messages in our case) when an attempt to read from CVMFS times out? I don’t see anything.
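For reference, the client's logging can be turned up so that problems do end up in a file; a minimal sketch of additions to /etc/cvmfs/default.local (values are illustrative examples, not recommendations; see the client parameter documentation for the defaults):

```shell
# /etc/cvmfs/default.local -- example debugging additions
# Write a very verbose debug log (enable only temporarily, it grows fast):
CVMFS_DEBUGLOG=/tmp/cvmfs-debug.log
# Raise syslog verbosity (1=errors only ... 3=most verbose):
CVMFS_SYSLOG_LEVEL=3
# Network behavior knobs that are often tuned on busy nodes
# (seconds for proxied downloads, and retries before giving up):
CVMFS_TIMEOUT=10
CVMFS_MAX_RETRIES=2
```

After editing, apply with `sudo cvmfs_config reload`. With the defaults, only warnings and errors go to syslog, so a genuinely failing (as opposed to merely slow) read should leave a trace in /var/log/messages.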