We have recently added new nodes to our batch farm, each with 256 cores.
Users are reporting timeouts reading from CVMFS, mostly on those nodes.
Could client performance suffer as the number of cores increases?
This is the current setup; would it be better to increase these numbers:
More generally, the feedback from users mentions a lot of timeouts reading content from CVMFS.
What would be the best way to troubleshoot these timeouts?
At this number of cores it is quite possible that you run into scalability issues. Definitely use the latest CVMFS version (2.10.1, and 2.11 once it is released later this year).
Are the timeouts at the level of the user jobs, or are these CVMFS timeouts resulting in EIO errors? If it is the latter, it may rather be a Squid problem.
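One way to distinguish the two cases on an affected worker node is to query the client itself; a minimal sketch, using atlas.cern.ch as the example repository (adjust to your setup):

```shell
# Count of I/O (EIO) errors the cvmfs client has returned for this repository;
# a non-zero, growing value points at cvmfs-level failures rather than
# job-level timeouts:
sudo cvmfs_talk -i atlas.cern.ch nioerr

# General client statistics: cache hit rate, active proxy, timeouts, etc.
cvmfs_config stat -v atlas.cern.ch
```

If `nioerr` stays at zero while jobs still time out, the delay is more likely on the job/setup side (or simply slow, not failing, reads).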
It is really hard to tell whether it is actually a problem with CVMFS or not.
The error from ATLAS jobs is:
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase;
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh -c CentOS7']' timed out after 660 seconds
but there is no further indication of what exactly timed out.
Should there be anything in the logs on the client host (/var/log/messages in our case) when an attempt to read from CVMFS times out? I don’t see anything.
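For reference, the client's logging can be turned up so that problems do end up in a file; a minimal sketch of additions to /etc/cvmfs/default.local (values are illustrative examples, not recommendations; see the client parameter documentation for the defaults):

```shell
# /etc/cvmfs/default.local -- example debugging additions
# Write a very verbose debug log (enable only temporarily, it grows fast):
CVMFS_DEBUGLOG=/tmp/cvmfs-debug.log
# Raise syslog verbosity (1=errors only ... 3=most verbose):
CVMFS_SYSLOG_LEVEL=3
# Network behavior knobs that are often tuned on busy nodes
# (seconds for proxied downloads, and retries before giving up):
CVMFS_TIMEOUT=10
CVMFS_MAX_RETRIES=2
```

After editing, apply with `sudo cvmfs_config reload`. With the defaults, only warnings and errors go to syslog, so a genuinely failing (as opposed to merely slow) read should leave a trace in /var/log/messages.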