CSCS site drained due to sft-nightlies.cern.ch issue

rdimaria · March 11, 2024, 9:15am

Dear all,
I hope this finds you well.
I opened a GGUS ticket (165538) with the CVMFS unit, but they redirected me here.

The CSCS site (Piz Daint part) is drained for ATLAS (the only VO running there), as jobs are failing with the error:

pilot:9000 1 Error reading user generated output file list

As sft-nightlies is not present in /etc/cvmfs/default.local, cvmfs_talk proxy info shows no issue.
We have tried all of the following independent tests:

cvmfs_config wipecache + service autofs restart
reboot of the nodes
add sft-nightlies to the list of CVMFS_REPOSITORIES in /etc/cvmfs/default.local, then cvmfs_config reload + cvmfs_config wipecache + service autofs restart → cvmfs_config probe gives “Probing /cvmfs/sft-nightlies.cern.ch… Failed!” and
“sft-nightlies.cern.ch: Seems like CernVM-FS is not running in /var/tmp/cvmfs-workspace (not found: /var/tmp/cvmfs-workspace/cvmfs_io.sft-nightlies.cern.ch)”

We are out of ideas at this point.
Moreover, please note that the CVMFS setup at CSCS (squids) is shared between the Piz Daint and Alps machines.
On Alps, we do not observe this behaviour. Even though sft-nightlies.cern.ch is (also) NOT present in /etc/cvmfs/default.local, we are able to run the following:

nid001250:~ # cat /cvmfs/sft-nightlies.cern.ch/lcg/lastUpdate
2024-03-09 19:57:48 | lxcvmfs146.cern.ch | 1710010668

Could you please help with this at your earliest convenience?
Best regards,
Riccardo for CSCS

jakob · March 11, 2024, 9:36am

Hi Riccardo,

If I understand correctly, you’d like to make the /cvmfs/sft-nightlies.cern.ch repository available on Piz Daint. Usually, repositories get mounted automatically on access (e.g., ls /cvmfs/sft-nightlies.cern.ch/) if the proxies and keys are available. This is what happens apparently on Alps. The CVMFS_REPOSITORIES setting sets the default list of expected repositories (e.g., for cvmfs_config status) but it doesn’t prevent additional repositories from being mounted (there is another option to enforce that).
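To illustrate the distinction, a client configuration fragment might look like the sketch below. The repository list and the proxy URL are illustrative assumptions, not the CSCS values. The option that restricts mounting is not named in this post; the commented-out CVMFS_STRICT_MOUNT line is an assumption about which option is meant.

```shell
# /etc/cvmfs/default.local (fragment, illustrative values only)

# Default list of expected repositories, used e.g. by cvmfs_config status.
# Repositories not listed here can still be mounted on access.
CVMFS_REPOSITORIES=atlas.cern.ch,sft-nightlies.cern.ch

# Hypothetical site proxy; replace with the local squid(s).
CVMFS_HTTP_PROXY="http://squid.example.org:3128"

# Assumption: this is the option that limits mounting to the list above.
# CVMFS_STRICT_MOUNT=yes
```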

That said, on Piz Daint there is obviously a problem mounting the sft-nightlies.cern.ch repository. Could you try a manual mount with

mkdir -p /mnt/test
mount -t cvmfs sft-nightlies.cern.ch /mnt/test

That should give us more information.
Cheers,
Jakob

rdimaria · March 11, 2024, 10:27am

Hi Jakob,
thanks for the very prompt reply.

Indeed I was testing this manually with Enrico Bocchi (thanks!).

Unlike on Alps, we cannot use autofs on Piz Daint.
Adding the repo to default.local, creating the directory /cvmfs/sft-nightlies.cern.ch, and then running mount -vvv -t cvmfs sft-nightlies.cern.ch /cvmfs/sft-nightlies.cern.ch manually solved the issue.

However, I still cannot understand what happened on March 6-7th that now forces the use of sft-nightlies.cern.ch.
Could you please help here?

I will proceed with requesting the manual mounting of this new repo on all Piz Daint nodes.

Best regards and thanks,
Riccardo
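The manual procedure described above (create the directory, then mount) could be scripted for a rollout across nodes roughly as follows. This is a minimal sketch, not the procedure actually deployed at CSCS: the function name mount_cvmfs_repo and its optional BASE argument are hypothetical, and it assumes root privileges, the mountpoint utility from util-linux, and a client configuration that already works (proxies, keys).

```shell
#!/bin/sh
# Sketch only: mount one CVMFS repository by hand on a node without autofs.

# mount_cvmfs_repo REPO [BASE]
#   REPO: fully qualified repository name, e.g. sft-nightlies.cern.ch
#   BASE: parent directory of the mount point (hypothetical parameter,
#         defaults to /cvmfs)
mount_cvmfs_repo() {
    repo="$1"
    base="${2:-/cvmfs}"
    target="$base/$repo"

    # Nothing to do if something is already mounted on the target.
    if mountpoint -q "$target"; then
        echo "$repo already mounted on $target"
        return 0
    fi

    # Without autofs the mount point has to exist beforehand.
    mkdir -p "$target" || return 1

    # Same mount command as in the post above, without the -vvv debug flags.
    mount -t cvmfs "$repo" "$target"
}
```

On a node, this would be called as mount_cvmfs_repo sft-nightlies.cern.ch. The function returns the exit status of mount, so a wrapper can report nodes where the mount failed.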

vavolkl · March 11, 2024, 10:57am

Hi Riccardo,

I’ve asked the ATLAS software experts about this, and while sft-nightlies has been a requirement for a subset of ATLAS jobs for a while, they’ve recently added a new check to their pilot jobs to check for it, which explains the recent failure. They should be able to confirm this here.
Cheers,
Valentin

plove · March 11, 2024, 1:19pm

Confirmed, this check has been added. Please see code here.

destefano · March 12, 2024, 11:25pm

Some ATLAS sites are considering working around this check by auto-mounting these repositories and keeping them mounted with a read/stat triggered by cron every few minutes. Do you see any possible downsides to this approach, @jakob? @vavolkl?
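For concreteness, the cron-triggered read could look something like the sketch below. It is only an illustration of the idea, not what any site is known to run: the function name keep_mounted, the CVMFS_BASE override, the script path, and the five-minute interval are all assumptions.

```shell
#!/bin/sh
# Sketch only: touch each repository so that autofs mounts it, or keeps it
# from being unmounted after its idle timeout.

# keep_mounted REPO...
#   Lists the root of every repository given. Returns non-zero if any of
#   them cannot be accessed. CVMFS_BASE is a hypothetical override of the
#   /cvmfs prefix, mainly useful for testing.
keep_mounted() {
    base="${CVMFS_BASE:-/cvmfs}"
    failed=0
    for repo in "$@"; do
        # Accessing the directory is what triggers the automount.
        if ! ls "$base/$repo/" >/dev/null 2>&1; then
            echo "cannot access $base/$repo" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example /etc/cron.d entry, assuming the function is wrapped in a script
# at the hypothetical path /usr/local/bin/cvmfs-keepalive.sh:
# */5 * * * * root /usr/local/bin/cvmfs-keepalive.sh
```

A wrapper script would end with a call such as keep_mounted sft-nightlies.cern.ch, listing whichever repositories the pilot checks.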

vavolkl · March 13, 2024, 8:39am

Hi John,

The problem at CSCS was due to autofs not being available and sft-nightlies.cern.ch not yet being mounted manually. If automounting is possible, I would have expected that there is no need for a cron job to keep it mounted. The check should in effect mount it, right?

destefano · March 14, 2024, 1:35pm

Thanks Valentin, but in my case, most (nearly all) of the nodes that I have seen with these reported issues had no problems with autofs or mounting the repos on demand. The new pilot check does a much better job of checking whether the repo can be mounted and files accessed, as opposed to checking whether the files are already mounted, and we’ve seen zero reported issues since that was put in place. But if they arise again on nodes without issues, we may consider auto-mounting the repos that are being checked.
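The difference between the two kinds of check can be sketched as follows. This is not the pilot's code, only an illustration under assumptions: the function names is_mounted and is_accessible and the optional mount-table argument are hypothetical. With autofs, a repository that is idle and currently unmounted fails the first check but passes the second, because the access itself triggers the mount.

```shell
#!/bin/sh
# Sketch only: passive versus active repository checks.

# is_mounted MOUNTPOINT [TABLE]
#   Passive check: looks for MOUNTPOINT in the mount table without touching
#   the file system. TABLE is a hypothetical parameter (default /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# is_accessible MOUNTPOINT
#   Active check: lists the directory, which makes autofs mount the
#   repository if it is not mounted yet.
is_accessible() {
    ls "$1/" >/dev/null 2>&1
}
```

A check such as is_accessible /cvmfs/sft-nightlies.cern.ch therefore reports a failure only when the repository really cannot be mounted or read.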