cvmfs_server snapshot software.eessi.io - replication failing

Hi All,

I was wondering if anyone else has experienced this issue?

I am trying to create a Stratum 1 replica of the EESSI software.eessi.io repository, and the process keeps failing.
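
For reference, I set up the replica roughly like this, using the sync server URL that appears in the error below (the owner and the key directory are assumptions; adjust them to your setup and to wherever your eessi.io public key lives):

  sudo cvmfs_server add-replica -o $USER \
      http://aws-eu-west-s1-sync.eessi.science/cvmfs/software.eessi.io \
      /etc/cvmfs/keys/eessi.io/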

It starts replicating the catalogs, but partway through it fails with the error below:

Replicating from catalog at /versions/2023.06/software/linux/x86_64/amd/zen2/software/networkx/3.1-gfbf-2023a
  Catalog up to date
Replicating from catalog at /versions/2023.06/software/linux/x86_64/amd/zen2/software/pytest-flakefinder/1.1.0-GCCcore-12.3.0
  Catalog up to date
Replicating from catalog at /versions/2023.06/software/linux/x86_64/amd/zen2/software/pytest-rerunfailures/12.0-GCCcore-12.3.0
  Catalog up to date
Replicating from catalog at /versions/2023.06/software/linux/x86_64/amd/zen2/software/pytest-shard/0.1.2-GCCcore-12.3.0
  Catalog up to date
Replicating from catalog at /versions/2023.06/software/linux/x86_64/amd/zen2/software/sympy/1.12-gfbf-2023a
Processing chunks [2944 registered chunks]: ..failed to download http://aws-eu-west-s1-sync.eessi.science/cvmfs/software.eessi.io/data/e1/9f8c5cd1fb0e5a7f56aba50e0ca6cb923e79af (17 - host data transfer cut short)
couldn't reach Stratum 0 - please check the network connection
terminate called after throwing an instance of 'ECvmfsException'
  what():  PANIC: /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc8/build/BUILD/cvmfs-2.11.2/cvmfs/swissknife_pull.cc : 286
Download error
/usr/bin/cvmfs_server: line 7528: 237727 Aborted                 (core dumped) $user_shell "$(__swissknife_cmd dbg) pull -m $name         -u $stratum0                                           -w $stratum1                                           -r ${upstream}                                         -x ${spool_dir}/tmp                                    -k $public_key                                         -n $num_workers                                        -t $timeout                                            -a $retries $with_history $with_reflog                    $initial_snapshot_flag $timestamp_threshold $log_level"

I have tried removing the repo with cvmfs_server rmfs a few times and starting from scratch, but the same failure occurs partway through each run. This time it stopped on the sympy package, but the failure point appears to be random, as it got stuck on different packages on previous attempts.

Any tips, pointers, or help are most welcome!

Thanks.

Are you only seeing this when creating the initial snapshot for software.eessi.io?

It seems like a more general network problem to me…
The sync server for software.eessi.io is running in AWS region eu-west-1 (Dublin).
Where is your Stratum 1 server located?

Also, can’t you just resume the initial snapshot so that it continues where it left off?
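
Resuming should just be a matter of re-running the snapshot command; catalogs that were already replicated get skipped with "Catalog up to date", like in your log:

  cvmfs_server snapshot software.eessi.io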

It seems like you’re hitting a persistent issue, which we can reproduce directly via curl.
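
For example, something along these lines (using the chunk URL from your error output) reproduces the failed transfer for us, with no CVMFS involved:

  curl -v -o /dev/null http://aws-eu-west-s1-sync.eessi.science/cvmfs/software.eessi.io/data/e1/9f8c5cd1fb0e5a7f56aba50e0ca6cb923e79af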

We’ve opened an issue about this in the EESSI support portal: "Some files cannot be fetched from the CVMFS sync server" (#49).

I could reproduce the issue with that file on my laptop (on our university network) and on our cluster login nodes. However, it works fine on other machines outside our university network, and it also worked fine for @boegel. Pulling the same file from our Azure Stratum 1 does not seem to work for me either.

That made me think it may be our university firewall/threat detection system that is blocking this file; we actually saw a similar issue a few years ago with a (public) Stratum 1 that we were running on the university network, where it blocked files pulled by external users.

Could it be that you’re also doing this on a university network with similar scanning tools in place (in our case it’s Palo Alto)?

Hi, thanks for getting back to me.

Yes, this is happening on the initial snapshot.

Our S1 server is located in Surrey, England.

I can just resume the initial snapshot to continue where it left off, but it simply fails again with the same error message.

I am currently running another initial snapshot, and it’s still going, so I will update here on progress.

Thanks.

Yes, we have Palo Alto firewalls too.

Can you try using HTTPS rather than HTTP when you’re hitting the problem?

That should probably bypass the issue with Palo Alto, but it’s not 100% clear whether that will work.
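
As a sketch, switching the replica’s upstream to HTTPS would mean editing the repository’s server configuration; the file location is the standard one for CVMFS server, but whether the sync server accepts HTTPS on that URL is an assumption you’d need to verify:

  # /etc/cvmfs/repositories.d/software.eessi.io/server.conf (assumed standard location)
  # Point the replica at the HTTPS endpoint instead of plain HTTP:
  CVMFS_STRATUM0=https://aws-eu-west-s1-sync.eessi.science/cvmfs/software.eessi.io

Then re-run cvmfs_server snapshot software.eessi.io to test.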

I will try that the next time I hit the problem. So far so good, and I have a completed snapshot now.