Is there a way to fix a corrupted replica?

Hi,

Other than starting from scratch, is there a mechanism or some trick to fix a replica that cannot be snapshotted? For example:

[root@cvmfs-stratum1 ~]# cvmfs_server snapshot unpacked.cern.ch
Initial snapshot
CernVM-FS: replicating from http://cvmfs-stratum-zero.cern.ch/cvmfs/unpacked.cern.ch
CernVM-FS: using public key(s) /etc/cvmfs/keys/cern.ch/cern-it1.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it4.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it5.cern.ch.pub
Creating an empty Reflog for 'unpacked.cern.ch'
Found 99 named snapshots
Uploading history database
Starting 16 workers
Replicating from trunk catalog at /
  Processing chunks [4469 registered chunks]: .^C
[root@cvmfs-stratum1-02 ~]#
[root@cvmfs-stratum1-02 ~]# cvmfs_server snapshot sft.cern.ch
Initial snapshot
CernVM-FS: replicating from http://cvmfs-stratum-zero.cern.ch/cvmfs/sft.cern.ch
CernVM-FS: using public key(s) /etc/cvmfs/keys/cern.ch/cern-it1.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it4.cern.ch.pub, /etc/cvmfs/keys/cern.ch/cern-it5.cern.ch.pub
Creating an empty Reflog for 'sft.cern.ch'
Found 251 named snapshots
Uploading history database
Starting 16 workers
Replicating from trunk catalog at /
  Processing chunks [9 registered chunks]: . fetched 1 new chunks out of 9 unique chunks
Replicating from catalog at /lcg/etc
  Processing chunks [847 registered chunks]: terminate called after throwing an instance of 'ECvmfsException'
  what():  PANIC: /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc8/build/BUILD/cvmfs-2.11.2/cvmfs/swissknife_pull.cc : 95
spooler failure 2 (/var/spool/cvmfs/sft.cern.ch/tmp/cvmfs.nQWPuF, hash: 0000000000000000000000000000000000000000)
/usr/bin/cvmfs_server: line 7528: 701345 Aborted                 (core dumped) $user_shell "$(__swissknife_cmd dbg) pull -m $name         -u $stratum0                                           -w $stratum1                                           -r ${upstream}                                         -x ${spool_dir}/tmp                                    -k $public_key                                         -n $num_workers                                        -t $timeout                                            -a $retries $with_history $with_reflog                    $initial_snapshot_flag $timestamp_threshold $log_level"

There is usually a way to fix it, but it depends on the error. In this case “spooler failure 2” looks like a write error. Is the filesystem holding /var/spool/cvmfs full, or something like that? Is /var/spool/cvmfs/sft.cern.ch/tmp a valid, writable path?
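A quick way to check (a rough sketch using the paths from your output; adjust per affected repo) is to verify that the spool area is mounted, has free space, and is writable:

  df -h /var/spool/cvmfs
  mount | grep /var/spool
  ls -ld /var/spool/cvmfs/sft.cern.ch/tmp
  touch /var/spool/cvmfs/sft.cern.ch/tmp/.writetest && rm /var/spool/cvmfs/sft.cern.ch/tmp/.writetest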

Hi,

The filesystem that holds the /var/spool/ directories was unmounted for a while. That explains why a few repos got into that state.

So, what is the correct way of fixing those broken repos? Running a new snapshot doesn’t help.

Cheers,
Jose

It sounds like the /var/spool/cvmfs subdirectories for those repositories are messed up. Compare them to a good one, in particular the tmp symlink. Worst case, you could probably clean them up with remove-repository -H to remove the configuration without removing the data, and then add them back with add-repository -h. (I am assuming this is your old server and you’re still using cvmfs-hastratum1. If not, the corresponding command for removal is cvmfs_server rmfs -p, followed by, I think, a normal cvmfs_server add-replica.)
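For the non-hastratum1 case, the rebuild would look roughly like this (an untested sketch; the stratum 0 URL comes from your output and the key path is a placeholder, so double-check the add-replica arguments for your setup before running it):

  cvmfs_server rmfs -p sft.cern.ch
  cvmfs_server add-replica -o root http://cvmfs-stratum-zero.cern.ch/cvmfs/sft.cern.ch /etc/cvmfs/keys/cern.ch
  cvmfs_server snapshot sft.cern.ch

Since rmfs -p preserves the backing storage, the fresh snapshot should mostly need to fetch whatever is missing rather than re-replicating everything.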