Hi there,
We’ve configured a VM as a CVMFS publisher, utilizing S3 buckets as our Stratum 0. Recently, we’ve encountered recurring issues similar to the following:
terminate called after throwing an instance of 'ECvmfsException'
what(): PANIC: /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc8/build/BUILD/cvmfs-2.12.6/cvmfs/catalog_mgr_ro.cc : 139
failed to load https://TENANT/BUCKET-NAME/data/b0/4a3fa6e56736cb0e8d63471abe4f4c87b3bab8C from Stratum 0 (9 - host returned HTTP error)
We’ve observed that immediately after the publisher uploads content to the Stratum 0 (S3 bucket), it attempts a post-upload verification. Our S3 buckets are hosted on NetApp storage behind a load balancer, and it appears there isn’t sufficient time for the content to propagate to all NetApp nodes before the verification check runs. The only workaround we’ve identified so far is to target a specific NetApp node directly, which is not viable as a long-term solution.
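For reference, our publisher’s S3 settings look roughly like the sketch below (hostnames, bucket name, and credentials are placeholders); the commented-out line shows the temporary workaround of pinning to a single NetApp node:
# S3 settings file referenced by the repository's S3 upstream (values are placeholders)
CVMFS_S3_HOST=s3.tenant.example.org           # load-balanced NetApp endpoint
# CVMFS_S3_HOST=netapp-node1.example.org      # temporary workaround: pin a single node
CVMFS_S3_BUCKET=BUCKET-NAME
CVMFS_S3_ACCESS_KEY=<access key>
CVMFS_S3_SECRET_KEY=<secret key>
CVMFS_S3_DNS_BUCKETS=false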
We’ve attempted to adjust the following variables in /etc/cvmfs/default.conf, without success:
CVMFS_MAX_RETRIES=5
CVMFS_BACKOFF_INIT=40
CVMFS_BACKOFF_MAX=60
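In case it is useful, this is roughly how we checked that the values are actually picked up by the client mount on the publisher (the repository name is a placeholder):
# effective client configuration for the repository
cvmfs_config showconfig my.repo.example.org | grep -E 'MAX_RETRIES|BACKOFF'
# run-time parameters of the mounted repository
sudo cvmfs_talk -i my.repo.example.org parameters | grep -E 'MAX_RETRIES|BACKOFF'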
Therefore, the question is: Is there a mechanism within CVMFS to control or delay the timing of the post-upload verification checks against our S3 tenant? We need to give the S3 bucket enough time to replicate its content across all NetApp nodes, and the load balancer time to pick up these updates, so that we can prevent these ECvmfsException errors.
Thanks
Hi, sorry for the late reply; it’s already vacation season here.
Hmm. This is a good question with no straightforward answer. I don’t think this error comes from a post-upload verification check (we don’t currently do one, but we’ll add it in the next release: Add CVMFS_PEEK_AFTER_PUT to doublecheck existence of object after put to S3 by mharvey-jt · Pull Request #2966 · cvmfs/cvmfs · GitHub). I think the error comes from the cvmfs mount on the publisher, which tries to fetch the new revision back from the server.
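To help narrow it down on your side, you could check what that client mount reports; a rough diagnostic sketch (the repository name is a placeholder):
# backend host(s) the publisher's client mount uses, and their current status
sudo cvmfs_talk -i my.repo.example.org host info
# catalog revision the mount currently sees
sudo cvmfs_talk -i my.repo.example.org revision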
Wherever it comes from, we can certainly add an option to delay the timing. The problem is, I think your issues won’t stop there: CVMFS isn’t really built to be used with load balancers. The issue is alluded to here:
High availability
On the subject of availability, note that it is not advised to use two separate complete Stratum 1 servers in a single round-robin service because they will be updated at different rates. That would cause errors when a client sees an updated catalog from one Stratum 1 but tries to read corresponding data files from the other that does not yet have the files. Different Stratum 1s should either be separately configured on the clients, or a pair can be configured as a high availability active/standby pair using the cvmfs-contrib cvmfs-hastratum1 package. An active/standby pair can also be managed by switching a DNS name between two different servers.
This is talking about load balancing for Stratum 1s, but similar considerations apply to Stratum 0s. I’ve seen, though, that at the last HCPKP there were a lot of requests regarding load balancing in the CVMFS infrastructure, and maybe there is a way to add an operating mode to CVMFS that works better with load balancers. It would be good to have some further chats about your use case.
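For reference, “separately configured on the clients” means listing the Stratum 1s individually in CVMFS_SERVER_URL rather than hiding them behind one load-balanced name, roughly like this (hostnames are placeholders):
# e.g. in /etc/cvmfs/domain.d/example.org.local on the clients (hostnames are placeholders)
CVMFS_SERVER_URL="http://s1-a.example.org:8000/cvmfs/@fqrn@;http://s1-b.example.org:8000/cvmfs/@fqrn@"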
As a first step, though, let me find out where exactly your error is coming from and see if I can add an option that lets you wait out the replication.