Monitoring CVMFS used as a container registry in an HTC environment (pull timing + Squid architecture)

Hello there! I'm a fairly new system administrator for usegalaxy.eu, and one of my tasks is to figure out how to monitor CVMFS.

CVMFS is used as a container registry for many compute nodes: all of them mount CVMFS directly, and the container runtimes fetch image layers through the CVMFS FUSE client. Between the compute nodes and our stratum servers we also run a Squid cache.

Do you have any tips or good practices for monitoring pull performance and read latencies in this CVMFS/Squid setup?

Hi Gabriel, welcome!

Yes, that’s certainly a good use case. What kind of container runtime do you use (Apptainer, containerd, Podman, …)?

Do you have a monitoring system already in place? If you can use Prometheus, we’ve recently added a Prometheus exporter for the CVMFS client that gives very extensive performance metrics (maybe even a bit of overkill with some of the internal metrics). See GitHub - cvmfs-contrib/prometheus-cvmfs: CVMFS Client prometheus exporter
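
If it helps as a starting point, here is a minimal sketch for sanity-checking what the exporter exposes on a node before wiring it into Prometheus. The endpoint URL (host, port 9868) and the `cvmfs` metric prefix are assumptions on my side, so adjust them to whatever your deployment actually serves:

```python
#!/usr/bin/env python3
"""Quick sanity check of a node's CVMFS Prometheus exporter.

The exporter URL (port 9868 is a placeholder) and the 'cvmfs' metric
prefix are assumptions -- adjust to what prometheus-cvmfs exposes
in your deployment.
"""
import sys
import urllib.request

EXPORTER_URL = "http://localhost:9868/metrics"  # placeholder address


def dump_cvmfs_metrics(url: str = EXPORTER_URL) -> None:
    with urllib.request.urlopen(url, timeout=5) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    for line in text.splitlines():
        # Keep only the CVMFS series, skip Prometheus HELP/TYPE comments.
        if line.startswith("cvmfs") and not line.startswith("#"):
            print(line)


if __name__ == "__main__":
    dump_cvmfs_metrics(sys.argv[1] if len(sys.argv) > 1 else EXPORTER_URL)
```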

For Squid I’m not an expert; maybe others have some pointers (@dwd @cvuosalo ?)

For pull and read latencies, that again depends on your container runtimes. We did do some benchmarking (KubeCon + CloudNativeCon Europe 2025: Image Snapshotters for Efficient Contain...), but that was basically done with manual timing.
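
If you just want a rough number without a full benchmarking harness, manual timing can be as simple as reading an unpacked image tree twice and comparing cold vs. warm access. The repository and image path below are hypothetical placeholders, and for a genuinely cold first run you would also need to wipe the client cache beforehand (e.g. with `cvmfs_talk cleanup`), so treat this as a sketch:

```python
#!/usr/bin/env python3
"""Rough cold-vs-warm read timing for a container image tree on CVMFS.

The repository and image path are hypothetical placeholders; point them
at a real unpacked image on your nodes. For a truly cold first run,
wipe the client cache first, e.g. `sudo cvmfs_talk -i <repo> cleanup 0`.
"""
import os
import time

IMAGE_DIR = "/cvmfs/unpacked.example.org/registry/some-image:latest"  # placeholder


def read_tree(root: str) -> int:
    """Read every regular file under root; return total bytes read."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    while chunk := fh.read(1 << 20):
                        total += len(chunk)
            except OSError:
                pass  # broken symlinks, permission issues, etc.
    return total


if __name__ == "__main__":
    for label in ("cold (first access)", "warm (client cache)"):
        start = time.monotonic()
        nbytes = read_tree(IMAGE_DIR)
        elapsed = time.monotonic() - start
        print(f"{label}: {nbytes / 1e6:.1f} MB in {elapsed:.2f} s "
              f"({nbytes / 1e6 / max(elapsed, 1e-9):.1f} MB/s)")
```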

Cheers,
Valentin

I’m not aware of any alarms watching Squid for performance, but we do plot the in/out data throughput and the number of hits vs. fetches, so we can tell from the plots when there’s an unusually poor hit ratio. We use MRTG, with plots at http://wlcg-squid-monitor.cern.ch/snmpstats/all.html.
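
If you want numbers you can alert on rather than plots, one low-tech option is to compute the hit ratio straight from the Squid access log. A minimal sketch, assuming the default native log format (cache result code such as TCP_HIT/200 in the fourth field) and a guessed log path, both of which may differ on your installation:

```python
#!/usr/bin/env python3
"""Compute a Squid cache hit ratio from the access log.

Assumes the default "native" Squid log format, where the fourth field is
the cache result code (e.g. TCP_HIT/200, TCP_MISS/200). The log path is
a guess -- packages put it in different places.
"""
import sys
from collections import Counter

ACCESS_LOG = "/var/log/squid/access.log"  # adjust to your installation


def hit_ratio(path: str) -> None:
    codes = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue
            result = fields[3].split("/")[0]  # e.g. TCP_HIT, TCP_MISS
            codes["hit" if "HIT" in result else "miss"] += 1
    total = sum(codes.values())
    if total:
        print(f"{codes['hit']}/{total} requests served from cache "
              f"({100 * codes['hit'] / total:.1f}% hit ratio)")


if __name__ == "__main__":
    hit_ratio(sys.argv[1] if len(sys.argv) > 1 else ACCESS_LOG)
```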

We also configure fallback proxies for when local proxies fail, and we monitor them for large numbers of hits so we can work with the site administrators to get their proxies working again. That uses the awstats tool to collect data on the stratum-1, and we turn that data into a plot showing where the hits are coming from, by organization, at http://wlcg-squid-monitor.cern.ch/failover/failoverCvmfs/failover.html.
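
On the client side, a cheap complement to the stratum-1 view is to ask the CVMFS client which proxy it is actually using, via extended attributes on the repository root; nodes showing up on the fallback instead of the local Squid are the same signal the awstats plot catches later. The attribute names below (user.proxy, user.hitrate, user.nioerr) are the ones I believe recent clients expose, so check them against your client version before relying on this:

```python
#!/usr/bin/env python3
"""Report which proxy each mounted CVMFS repository is currently using.

The extended attribute names (user.proxy, user.hitrate, user.nioerr) are
what I believe recent CVMFS clients expose on the repository root; verify
against your client version before wiring this into alerting.
"""
import os

CVMFS_ROOT = "/cvmfs"
XATTRS = ("user.proxy", "user.hitrate", "user.nioerr")


def report() -> None:
    for repo in sorted(os.listdir(CVMFS_ROOT)):
        mountpoint = os.path.join(CVMFS_ROOT, repo)
        if not os.path.ismount(mountpoint):
            continue  # skip autofs placeholders that are not mounted
        values = {}
        for attr in XATTRS:
            try:
                values[attr] = os.getxattr(mountpoint, attr).decode().strip()
            except OSError:
                values[attr] = "n/a"
        print(f"{repo}: proxy={values['user.proxy']} "
              f"hitrate={values['user.hitrate']} nioerr={values['user.nioerr']}")


if __name__ == "__main__":
    report()
```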

Dave