I’m evaluating the Compute Canada software stack via CVMFS, and my first proof-of-concept test has resulted in constant Input/Output errors when trying to load and run many different modules, from R to Python to applications in the root like curl.
I am currently using a single machine and no forward Squid proxy, so I expected performance to be poor, but I did not expect applications to fail outright. If I can get applications working successfully, I plan to deploy a Squid proxy before adding more than a single client.
I am on Ubuntu 24.04 and following Accessing CVMFS - Alliance Doc. If there’s a more appropriate avenue for me to ask questions about the Compute Canada stack, I’d appreciate direction towards it.
My current setup is:
Thanks for the assistance! Even with my broken configuration it was working after retrying commands and running cvmfs_config reload, just slowly and inconsistently.
Removing CVMFS_HTTP_PROXY=DIRECT and falling back on the default value for CVMFS_HTTP_PROXY seems to have helped things, but I’m still seeing Input/Output errors as I test more applications.
Here is my whole cvmfs_config showconfig:
Running /usr/bin/cvmfs_config soft.computecanada.ca:
CVMFS_REPOSITORY_NAME=soft.computecanada.ca
CERNVM_GRID_UI_VERSION=
CVMFS_ALIEN_CACHE=
CVMFS_ALT_ROOT_PATH=
CVMFS_AUTHZ_HELPER=
CVMFS_AUTHZ_SEARCH_PATH=
CVMFS_AUTO_UPDATE=
CVMFS_BACKOFF_INIT=2 # from /etc/cvmfs/default.conf
CVMFS_BACKOFF_MAX=10 # from /etc/cvmfs/default.conf
CVMFS_BASE_ENV=1 # from /etc/cvmfs/default.conf
CVMFS_CACHE_BASE=/var/lib/cvmfs # from /etc/cvmfs/default.conf
CVMFS_CACHE_DIR=/var/lib/cvmfs/shared
CVMFS_CACHE_PRIMARY=
CVMFS_CACHE_REFCOUNT=true # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf
CVMFS_CHECK_PERMISSIONS=yes # from /etc/cvmfs/default.conf
CVMFS_CLAIM_OWNERSHIP=yes # from /etc/cvmfs/default.conf
CVMFS_CLIENT_PROFILE=single # from /etc/cvmfs/default.local
CVMFS_CONFIG_REPOSITORY=cvmfs-config.cern.ch # from /etc/cvmfs/default.d/50-cern-debian.conf
CVMFS_CONFIG_REPO_DEFAULT_ENV=1 # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf
CVMFS_CONFIG_REPO_REQUIRED=
CVMFS_DEBUGLOG=/tmp/cvmfs.log # from /etc/cvmfs/default.local
CVMFS_DEFAULT_DOMAIN=cern.ch # from /etc/cvmfs/default.d/50-cern-debian.conf
CVMFS_DNS_RETRIES=
CVMFS_DNS_TIMEOUT=
CVMFS_EXTERNAL_FALLBACK_PROXY=
CVMFS_EXTERNAL_HTTP_PROXY=
CVMFS_EXTERNAL_SERVER_URL=
CVMFS_EXTERNAL_TIMEOUT=
CVMFS_EXTERNAL_TIMEOUT_DIRECT=
CVMFS_FALLBACK_PROXY= # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_FOLLOW_REDIRECTS=
CVMFS_HIDE_MAGIC_XATTRS=yes # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf
CVMFS_HOST_RESET_AFTER=1800 # from /etc/cvmfs/default.conf
CVMFS_HTTP_PROXY='auto;DIRECT' # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_IGNORE_SIGNATURE=
CVMFS_INITIAL_GENERATION=
CVMFS_IPFAMILY_PREFER=
CVMFS_KCACHE_TIMEOUT=
CVMFS_KEYS_DIR=/cvmfs/cvmfs-config.cern.ch/etc/cvmfs/keys/computecanada.ca # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_LOW_SPEED_LIMIT=1024 # from /etc/cvmfs/default.conf
CVMFS_MAGIC_XATTRS_VISIBILITY=rootonly # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf
CVMFS_MAX_IPADDR_PER_PROXY=
CVMFS_MAX_RETRIES=1 # from /etc/cvmfs/default.conf
CVMFS_MAX_TTL=
CVMFS_MEMCACHE_SIZE=
CVMFS_MOUNT_DIR=/cvmfs # from /etc/cvmfs/default.conf
CVMFS_MOUNT_RW=
CVMFS_NFILES=131072 # from /etc/cvmfs/default.conf
CVMFS_NFS_SHARED=
CVMFS_NFS_SOURCE=
CVMFS_OOM_SCORE_ADJ=
CVMFS_PAC_URLS='http://grid-wpad/wpad.dat;http://wpad/wpad.dat;http://cernvm-wpad.cern.ch/wpad.dat;http://cernvm-wpad.fnal.gov/wpad.dat' # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/default.conf
CVMFS_PROXY_RESET_AFTER=300 # from /etc/cvmfs/default.conf
CVMFS_PROXY_TEMPLATE=
CVMFS_PUBLIC_KEY=
CVMFS_QUOTA_LIMIT=4000 # from /etc/cvmfs/default.conf
CVMFS_RELOAD_SOCKETS=/var/run/cvmfs # from /etc/cvmfs/default.conf
CVMFS_REPOSITORIES=soft.computecanada.ca # from /etc/cvmfs/default.local
CVMFS_REPOSITORY_DATE=
CVMFS_REPOSITORY_TAG=
CVMFS_ROOT_HASH=
CVMFS_SEND_INFO_HEADER=yes # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_SERVER_CACHE_MODE=
CVMFS_SERVER_URL=http://cvmfs-s1.computecanada.net/cvmfs/soft.computecanada.ca # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_SHARED_CACHE=yes # from /etc/cvmfs/default.conf
CVMFS_STRICT_MOUNT=no # from /etc/cvmfs/default.conf
CVMFS_SYSLOG_FACILITY=
CVMFS_SYSLOG_LEVEL=
CVMFS_SYSTEMD_NOKILL=
CVMFS_TIMEOUT=5 # from /etc/cvmfs/default.conf
CVMFS_TIMEOUT_DIRECT=10 # from /etc/cvmfs/default.conf
CVMFS_TRACEFILE=
CVMFS_TRUSTED_CERTS=
CVMFS_USER=cvmfs # from /etc/cvmfs/default.conf
CVMFS_USE_CDN=
CVMFS_USE_GEOAPI=yes # from /cvmfs/cvmfs-config.cern.ch/etc/cvmfs/domain.d/computecanada.ca.conf
CVMFS_WORKSPACE=
Could you please provide more debug logs? In particular, one for each of these cases:
when it is working
when it is working but took a long time
when it is failing
I think you have 2 issues.
Your configuration of the server and proxies is suboptimal, but this should not be the cause of the failures.
You normally set either CVMFS_CLIENT_PROFILE or CVMFS_HTTP_PROXY, but only in rare cases both. If you use DIRECT you cannot benefit from the GEOAPI, because there is only one server for Compute Canada when you set DIRECT.
I think your main issue is with the cache. Could you please let us know which kind of errors you find? Is it EIO (an Input/Output error)? If yes, you need to increase CVMFS_QUOTA_LIMIT to something larger than the default CVMFS_QUOTA_LIMIT=4000 (= 4 GB).
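To illustrate why a too-small CVMFS_QUOTA_LIMIT shows up as I/O errors, here is a toy sketch (not the real CVMFS cache code; all names and sizes are made up): the client evicts least-recently-used cache entries to stay under the quota, so when the working set of a software stack is larger than the quota, the cache thrashes and files get evicted while an application still needs them.

```python
from collections import OrderedDict

# Illustrative only: an LRU cache with a size quota, showing how a working
# set larger than the quota causes constant eviction ("thrashing").
class LruCache:
    def __init__(self, quota):
        self.quota = quota
        self.used = 0
        self.entries = OrderedDict()  # name -> size, oldest first
        self.evictions = 0

    def fetch(self, name, size):
        if name in self.entries:               # cache hit: mark recently used
            self.entries.move_to_end(name)
            return "hit"
        while self.used + size > self.quota and self.entries:
            _, freed = self.entries.popitem(last=False)  # evict LRU entry
            self.used -= freed
            self.evictions += 1
        self.entries[name] = size
        self.used += size
        return "miss"

cache = LruCache(quota=4_000)                  # stands in for the 4 GB default
working_set = [(f"chunk{i}", 1_000) for i in range(6)]  # ~6 GB working set
for _ in range(3):                             # application touches all files repeatedly
    for name, size in working_set:
        cache.fetch(name, size)
print(cache.evictions)                         # nonzero: the cache thrashes
```

With a quota larger than the working set (e.g. 6,000 here), the same access pattern causes no evictions at all, which is the effect of raising CVMFS_QUOTA_LIMIT.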
My suggestion for your minimal config in /etc/cvmfs/default.local:
CVMFS_CLIENT_PROFILE=single
CVMFS_QUOTA_LIMIT=10000
CVMFS_DEBUGLOG=/tmp/cvmfs-@fqrn@.log # this makes a separate log for each repo with reponame @fqrn@
Let me know if that helps
Cheers
Laura
PS. CVMFS_REPOSITORIES is not needed. It is only useful if you want to auto-run the cvmfs_config commands on the listed repositories, or if you want to enforce that users mount only those repositories with CVMFS_STRICT_MOUNT=on.
Also make sure there’s enough disk space to hold CVMFS_QUOTA_LIMIT worth of cache.
It doesn’t really make sense that you’d have better success leaving CVMFS_HTTP_PROXY unset than setting it to DIRECT, because the default setting is auto;DIRECT, which means it will first try to read http://cernvm-wpad.cern.ch/wpad.dat, and that probably does not return anything helpful: it doesn’t know that computecanada.net is behind Cloudflare, so it returns NONE. In fact, that’s probably where that http://NULL is coming from.
Our configuration is intended to work out of the box and make things simple and easy for end users.
There is a snippet of config here: config-repo/etc/cvmfs/domain.d/computecanada.ca.conf at master · cvmfs-contrib/config-repo · GitHub
which should result in using cvmfs-s1.computecanada.net in situations like this.
That Cloudflare LB is backed by 4 origin servers, with dynamic steering to detect the closest one, and has zero-downtime failover between them if one fails in the short time period between a HTTP request being made and the automated health checks detecting a failure. We haven’t had any issues known/reported with that configuration.
But again, end users should not have to worry about these details.
@dwd it sounds like the issue is caused by WPAD? I didn’t know that was activated by default. Why does it return NONE and potentially break the connection by setting http://NULL?
Currently if you do curl http://cernvm-wpad.cern.ch/wpad.dat from an IP address not associated with a proxy you will get this result:
// no squid found matching the remote ip address
function FindProxyForURL(url, host) {
    if (shExpMatch(url, "*.openhtc.io*")) {
        return "DIRECT";
    }
    return "NONE";
}
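For readers less familiar with PAC files: the snippet above is JavaScript evaluated by the client, and its effect can be rendered in Python like this (the hostnames in the examples are illustrative, not actual stratum 1 names):

```python
from fnmatch import fnmatch

# Python rendering of the wpad.dat logic above: only URLs containing
# ".openhtc.io" get DIRECT; everything else gets NONE ("no proxy available").
# shExpMatch uses shell-style globbing, which fnmatch approximates here.
def find_proxy_for_url(url):
    if fnmatch(url, "*.openhtc.io*"):
        return "DIRECT"
    return "NONE"

print(find_proxy_for_url("http://example.openhtc.io/cvmfs/repo"))            # DIRECT
print(find_proxy_for_url("http://cvmfs-s1.computecanada.net/cvmfs/repo"))    # NONE
```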
So it’s only recognizing openhtc.io as a destination to use for DIRECT. The idea is that we don’t want clients directly connecting to stratum 1s, but Cloudflare is OK in limited numbers. The WLCG WPAD service hasn’t yet been configured to recognize computecanada.net as an acceptable DIRECT destination. I will make that configuration change today. Note that cernvm-wpad.{cern.ch|fnal.gov} are also configured to only accept a limited number of requests in a period of time from a single GeoIP “organization” before they start redirecting to failover proxies. That is to prevent abuse of Cloudflare from an organization that should be supplying its own squid.
In addition, I’m pretty sure you and I must have discussed this before, but for the record and in case others come across this post: using multiple stratum 1s behind a single alias has the potential of confusing the CVMFS client and is not recommended. That’s because stratum 1 updates are not synchronized, and if the client reads a catalog from one stratum 1 it assumes that all files associated with that update are present on that stratum 1. By switching stratum 1 servers without the knowledge of the CVMFS client, the client could get an error when attempting to read a file that is in the catalog it sees but not yet present on the other stratum 1. I’m not sure if that returns an immediate fatal error to the user, but at minimum the client will consider that stratum 1 to be broken and not use it anymore.
Instead, I recommend that Compute Canada and anyone else that wants to set up their own Cloudflare alias do what we do with openhtc.io, and assign a separate alias to each stratum 1. In addition, the stratum 1 geo api recognizes when it is given a request from Cloudflare and tries to look up each stratum 1 name with an ip. prefix to find out the real IP address of the stratum 1s in order to do the geo sorting. This is easily done in Cloudflare by creating another DNS entry that is not proxied.
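The failure mode described above can be sketched with a toy model (revision numbers and object names are hypothetical; this is not real CVMFS code): the client reads a new catalog from one stratum 1, then the load balancer silently routes the data request to another stratum 1 that has not finished replicating.

```python
# Two stratum 1s behind one alias, replicating at different times.
s1_a = {"catalog_rev": 2, "objects": {"hash_old", "hash_new"}}
s1_b = {"catalog_rev": 1, "objects": {"hash_old"}}   # still replicating rev 2

def fetch(server, obj):
    # A missing object behaves like a failed read from the client's viewpoint.
    if obj not in server["objects"]:
        raise IOError(f"object {obj} not found: surfaces as an I/O error")
    return obj

# Client reads the rev-2 catalog from s1_a, which references hash_new ...
catalog = {"rev": s1_a["catalog_rev"], "files": ["hash_new"]}
# ... but the load balancer then routes the data request to s1_b:
try:
    fetch(s1_b, catalog["files"][0])
except IOError as err:
    print(err)
```

With separate aliases per stratum 1, the client never switches servers mid-revision, so this mismatch cannot occur.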
Thanks for fixing the WPAD config. So it injects http://NULL for unknown servers if DIRECT is used? People are using lots of servers in lots of domains, so it doesn’t seem feasible to keep track of all of them in the WPAD config. Is there some other conditional that enables the use of WPAD, like CVMFS_CLIENT_PROFILE=single or CVMFS_USE_CDN=yes?
Yes, it’s important to be careful about how the Cloudflare mechanisms interact with the CVMFS ones. We’re using additional features of Cloudflare; the dynamic steering ensures that the closest origin is always used, so there is generally no switching between servers. Having all our origins behind one LB address makes the Cloudflare caching layer more efficient. This is typical practice with commercial CDNs, although it looks a bit different from the usual CVMFS way. GeoAPI doesn’t work well on the Canadian research network anyway. That said, I agree the setup you described makes sense when using only the free features of Cloudflare.
In the future you can contact Compute Canada (DRAC) support directly via https://docs.alliancecan.ca/wiki/Technical_support, although in this case it was beneficial to discuss it here, as I was not aware of the external WPAD mechanism that seems to have caused the issue (and it was not until four months later that I noticed this thread and the issue was fixed).
So it injects http://NULL for unknown servers if DIRECT is used?
No, it injects that if auto is used, there’s no squid found, and the servers are not using Cloudflare. If DIRECT is used, WPAD will not be contacted.
Is there some other conditional that enables the use of WPAD, like CVMFS_CLIENT_PROFILE = single or CVMFS_USE_CDN= yes ?
WPAD is used by default if CVMFS_CLIENT_PROFILE=single and CVMFS_HTTP_PROXY is unset, or any time somebody includes auto in CVMFS_HTTP_PROXY. CVMFS_USE_CDN=yes does not affect the setting of CVMFS_HTTP_PROXY.
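My reading of that rule, written out as a sketch (this is an interpretation of the explanation above, not CVMFS source code):

```python
# WPAD is consulted when CVMFS_HTTP_PROXY contains "auto", or when the
# client profile is "single" and CVMFS_HTTP_PROXY is unset (None here).
def wpad_consulted(client_profile=None, http_proxy=None):
    if http_proxy is not None:
        return "auto" in http_proxy.split(";")
    return client_profile == "single"

print(wpad_consulted(client_profile="single"))    # True: profile default kicks in
print(wpad_consulted(http_proxy="auto;DIRECT"))   # True: explicit auto
print(wpad_consulted(http_proxy="DIRECT"))        # False: WPAD never contacted
```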
Hmm, I see that in the default config-repo, unlike the egi and osg config-repos, CVMFS_USE_CDN is not automatically set to yes when CVMFS_HTTP_PROXY=DIRECT or auto;DIRECT, or even when CVMFS_CLIENT_PROFILE=single. In effect it is set that way for the computecanada.ca domain (because that uses the CDN in those cases anyway), but not for everyone else. I think that is a problem, especially with auto;DIRECT, because the WPAD would then inject http://NULL. That should probably be fixed.
I’ll check tomorrow in a fresh VM and see if things are working, thanks so much for investigating this!
I evaluated CVMFS a few months ago when I originally made this post and decided not to use it at the time, but I was just about to re-evaluate it and contact the Compute Canada team, so this is great timing to find out that something actually was wrong.