CVMFS to distribute software to laptops of organization?

angel.de.vicente · January 14, 2024, 5:48pm

Hello,

I just learned about CVMFS recently and I’m starting to get familiar with it (so far with the client side, trying the EESSI repo).

I wanted to ask for opinions on whether it could be a good fit for distributing software to the laptops in our organization. For security reasons, users cannot install software on their laptops, so any time they need scientific software from the list we support, they have to contact the support team. That probably means a lot of unnecessary work for the support team and delays for the user.

I haven’t looked into the server side of things yet, but from using the client with the EESSI repo, I think the user experience could be a nice one. As long as the laptop has an internet connection, the user could “install” any software just by trying to use it (the first time it will be slower, but certainly faster than asking the support team for a traditional install). With a sufficiently large client cache configured, most users would have all the software they regularly use on their laptops, and they could work properly and fast most of the time even without an internet connection.

Is there something in CernVM-FS that would discourage this scenario? Since I was planning this just for laptops at our organization, I was thinking that a Stratum-0 server in our LAN, and perhaps a further Stratum-1 for HA, would be sufficient, since users (around 200-300) would have access to the LAN, either by working at our premises or by connecting via VPN.

Any ideas/suggestions/pointers very welcome.

Thanks a lot,
Angel de Vicente

vavolkl · January 15, 2024, 9:04am

Hi Angel,

welcome to the forum!

Even though the primary use case is software distribution to clusters, in practice CVMFS is already used on many laptops. Many developers use it for the ease of setting up dependencies and having the same setup as others to reproduce a problem.

So I would say that this is indeed possible; however, user satisfaction will depend very much on the quality of the network. With slower connections there will be noticeable latency in interactive use, and offline use is currently still limited: you can only use what has already been downloaded into the cache. Caching works on a file-by-file basis, not package-by-package, so setting up or importing a Python module does not necessarily mean that all the files of the module are in the cache. Allowing the caching of full packages is something that we are looking to implement.
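Until per-package caching exists, one possible workaround is to read every file of a package once while online, so that all of its objects land in the local cache. A minimal sketch, assuming a hypothetical package directory under /cvmfs; `warm_cache` is an illustrative helper name, not part of CVMFS:

```shell
#!/bin/sh
# warm_cache DIR: read every regular file under DIR once and discard the data.
# On a CVMFS mount, reading a file makes the client fetch it into the local cache.
# Prints the number of regular files found. (Hypothetical helper, not a CVMFS tool.)
warm_cache() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # Read the full content of each file; the output itself is thrown away.
    find "$dir" -type f -exec cat {} + > /dev/null
    # Report how many files were touched.
    find "$dir" -type f | wc -l | tr -d ' '
}

# Example (the path is an assumption):
# warm_cache /cvmfs/sie.iac.es/software/mytool
```

This only helps if the cache quota is large enough to hold the whole package, and it has to be repeated after the package is updated in the repository.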

One important point: EESSI is already very comprehensive, but you might still get requests for software that is not available there, so you should check whether the process of adding new software to EESSI or to a dedicated repository works for you (it should be easy!).

CVMFS should also be easy to set up across many machines, and it can be complemented with other methods: many developers set up an environment from CVMFS and then pip install something on top.

For 200-300 users the server infrastructure seems sufficient, though it will likely be good to add some caching.
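The caching layer is typically a forward proxy such as Squid placed between the clients and the stratum servers. A client-side sketch, assuming a hypothetical proxy host name; the `;DIRECT` fallback group lets a laptop contact the server directly when the proxy cannot be reached:

```shell
# /etc/cvmfs/default.local (fragment, sourced as shell by the client)
# Proxy host and port are placeholders for a site cache.
# ';' separates fail-over groups: try the proxy first, then connect directly.
CVMFS_HTTP_PROXY="http://squid.example.org:3128;DIRECT"
```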

I hope this helps - feel free to follow up if you have further questions. I would also encourage you to discuss software availability and additions with EESSI.
Cheers,
Valentin

angel.de.vicente · January 15, 2024, 9:47am

Hello,

Many thanks for the reply. This answers one of the questions I had (whether the local cache was file-by-file or package-by-package). I agree that a package-by-package cache would probably be better for my scenario, but am I right that if the cache never gets full, nothing will ever be deleted from the laptop? I was thinking of setting aside something like a 50GB cache on each laptop, which should be more than enough to store all the applications that one of our regular users would need, so I was hoping that after the latency of the first use, usage would feel quite similar to a local install.

In any case, your reply is encouraging enough for me to investigate further, so I will try to go the EESSI route or perhaps a dedicated repository (since we have a lot of domain-specific software), and I will surely come back with more questions.

Thanks,

angel.de.vicente · January 16, 2024, 5:48pm

Today I started playing with the CVMFS server, just to get an idea of what is involved, but I’m having difficulty getting my first server-client connection working. What would be the best troubleshooting strategy?

So far I have just set up a virtual machine with Ubuntu, installed cvmfs-server, and followed the steps to create a basic repository. This step seems to be OK:

vagrant@ubuntu-jammy:~$ sudo cvmfs_server info sie.iac.es
Repository name: sie.iac.es
Created by CernVM-FS 143
Stratum1 replication allowed: yes
Whitelist is valid for another 29 days

Client configuration:
Add sie.iac.es to CVMFS_REPOSITORIES in /etc/cvmfs/default.local
Create /etc/cvmfs/config.d/sie.iac.es.conf and set
  CVMFS_SERVER_URL=http://localhost/cvmfs/sie.iac.es
  CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/sie.iac.es.pub
Copy /etc/cvmfs/keys/sie.iac.es.pub to the client

On my client laptop (which I have already used to connect successfully to atlas.cern.ch and software.eessi.io), I did the following.

In /etc/cvmfs/default.local I have:
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,sft.cern.ch,lhcb.cern.ch,lhcbdev.cern.ch,software.eessi.io,sie.test,sie.iac.es
In /etc/cvmfs/domain.d/iac.es.conf I have:

if [ "$CVMFS_CONFIG_REPO_DEFAULT_ENV" = "" ]; then
  # Use the configuration in this package only if the config repository is not
  # mounted. Note that in this case the cvmfs client writes a warning to syslog
  # because CVMFS_CONFIG_REPOSITORY is set.

  # Stratum 1 servers for the eessi.io domain
  if [ "$CVMFS_USE_CDN" = "yes" ]; then
    CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1ral-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1bnl-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@"
  else
    CVMFS_SERVER_URL="http://xxx.xxx.xxx.xxx/cvmfs/@fqrn@"
  fi

  # Public keys for the eessi.io domain
  CVMFS_KEYS_DIR=/etc/cvmfs/keys/iac.es

  # The cern.ch stratum 1 servers support the Geo-API
  CVMFS_USE_GEOAPI=yes
fi

(This is a copy of the file from my EESSI configuration, with only CVMFS_SERVER_URL and CVMFS_KEYS_DIR changed. The IP address is that of the local virtual machine, and I have verified that I can ping it from the client.)

In the directory /etc/cvmfs/keys/iac.es I have copied the public key file /etc/cvmfs/keys/sie.iac.es.pub from the server, saved as iac.es.pub.
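For comparison, the per-repository file suggested by the `cvmfs_server info` output earlier in this post would look roughly like the sketch below. The server address is a placeholder and the `DIRECT` proxy setting is an assumption (no site proxy); the other two values follow the server's own hint:

```shell
# /etc/cvmfs/config.d/sie.iac.es.conf (sourced as shell by the client)
# Placeholder address: use the server host as reachable from the client.
CVMFS_SERVER_URL="http://192.0.2.10/cvmfs/@fqrn@"
# Public key copied from the server to the same path on the client.
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/sie.iac.es.pub
# Assumption: no caching proxy yet, so connect to the server directly.
CVMFS_HTTP_PROXY=DIRECT
```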

In the client /etc/fstab file I have the line:
sie.iac.es /cvmfs/sie.iac.es cvmfs noauto,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.idle-timeout=5min,x-systemd.device-timeout=10,_netdev 0 0

and I reload and mount all entries in the file.

But when I do a
ls /cvmfs/sie.iac.es

I just get:

ls: cannot access '/cvmfs/sie.iac.es': No such file or directory

Being new to all this, I’m not sure whether the problem is with the client configuration, the access to the repository, the repository configuration, or something else.

Any pointers on how to best debug this?

Many thanks,

angel.de.vicente · January 16, 2024, 6:13pm

Never mind. From my laptop I could not access the Apache server inside the virtual machine. Some port forwarding later, I’m now able to mount the newly created repo without problems.
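For anyone hitting the same symptom: a quick first check is whether the client can fetch the repository manifest over HTTP at all, since a successful ping does not prove that the web server is reachable. A sketch, assuming `curl` is installed; the helper names `manifest_url` and `check_repo` and the example address are illustrative, not part of CVMFS:

```shell
#!/bin/sh
# manifest_url BASE REPO: print the URL of the repository manifest.
# A CVMFS repository serves a .cvmfspublished file at its top level.
manifest_url() {
    # Strip one trailing slash from BASE so the URL has no double slash.
    printf '%s/%s/.cvmfspublished\n' "${1%/}" "$2"
}

# check_repo BASE REPO: succeed only if the manifest can be downloaded.
check_repo() {
    url=$(manifest_url "$1" "$2")
    if curl --silent --fail --max-time 10 --output /dev/null "$url"; then
        echo "OK: $url"
    else
        echo "FAILED: $url" >&2
        return 1
    fi
}

# Example (the address is a placeholder):
# check_repo http://192.0.2.10/cvmfs sie.iac.es
```

If the manifest is reachable, `cvmfs_config chksetup` and `cvmfs_config probe sie.iac.es` on the client are reasonable next steps.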
Cheers,

rptaylor · January 16, 2024, 7:34pm

Yes, I think this can be a nice use case for CVMFS.

It’s important to consider the software compatibility aspects too.
If the laptops at your lab are managed devices, e.g. only one operating system needs to be supported, it should be relatively straightforward to just distribute software compiled for that specific OS (i.e. with dependencies on libraries provided by that OS).

However, the technique we developed ensures that the software on CVMFS has all dependencies internally satisfied and does not depend on any libraries from the host OS, not even glibc. This way the software is guaranteed to run identically on every version/flavour/distribution of Linux. It is sort of like a form of software-level virtualization, with the advantages of both dynamic linking and static linking but without the disadvantages of either. EESSI followed us in using the same technique to build their software stack. It can be a lot of work to build a comprehensive software stack in this manner. The Compute Canada stack, /cvmfs/soft.computecanada.ca, is available out of the box and provides a great many scientific packages, so if that or the EESSI repo meets your needs, it might be worth considering one of the existing repositories.
https://docs.alliancecan.ca/wiki/Accessing_CVMFS

Another thing to keep in mind is accessibility of the repositories. You mention only deploying stratum servers in your LAN, but laptops can go anywhere and people travel around the world. The latency of connecting remotely over a VPN to a stratum server from a long distance could make the repository sluggish. Also, caching proxy servers are an important layer in content delivery with CVMFS, but they are not a good fit for mobile devices that can be anywhere on any network. For that reason we use a commercial CDN (Cloudflare) to ensure that content can be cached anywhere in the world where users are accessing our repos. (The openhtc.io servers do the same thing.)

Also, yes, files remain in the local cache; the cache only gets cleaned up when it becomes full.
Note that the objects in the cache are actually file chunks (<= whole files). This is related to deduplication, which can save a lot of space in the caches.
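The 50GB cache discussed earlier in the thread would be set through the client's quota parameter. A sketch of the relevant fragment, assuming the usual default cache location; `CVMFS_QUOTA_LIMIT` is expressed in megabytes:

```shell
# /etc/cvmfs/default.local (fragment, sourced as shell by the client)
# Soft limit for the local cache, in megabytes (roughly 50 GB here).
CVMFS_QUOTA_LIMIT=50000
# Cache location; /var/lib/cvmfs is the usual default, shown only for clarity.
CVMFS_CACHE_BASE=/var/lib/cvmfs
```

The partition holding the cache needs some free space beyond the limit, and the new value typically takes effect after a `cvmfs_config reload` or a remount.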

angel.de.vicente · January 18, 2024, 12:01am

Many thanks for the reply. I will take note of all the points that you raise and in the coming weeks study whether CVMFS is a good fit for our needs.

As of today I have a proper server and a repository with some of the software that we support, and I will slowly migrate some of our machines to use this service instead of the current system.

My idea is to monitor possible problems that may arise for the users, and also the load in the server machine, etc.

I hope to learn a lot during these coming weeks, and I’m certain I’ll come back at some point with further questions.

Cheers,