CVMFS S3 interface

Hi,

I have a couple of questions about what can be achieved by using CVMFS as an interface to an existing S3 data store.

  1. Can grafting and external files be used to redirect clients to download data files from a public S3 bucket? Is this possible if only an HTTPS address is available for the bucket?

  2. Can the S3-backed storage options for cvmfs_server mkfs be used to create a new cvmfs instance based on an existing S3 bucket? Ideally with read-only access, so the cvmfs server doesn’t modify anything in that bucket. Again, if so, can the S3 bucket be accessed over HTTPS?

Thanks!

Hi Stephen!

For 1, yes, that’s possible, although the tooling may need some polish.

The docs are a bit scattered here, so let me prepare a better example.

  2. This is possible in principle, but needs a bit more tooling (in the next release we’ll have the new ingestql command). The main question here is whether you need checksumming.

Cheers,
Valentin

Great, thanks for the quick reply! That all sounds good. I think the external file method suits my use-case better than the remote backend storage, as I assume the data transfer has to go S3->server->client in the second case instead of just S3->client?

I did have issues with the HTTPS part for a little while, but it all works smoothly now that I have downloaded a CA bundle PEM and stored it where the cvmfs process can actually access it. I’ll share my simple example below in case it’s useful for anyone else. I can also confirm the same setup worked for accessing data from a public S3 bucket.
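A minimal sketch of a sanity check for the CA bundle problem described above. The function name check_ca_bundle is hypothetical and not part of cvmfs; it only tests that the file is readable by the calling user and contains at least one PEM certificate header.

```shell
#!/bin/sh
# Hypothetical helper (not a cvmfs tool): report whether a CA bundle
# is readable by the current user and contains a PEM certificate.
check_ca_bundle() {
    bundle="$1"
    if [ ! -r "$bundle" ]; then
        # Missing file, or permissions do not allow this user to read it.
        echo "unreadable"
        return 1
    fi
    if ! grep -q 'BEGIN CERTIFICATE' "$bundle"; then
        # File is readable but holds no PEM-encoded certificate.
        echo "no-certificates"
        return 1
    fi
    echo "ok"
}
```

Because the point is that the cvmfs process itself must be able to read the file, the check is most useful when run as the user the client runs as (often cvmfs, but that is an assumption about the local setup), against whatever path X509_CERT_BUNDLE points to.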

A couple of extra questions: (1) Could I confirm that when you use an external file the data transfer occurs directly between the client and the external server? It doesn’t go through the cvmfs server first?

And (2) for external data, you currently have to set the list of external URLs to search in the client config. I assume there’s no way to set this on the server side as part of the grafting process? So, for example, a client queries a cvmfs server, which then says where the external data is hosted, and the client uses that supplied URL to request the actual data?

Thanks!
Stephen

Example of getting an external, grafted file over HTTPS:

  1. Here is the first image of a cat that comes up on Google Images. It’s an HTTPS address.
    https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_square.jpg
  2. On my server I have a cvmfs repo made with the mkfs -X -Z none options suggested in the cvmfs docs for default external publishing. I use cvmfs_swissknife to calculate the file hash for grafting. I store the dummy file under the same name and path in my cvmfs repo as it appears in the remote URL.
    wget https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_square.jpg
    sudo cvmfs_server transaction graft.test.x
    cat NationalGeographic_2572187_square.jpg | sudo cvmfs_swissknife graft -i - -o /cvmfs/graft.test.x/NationalGeographic_2572187_square.jpg
    sudo cvmfs_server publish graft.test.x
  3. Grab some CA certificates for external server authentication (HTTPS). I used the Mozilla CA certificate store from the curl website.
    wget https://curl.se/ca/cacert.pem -P /etc/cvmfs/
  4. On the client side I set up the default.local config to include my new repo, add the public key to /etc/cvmfs/keys/graft.test.x.pub, and put the following in /etc/cvmfs/config.d/graft.test.x.conf:
    CVMFS_SERVER_URL=http://my-url.com/cvmfs/graft.test.x
    CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/graft.test.x.pub
    CVMFS_EXTERNAL_URL=https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e
    X509_CERT_BUNDLE=/etc/cvmfs/cacert.pem
  5. Mount the cvmfs repos and inspect the data:
    cvmfs_config probe
    xdg-open /cvmfs/graft.test.x/NationalGeographic_2572187_square.jpg
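The example above relies on the grafted file sitting at the same path in the repo as it has under the external base URL. A rough sketch of that mapping, assuming the client simply appends the path relative to the repo root to CVMFS_EXTERNAL_URL (the helper name external_url_for is hypothetical, not a cvmfs command):

```shell
#!/bin/sh
# Hypothetical helper: build the URL a client would be expected to request
# for a grafted file, assuming URL = external base + path inside the repo.
external_url_for() {
    base="${1%/}"        # external base URL, one trailing slash removed
    repo_path="${2#/}"   # path relative to the repo root, leading slash removed
    printf '%s/%s\n' "$base" "$repo_path"
}

# Values taken from the example above.
external_url_for \
    "https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e" \
    "NationalGeographic_2572187_square.jpg"
```

With the values from the example this prints the original image URL, which is a quick way to check that the path chosen inside the repo and the CVMFS_EXTERNAL_URL setting line up before publishing.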

Stephen,

For question 1, yes, the data goes directly from the external server to the client; it does not pass through the cvmfs server.

For question 2, your assumption is correct, there’s not a way to encode the host server in the grafted repository. The feature was designed to be able to support distributed external data caches, so the clients need to know which one to contact. However, as of version 2.12 there is a server-side way to set the external servers, using CVMFS_EXTERNAL_METALINK. It’s a separate server that returns a list of external server URLs to contact.

Dave