I have a couple of questions about what can be achieved when using CVMFS as an interface to an existing S3 data store.
1. Can grafting and external files be used to redirect clients to download data files from a public S3 bucket? Is this possible if only an HTTPS address is available for the bucket?
2. Can the S3-backed storage options for cvmfs_server mkfs be used to create a new cvmfs instance based on an existing S3 bucket? Ideally with read-only access, so the cvmfs server doesn’t modify anything in that bucket. Again, if so, can the S3 bucket be accessed over HTTPS?
For 1, yes, the tooling may need to be polished but that’s possible.
External files let you use your existing bucket, but this assumes that the directory structure in the bucket is the same as in the repository. You’ll still need to publish the file (for checksumming). If the repository is /cvmfs/test.repo.org, it just lets the client also fetch the file from $HTTP_EXTERNAL_URL/foo/bar. See “Creating a Repository (Stratum 0)” in the CernVM-FS 2.12.6 documentation.
The docs are a bit scattered here; let me prepare a better example.
For 2, this is possible in principle, but needs a bit more tooling (in the next release we’ll have the new ingestql command). The main question here is whether you need checksumming.
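To make the path mapping concrete, here is a minimal sketch of how a repository path would translate to an external URL when the bucket mirrors the repository layout. The helper name `external_url_for`, the bucket address, and the file path foo/bar are all illustrative assumptions; this is not code from CVMFS itself.

```shell
#!/bin/sh
# Hypothetical helper: map a path under /cvmfs/<repo>/ to the URL an external
# lookup would hit, assuming the bucket mirrors the repository's directory layout.
external_url_for() {
  base_url=${1%/}               # external base URL, one trailing slash removed
  repo=$2                       # repository name, e.g. test.repo.org
  path=$3                       # absolute path under /cvmfs
  rel=${path#"/cvmfs/$repo/"}   # strip the mount-point prefix
  printf '%s/%s\n' "$base_url" "$rel"
}

# Illustrative values only; bucket.example.org is a placeholder.
external_url_for "https://bucket.example.org" "test.repo.org" "/cvmfs/test.repo.org/foo/bar"
```

If the bucket layout differs from the repository layout, this one-to-one mapping no longer holds, which is the constraint mentioned above.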
Great, thanks for the quick reply! That all sounds good. I think the external file method suits my use-case better than the remote backend storage, as I assume the data transfer has to go S3->server->client in the second case instead of just S3->client?
I did have issues with the HTTPS part for a little while, but it all works smoothly now that I have downloaded a CA bundle PEM and stored it where the cvmfs process can actually access it… I’ll share my simple example below in case it’s useful for anyone else. I can also confirm the same setup worked for accessing data from a public S3 bucket.
A couple of extra questions: (1) Could I confirm that when you use an external file the data transfer occurs directly between the client and the external server? It doesn’t go through the cvmfs server first?
And (2) for external data, you currently have to set the list of external URLs to search in the client config. I assume there’s no way to set this on the server side as part of the grafting process? So, for example, a client queries a cvmfs server, which then says where external data is hosted, and the client uses that supplied URL to request the actual data?
Thanks!
Stephen
Example getting external, grafted file over HTTPS:
On my server I have a cvmfs repo made with the mkfs -X -Z none options suggested in the cvmfs docs for default external publishing. I use cvmfs_swissknife to calculate the file hash for grafting. I store the dummy file under the same name and path in my cvmfs repo as it appears in the remote URL.
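For anyone following along, the server-side steps could look roughly like the sketch below. It is an outline under assumptions, not necessarily the exact commands used here: the repository name graft.test.x is taken from the client config path mentioned in this post, the function `publish_grafted_file` and its arguments are hypothetical, and the `cvmfs_swissknife graft -i ... -o ...` form should be checked against the documentation for your version.

```shell
#!/bin/sh
# Sketch only: publish a grafted placeholder for a file that lives on an
# external server. Assumes the repository was created beforehand with
#   cvmfs_server mkfs -X -Z none graft.test.x
# publish_grafted_file and its arguments are hypothetical names.
publish_grafted_file() {
  repo=$1        # repository name, e.g. graft.test.x
  local_copy=$2  # local copy of the file, read only to compute its hash
  rel_path=$3    # path relative to the repo root; must match the remote URL path

  cvmfs_server transaction "$repo" || return 1
  mkdir -p "/cvmfs/$repo/$(dirname "$rel_path")"
  # Writes a placeholder plus graft metadata (hash and size) instead of the data.
  if cvmfs_swissknife graft -i "$local_copy" -o "/cvmfs/$repo/$rel_path"; then
    cvmfs_server publish "$repo"
  else
    cvmfs_server abort -f "$repo"
    return 1
  fi
}

# Example call (commented out so that sourcing this file changes nothing):
# publish_grafted_file graft.test.x /tmp/data/file.dat data/file.dat
```

Wrapping the steps in a function keeps the transaction from being left open if the graft step fails.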
Grab some CA certificates for external server authentication (HTTPS). I used the Mozilla CA certificate store from the curl website.
wget https://curl.se/ca/cacert.pem -P /etc/cvmfs/
On the client side I set up the default.local config to include my new repo, add the public key to /etc/cvmfs/keys/graft.test.x.pub, and put the following in /etc/cvmfs/config.d/graft.test.x.conf:
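A minimal sketch of what such a client configuration could contain is given below. It is not the exact file from this post: the hostnames are placeholders, the proxy settings are assumptions, and using X509_CERT_BUNDLE to point at the downloaded CA bundle is my assumption about how the certificate file is wired in.

```shell
# /etc/cvmfs/config.d/graft.test.x.conf (sketch; all hostnames are placeholders)

# Where the repository catalogs and regular objects come from (assumed URL).
CVMFS_SERVER_URL="http://cvmfs-server.example.org/cvmfs/@fqrn@"
CVMFS_PUBLIC_KEY=/etc/cvmfs/keys/graft.test.x.pub
CVMFS_HTTP_PROXY=DIRECT

# Where grafted (external) file contents are fetched from, over HTTPS.
CVMFS_EXTERNAL_URL="https://data.example.org"
CVMFS_EXTERNAL_HTTP_PROXY=DIRECT

# CA bundle downloaded with wget in the step above (assumed to be read via this variable).
X509_CERT_BUNDLE=/etc/cvmfs/cacert.pem
```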
For question 1, yes, the data goes directly from the external server to the client.
For question 2, your assumption is correct: there is no way to encode the host server in the grafted repository. The feature was designed to support distributed external data caches, so the clients need to know which one to contact. However, as of version 2.12 there is a server-side way to set the external servers, using CVMFS_EXTERNAL_METALINK. It points to a separate server that returns a list of external server URLs to contact.