Swift (S3) Large Object Support

Hi all,

I’m attempting to set up a Swift-backed replica of some repositories that I run. Unfortunately, one of these repos contains some fairly large objects. I don’t control the format: these are genomic index files for specific bioinformatics tools, so the file format is determined by the tool authors.

Swift has a 5 GB per-object size limit, but it does support splitting larger objects into segments. I don’t have any experience with this, but it looks like the client performing the upload needs to support it, while downloading large objects is transparent.
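For illustration, my understanding is that a segmented upload with the standard python-swiftclient CLI looks roughly like this (the container name, path, and segment size here are purely illustrative):

# the client splits the file into 1 GB segments plus a manifest object;
# downloads of the manifest are reassembled transparently by Swift
swift upload --segment-size 1073741824 galaxy-cvmfs path/to/large-index-file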

Is there any way to support this in the CVMFS S3 code? Here’s what the debug level 3 output of a snapshot looks like:

--> Object: 'data.galaxyproject.org/data/2d/4dd2166358b4429e9aaf86accb6962a0f802fe'
--> Bucket: 'galaxy-cvmfs'
--> Host:   'js2.jetstream-cloud.org:8001'
    [03-25-2022 19:41:08 UTC]
(s3fanout) HEAD string to sign for: data.galaxyproject.org/data/2d/4dd2166358b4429e9aaf86accb6962a0f802fe    [03-25-2022 19:41:08 UTC]
(s3fanout) CallbackCurlSocket called with easy handle 0x7fbdf0000c60, socket 35, action 2, up 29892672, sp 0, fds_inuse 2, jobs 1    [03-25-2022 19:41:08 UTC]
(s3fanout) curl_multi_socket_action: 0 - 1    [03-25-2022 19:41:08 UTC]
(s3fanout) CallbackCurlSocket called with easy handle 0x7fbdf0000c60, socket 35, action 1, up 29892672, sp 0, fds_inuse 3, jobs 1    [03-25-2022 19:41:08 UTC]
(s3fanout) http status error code [info 0x7fbdd40013d0]: HTTP/1.1 404 Not Found
    [03-25-2022 19:41:08 UTC]
(s3fanout) CallbackCurlSocket called with easy handle 0x7fbdf0000c60, socket 35, action 4, up 29892672, sp 0, fds_inuse 3, jobs 1    [03-25-2022 19:41:08 UTC]
(s3fanout) Verify uploaded/tested object data.galaxyproject.org/data/2d/4dd2166358b4429e9aaf86accb6962a0f802fe (curl error 0, info error 6, info request 1)    [03-25-2022 19:41:08 UTC]
(s3fanout) not found: data.galaxyproject.org/data/2d/4dd2166358b4429e9aaf86accb6962a0f802fe, uploading    [03-25-2022 19:41:08 UTC]
cvmfs_swissknife_debug: /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc8/build/BUILD/cvmfs-2.9.0/cvmfs/s3fanout.cc:791: bool s3fanout::S3FanoutManager::MkPayloadHash(const s3fanout::JobInfo&, std::__cxx11::string*) const: Assertion `nbytes == info.origin->GetSize()' failed.
/usr/bin/cvmfs_server: line 7404: 1741113 Aborted                 $user_shell "$(__swissknife_cmd dbg) pull -m $name         -u $stratum0                                           -w $stratum1                                           -r ${upstream}                                         -x ${spool_dir}/tmp                                    -k $public_key                                         -n $num_workers                                        -t $timeout                                            -a $retries $with_history $with_reflog                    $initial_snapshot_flag $timestamp_threshold $log_level"

I created this topic in the Stratum 1 forum since that’s what I’m working with, but presumably the same applies for a Swift-backed Stratum 0 as well.

Thanks!
–nate

Hi Nate,

Apologies for the late reply! I’ll review the forum notification settings to prevent stale forum entries in the future.

I think the best approach is to use cvmfs’ application-level object splitting. While it is conceivable to support Swift’s format, large objects (say > 100 MB) generally cause disproportionately heavy stress on the infrastructure, from the stratum 0 through the site caches down to the client caches. If you can contact the repository owners (or put us directly in touch), I’d advise updating the cvmfs-server package to the latest version and ensuring that CVMFS_GENERATE_LEGACY_BULK_CHUNKS is turned off or unset and that CVMFS_USE_FILE_CHUNKING is turned on or unset. That would automatically chunk newly published large files into blocks of a few tens of megabytes each.
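For reference, a minimal sketch of what that would look like in the repository’s server configuration, assuming the usual /etc/cvmfs/repositories.d/<repository>/server.conf location:

# chunk large files into smaller pieces at publish time
CVMFS_USE_FILE_CHUNKING=true
# do not additionally store the whole-file ("bulk") object for chunked files
CVMFS_GENERATE_LEGACY_BULK_CHUNKS=false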

If the repository makes use of versioning, it is possible that the latest version already uses chunking and it’s just the historic versions that still contain large objects. If you can give me a stratum 1 URL, I can check if this is the case.

Once the publishing settings are reviewed, the repository owners can republish existing large files to ensure they get chunked. To do so, it is sufficient to call touch /cvmfs/<file name> in a transaction, as sketched below. Please let me know if you need instructions on how to identify large, unchunked files in the repository. We did this “chunking campaign” for the LHC repositories some time ago. It does require a day or so of effort per repository, but it is a one-time maintenance operation and it pays off for the entire cvmfs infrastructure.
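For one of the repositories in question, that would be something along these lines (the file path is a placeholder):

# open a transaction, touch the large file so it gets re-ingested with
# chunking enabled, then publish
cvmfs_server transaction data.galaxyproject.org
touch /cvmfs/data.galaxyproject.org/path/to/large-index-file
cvmfs_server publish data.galaxyproject.org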

Cheers,
Jakob

Hi Jakob,

This is fantastic to hear. I am the repo owner, so I can do this for my own repos. If you’d like to investigate, a Stratum 1 is cvmfs1-psu0.galaxyproject.org, and the repos with the largest objects would be data.galaxyproject.org and singularity.galaxyproject.org. Instructions for finding large unchunked files would be greatly appreciated, thanks!

Would I need to remove old repo revisions created prior to chunking, or is touching the large files in new revisions all that’s necessary? Will all the Stratum 1s that replicate these repos remove the old unchunked files as part of GC?

For historical purposes it might be helpful for us not to change the mtimes on this data. Is updating the atime enough to force chunking?

Thanks,
–nate

Thanks for the stratum 1 URL! I’m trying to get

curl -I http://cvmfs1-psu0.galaxyproject.org/cvmfs/data.galaxyproject.org/.cvmfspublished

But I get a squid error there:

HTTP/1.1 500 Internal Server Error
Server: squid/3.5.20
Mime-Version: 1.0
Date: Mon, 04 Apr 2022 15:18:22 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 4029
X-Squid-Error: ERR_CANNOT_FORWARD 0
Vary: Accept-Language
Content-Language: en
X-Cache: MISS from cvmfs1-psu0.galaxyproject.org
X-Cache-Lookup: MISS from cvmfs1-psu0.galaxyproject.org:80
Via: 1.1 cvmfs1-psu0.galaxyproject.org (squid/3.5.20)
Connection: keep-alive

It would be useful if I could check out the repository first. We have a conversion utility that removes the bulk versions of files: cvmfs_server eliminate-bulk-hashes. This only works, however, if all large files also have a chunked version (usually the case) and if previous revisions containing bulk files are not explicitly tagged as cvmfs repository snapshots. Older, untagged file system revisions are handled automatically by garbage collection.
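Assuming it follows the usual cvmfs_server <command> <fully qualified repository name> pattern, the invocation on the release manager machine would look roughly like:

cvmfs_server eliminate-bulk-hashes data.galaxyproject.org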

Cheers,
Jakob

Whoops, Apache had been OOM-killed. It’s up now.

All of our revisions are named snapshots… so I gather we are going to have problems here. It was never clear to me what the benefit was of not using named snapshots exclusively, although I see that the docs now recommend having no more than 50 of them.

I see, the named snapshots are a problem because we unfortunately don’t have the tools to edit them.

I’m considering two possible approaches. We could extend the catalog cleanup utility so that it also processes the named snapshots and writes out a new history database; this will take a little time to develop and test. A faster approach is a small patch to the replication code that skips the large bulk objects. While that would technically result in a broken stratum 1, it would most likely not matter in practice, because for many years now clients no longer request the complete object if chunks exist. I’ll discuss it with the team; if you have an opinion on either option, please let us know.

In any case, it is a good idea to remove the bulk hashes for the catalog head so that at least the top of the history is clean.

Cheers,
Jakob

As an alternative, could we simply remove all the named snapshots?

That’s a possibility if you can do without them…?

So that I understand: this doesn’t remove the data in the snapshots, it simply removes the ability to mount the repository at that revision?

If you have garbage collection enabled, then the files that are only referenced by the historic snapshots will be removed. Also, stratum 1 servers won’t replicate the data from these removed snapshots (which is of course why this approach would work). If this is an issue, let us check how much work it would be to extend the eliminate-bulk-hashes tool.

Are only the files added/changed in a snapshot referenced by that snapshot, or does it reference every file that exists in the repo? Assuming it’s only the added/changed files, then yes, removing the named snapshots would remove most of the data in our repos.

The snapshots work like tags in git. So if you have a file /foo in snapshots S1 and S2, then removing only S1 won’t delete /foo, but removing both S1 and S2 and running GC will remove /foo. There is always the snapshot called trunk (the latest file system state), so what you see when you mount cvmfs is not removed by deleting older snapshots.
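As a sketch, removing an old named snapshot and then collecting the data that is no longer referenced would look roughly like this (the tag name is a placeholder):

# remove a named snapshot (tag); the data stays until garbage collection runs
cvmfs_server tag -r old-tag-name data.galaxyproject.org
# remove objects no longer referenced by any remaining tag or by trunk
cvmfs_server gc data.galaxyproject.org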

Ah ok, so if all we care about is how the repo looks in trunk, then it is safe to remove all the old snapshots. The large repos in question are generally additive; we aren’t removing anything from them.

Thanks!

In any case, you can save the current history with cvmfs_server tag -lx, which shows the root hashes attached to each tag. I think we should add support to the catalog migration utility to also fix up named snapshots. Once this is done, you could re-install the repository history, provided no garbage collection has run in the meantime.
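For example, something like this would keep a record of the tags and their root hashes before you start removing them (the output file name is arbitrary):

cvmfs_server tag -lx data.galaxyproject.org > data.galaxyproject.org-tags.txt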