Tags, rollbacks, and garbage collection

Dear all

I have to admit that I am not sure I have fully understood automated/explicit
tags, rollbacks, and garbage collection.

My use case is the following:

a- I have repos that are updated quite often
b- I want to be able to roll back to old (but < 30 days old) transactions
c- I want data used only in old (> 30 days) transactions to be deleted once it is no longer used
d- I am fine if the GC is done when I do a transaction on the repo

My understanding is that this can be implemented with these settings
in the server.conf file:

# Remove deleted files that are no longer used and were last used in transactions older than 30 days
CVMFS_AUTO_TAG_TIMESPAN="30 days ago"
CVMFS_GARBAGE_COLLECTION=true
CVMFS_AUTO_GC=true
CVMFS_AUTO_GC_TIMESPAN="30 days ago"
CVMFS_AUTO_TAG=true
CVMFS_GC_DELETION_LOG=/var/log/cvmfs/sgaravat.infn.it-gc.log

CVMFS_AUTO_TAG=true (even though the documentation suggests setting it to false
for GC-able repos) would be needed to get the automatic tags, which are
required for rollback operations when the “-a” option was not used with the
publish, e.g.:

cvmfs_server rollback -t "generic-2024-09-05T06:10:20.821Z" sgaravat.infn.it
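
For reference, a minimal sketch of the workflow this relies on: list the tags known for the repository, then roll back to one of them. The repository and tag names are the ones from the example above. The RUN variable and the two function names are hypothetical helpers added here so that the commands are only printed unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry run by default: the commands are printed, not executed.
# Set RUN="" in the environment to really run them (hypothetical switch).
RUN=${RUN-echo}

# list_tags REPO: show the tags (automatic and explicit) known for REPO.
list_tags() { $RUN cvmfs_server tag -l "$1"; }

# rollback_to REPO TAG: roll the repository back to the given tag.
rollback_to() { $RUN cvmfs_server rollback -t "$2" "$1"; }

list_tags sgaravat.infn.it
rollback_to sgaravat.infn.it "generic-2024-09-05T06:10:20.821Z"
```

Only tags that still exist can be used as a rollback target, so with CVMFS_AUTO_TAG_TIMESPAN="30 days ago" the listing should show roughly the last 30 days of automatic tags.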

I would appreciate some feedback. Did I get it right?

Thanks, Massimo

That looks right to me.

Dave

I tried this setting (but with “1 days ago” instead of 30 days).

I wrote some files in the repo and saw that the space used in the backend storage increased correspondingly.

Then, in another transaction, I removed these files.

Two days later, I made another transaction. The tags referring to the “old” transactions in which these data were written and removed were deleted, but it looks like the GC didn’t run (indeed, the storage space used in the S3 backend didn’t decrease for this repo).
Was the GC supposed to run during this transaction?

Thanks, Massimo

The frequency of auto gc is based on CVMFS_AUTO_GC_LAPSE, which defaults to 1 day ago. So after 2 days I think it should have run. CVMFS_AUTO_GC_TIMESPAN sets how long to keep transactions. That shouldn’t be set as low as 1 day ago. Maybe setting it as low as CVMFS_AUTO_GC_LAPSE confused it.
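
To make the difference between the two settings concrete, here is a rough sketch of the rule as described above: CVMFS_AUTO_GC_LAPSE decides whether a publish attempts a GC at all, and CVMFS_AUTO_GC_TIMESPAN decides which revisions are old enough to be swept. The function names are hypothetical and the time expressions are resolved with GNU date. This illustrates the logic only and is not the code that cvmfs_server actually runs.

```shell
#!/bin/sh
# Resolve a "N days ago" style expression to seconds since the epoch (GNU date).
to_epoch() { date -u -d "$1" +%s; }

# gc_is_due LAST_GC_EPOCH LAPSE
# True when the last GC happened before the lapse threshold.
gc_is_due() { [ "$1" -lt "$(to_epoch "$2")" ]; }

# revision_is_sweepable REVISION_EPOCH TIMESPAN
# True when the revision is older than the timespan that must be kept.
revision_is_sweepable() { [ "$1" -lt "$(to_epoch "$2")" ]; }

# Example: last GC two days ago, lapse of one day, 30 days of history kept.
last_gc=$(to_epoch "2 days ago")
old_rev=$(to_epoch "45 days ago")
new_rev=$(to_epoch "5 days ago")

gc_is_due "$last_gc" "1 day ago"               && echo "GC would be attempted"
revision_is_sweepable "$old_rev" "30 days ago" && echo "45-day-old revision: sweepable"
revision_is_sweepable "$new_rev" "30 days ago" || echo "5-day-old revision: kept"
```

With both values at “1 day ago” the two thresholds coincide, which is the situation suspected above of causing the confusion.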

Dave

Thanks a lot for your help.
I will then try with a longer CVMFS_AUTO_GC_TIMESPAN.

I am then confused about the GC setting on the stratum 1s.
If the repo on the stratum 0 is configured as in the first post in this thread, should I use the -g option in “add-replica”, e.g.:

cvmfs_server add-replica -z https://rgw.cloud.infn.it:443/cvmfs/sgaravat.infn.it /etc/cvmfs/keys/infn.it/common.infn.it.pub

?

And then do I simply have to run “cvmfs_server gc -af” regularly?

Thanks for your patience

Cheers, Massimo

No, the -g option on add-replica has nothing to do with gc, and the -z option does not exist there. The fact that a repository is garbage collectable gets encoded in the .cvmfspublished file, and the stratum 1 cvmfs_server gc -af option looks at those to find out which repositories are to be gc’ed. Yes, it is recommended to run that regularly on stratum 1s, typically once a week on weekends. Snapshots are blocked on a repository while it is being garbage collected.
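
As a sketch of the weekly run mentioned above, a cron entry along these lines could be used on the stratum 1. The function name, the Sunday 03:00 schedule, and the logger tag are assumptions chosen for illustration; only the “cvmfs_server gc -af” command comes from this thread. The entry is written to a scratch file first so it can be reviewed.

```shell
#!/bin/sh
# write_gc_cron FILE: write a cron.d style entry that runs the GC once a week.
# (Hypothetical helper; schedule and logger tag are illustrative choices.)
write_gc_cron() {
    cat > "$1" <<'EOF'
# Garbage collect all GC-enabled repositories every Sunday at 03:00.
# minute hour day-of-month month day-of-week user command
0 3 * * 0 root cvmfs_server gc -af 2>&1 | logger -t cvmfs_gc
EOF
}

# Write to a temporary file and show the result for review.
out=$(mktemp)
write_gc_cron "$out"
cat "$out"
```

After review, the file could be copied to a location such as /etc/cron.d/cvmfs-gc (a hypothetical file name). Since snapshots are blocked while a repository is being collected, a quiet time slot is preferable.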

Dave

Thanks a lot, but I still don’t understand when unused files are supposed to get removed in the stratum1.

If on the stratum 0 a publish operation triggers a GC and the deletion of some files on the backend, e.g.:

[sgaravat@lxsgaravat S3]$ s3cmd -c s3cfg-backbone-cvmfs.cfg du -H s3://cvmfs/testgarbage.infn.it/
   4G     553 objects s3://cvmfs/testgarbage.infn.it/
[root@cvmfs-s0-s3cloudveneto ~]# cvmfs_server transaction testgarbage.infn.it
[root@cvmfs-s0-s3cloudveneto ~]# cvmfs_server publish testgarbage.infn.it
Using auto tag 'generic-2024-10-08T08:26:29Z'
Processing changes...
Waiting for upload of files before committing...
Committing file catalogs...
Wait for all uploads to finish
Exporting repository manifest
Statistics stored at: /var/spool/cvmfs/testgarbage.infn.it/stats.db
Tagging testgarbage.infn.it
Flushing file system buffers
Signing new manifest
Running automatic garbage collection
  --> marking unreferenced objects [Tue, 08 Oct 2024 08:26:35 GMT]
  --> sweeping unreferenced objects [Tue, 08 Oct 2024 08:26:35 GMT]
      - 33%    1 / 3 unreferenced revisions removed [Tue, 08 Oct 2024 08:26:37 GMT]
      - 67%    2 / 3 unreferenced revisions removed [Tue, 08 Oct 2024 08:26:37 GMT]
      - 100%    3 / 3 unreferenced revisions removed [Tue, 08 Oct 2024 08:26:37 GMT]
  --> done garbage collecting [Tue, 08 Oct 2024 08:26:38 GMT]
Statistics stored at: /var/spool/cvmfs/testgarbage.infn.it/stats.db
Remounting newly created repository revision
[root@cvmfs-s0-s3cloudveneto ~]#
[sgaravat@lxsgaravat S3]$ s3cmd -c s3cfg-backbone-cvmfs.cfg du -H s3://cvmfs/testgarbage.infn.it/
  24K      21 objects s3://cvmfs/testgarbage.infn.it/

So 4 GB of data were removed from the backend used by the stratum 0.

On the stratum 1, should I expect the unused files to be removed immediately after running:

cvmfs_server snapshot testgarbage.infn.it
cvmfs_server gc -af

?

This is not happening in my setup …

server.conf in the stratum1 is the one created by the add-replica command, i.e.:

cvmfs_server add-replica -z https://rgw.cloud.infn.it:443/cvmfs/testgarbage.infn.it /etc/cvmfs/keys/infn.it/common.infn.it.pub

[root@s1-cvmfs-cnaf cvmfs]# cat /etc/cvmfs/repositories.d/testgarbage.infn.it/server.conf 
# Created by cvmfs_server.
CVMFS_CREATOR_VERSION=143
CVMFS_REPOSITORY_NAME=testgarbage.infn.it
CVMFS_REPOSITORY_TYPE=stratum1
CVMFS_USER=root
CVMFS_SPOOL_DIR=/var/spool/cvmfs/testgarbage.infn.it
CVMFS_STRATUM0=https://rgw.cloud.infn.it:443/cvmfs/testgarbage.infn.it
CVMFS_STRATUM1=http://localhost/cvmfs/testgarbage.infn.it
CVMFS_UPSTREAM_STORAGE=local,/srv/cvmfs/testgarbage.infn.it/data/txn,/srv/cvmfs/testgarbage.infn.it
CVMFS_SNAPSHOT_GROUP=
[root@s1-cvmfs-cnaf cvmfs]#

Thanks a lot: your help is really appreciated

Edit:

I retried after a while, and the “cvmfs_server gc -af” issued on the stratum1 deleted the unused files

“This is not happening in my setup …”

By “this” I assume you mean that the cleaning up of files didn’t happen. I assume the gc did run, correct? Anytime you run the gc directly (as opposed to the “auto” gc which runs as a side effect during stratum 0 publish) then the timespan for garbage collection is passed on the command line in the -t option rather than being set in the configuration. In the cvmfs_server script it looks like that defaults to 3 days ago. What do you have it set to on your stratum 0?
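
A minimal sketch of such a direct run with an explicit timespan, so that it matches the value configured on the stratum 0. It assumes that -t accepts the same “N days ago” style expression as the configuration setting. The RUN variable and the gc_repo function are hypothetical helpers added here, so the command is only printed unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set RUN="" in the environment to really run it (hypothetical switch).
RUN=${RUN-echo}

# gc_repo REPO TIMESPAN: collect one repository, preserving revisions
# newer than TIMESPAN (passed with -t); -f skips the confirmation prompt.
gc_repo() { $RUN cvmfs_server gc -f -t "$2" "$1"; }

gc_repo testgarbage.infn.it "2 days ago"
```

Note that in dry-run mode the printed command loses the quotes around the timespan; when running it by hand, the value has to be quoted as a single argument.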

Dave

I have this setting on the stratum 0:

CVMFS_AUTO_GC_TIMESPAN="2 days ago"