I have a rather large repository, divided into directories by micro-architecture and compiler. One of the directories is considerably larger than the others, so I decided to use the “Automatic Management of Nested Catalogs” feature. However, that leads to messages like the following:
Couldn't create a new nested catalog in any subdirectory of '/2022.1/apps/linux-centos7-x86_64/gcc-9.4.0' even though currently it is overflowed
Looking at the output of cvmfs_server list-catalogs:
I think that an occasional nearly 300K-entry catalog, as long as it isn’t the root catalog, is probably not going to be very harmful and you can ignore it.
Would it help if I bumped CVMFS_AUTOCATALOGS_MAX_WEIGHT to 200000? As far as I can tell, it defaults to 100000. My thinking is that if the deeper-level catalogs are allowed to be larger, the upper-level catalog would end up smaller. If I change that variable, how do I force a regeneration of the catalogs?
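If you do try raising the limit, the setting belongs in the repository's server.conf. A minimal sketch, assuming the standard stratum-0 layout (the repository name in the path is a placeholder):

```shell
# /etc/cvmfs/repositories.d/<repo>/server.conf
CVMFS_AUTOCATALOGS=true
# Split a nested catalog once it exceeds this many entries (default 100000)
CVMFS_AUTOCATALOGS_MAX_WEIGHT=200000
```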
It seemed the only way to regenerate them was to delete the .cvmfscatalog and .cvmfsautocatalog files and then do a transaction/publish cycle. I thought there might be an easier way to do that.
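For reference, that cycle looks roughly like the following; the repository name and the subtree path are hypothetical placeholders, so adjust them to your setup:

```shell
# Open a transaction on the repository (name is a placeholder)
cvmfs_server transaction example.cern.ch
# Drop the existing nested-catalog markers under the overflowed subtree
find /cvmfs/example.cern.ch/2022.1/apps \
     \( -name .cvmfscatalog -o -name .cvmfsautocatalog \) -delete
# Publishing re-runs the automatic catalog management from scratch
cvmfs_server publish example.cern.ch
```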
The main reason I don’t trust it is that I don’t see how it can make good choices about where to put the catalogs without any understanding of which pieces might be used together. If every use case ends up loading all the subcatalogs anyway, there’s not much efficiency gained by splitting them up.
There is another tool that is helpful for determining which .cvmfsdirtab patterns to include. I often use the “catdirusage” tool from the cvmfs-contrib/python-cvmfsutils repository on GitHub, which reports how many files the current catalog contains in each subdirectory under a supplied path, sorted in increasing order. Example usage for the root directory on grid.cern.ch: