I have a rather large repository, divided into directories by micro-architecture and compiler. One of the directories is considerably larger than the others, so I decided to use the “Automatic Management of Nested Catalogs” feature. However, that leads to messages like the following:
Couldn't create a new nested catalog in any subdirectory of '/2022.1/apps/linux-centos7-x86_64/gcc-9.4.0' even though currently it is overflowed
Looking at the output of cvmfs_server list-catalogs:
I think that an occasional nearly 300K-entry catalog, as long as it isn’t the root catalog, is probably not going to be very harmful and you can ignore it.
Would it help if I bumped CVMFS_AUTOCATALOGS_MAX_WEIGHT to 200000? As far as I can tell, it defaults to 100000. My thinking is that if the deeper-level catalogs are allowed to be larger, the upper-level catalog would end up smaller. If I change that variable, how do I force a regeneration of the catalogs?
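If you do try raising the limit, the setting belongs in the repository's server.conf. A minimal sketch, assuming the standard stratum-0 layout (the repository name in the path is a placeholder):

```shell
# /etc/cvmfs/repositories.d/<repo>/server.conf
CVMFS_AUTOCATALOGS=true
# Split a nested catalog once it exceeds this many entries (default 100000)
CVMFS_AUTOCATALOGS_MAX_WEIGHT=200000
```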
It seemed the only way to regenerate them was to delete the .cvmfscatalog and .cvmfsautocatalog files and then do a transaction/publish cycle. I thought there might be an easier way to do that.
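For reference, that cycle looks roughly like the following; the repository name and the subtree path are hypothetical placeholders, so adjust them to your setup:

```shell
# Open a transaction on the repository (name is a placeholder)
cvmfs_server transaction example.cern.ch
# Drop the existing nested-catalog markers under the overflowed subtree
find /cvmfs/example.cern.ch/2022.1/apps \
     \( -name .cvmfscatalog -o -name .cvmfsautocatalog \) -delete
# Publishing re-runs the automatic catalog management from scratch
cvmfs_server publish example.cern.ch
```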
The main reason I don’t trust it is that I don’t see how it can make good choices about where to put the catalogs without any understanding of which pieces might be used together. If every use case ends up loading all the subcatalogs anyway, there’s not much efficiency gained by splitting them up.
There is another tool that is helpful for determining which .cvmfsdirtab patterns to include. I often use the “catdirusage” tool from the cvmfs-contrib/python-cvmfsutils repository on GitHub, which reports how many files the current catalog contains in each subdirectory under a supplied path, sorted in increasing order. Example usage for the root directory on grid.cern.ch: