Determining file usage data in Stratum 0

fayora · May 9, 2022, 10:03am

The project I am working on requires understanding how files in the CernVM-FS platform are being used. For example, the number of times a file has been accessed, or the last time a file was accessed (i.e., “accessed” = copied/downloaded by a client).

I see in the docs that I can turn on usage statistics on the client, but that would not work for this scenario because I want to know the consolidated usage on the server side, and for all users.

I found how to get some of that info from the Apache2 logs, but the files are referenced with their UID and not their file name.

I then looked for the cvmfs file catalog database, where I understand I can get the UID:file name cross-reference information, but I cannot find it! (The catalog is described here https://cvmfs.readthedocs.io/en/stable/cpt-details.html?highlight=UID#file-catalog but its location is not mentioned.)

Questions:

What is the best way to get usage information for files on the server? (e.g., number of times a specific file has been accessed, last time a specific file was accessed, etc.)
What is the path to where the file catalog database is stored on the server?
Where else can I find the cross-reference information for UID:file name?

Many thanks!

dwd · May 9, 2022, 6:33pm

First of all, the easiest thing to do to find what files are being accessed on the server is to configure the client with CVMFS_SEND_INFO_HEADER=yes. Then if you include the cvmfs-info HTTP header in the squid and apache logs, each log entry will include the original path in addition to the storage hash path. The OSG client configuration rpm always sets that, and the frontier-squid rpm includes that header by default in its logs.
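Once the header is in the logs, counting accesses and finding the last access per file is a small parsing job. Here is a rough Python sketch of that idea. The log layout in the regular expression is a simplified, hypothetical one (timestamp, request, status, then the cvmfs-info value in quotes), not the actual default of apache or frontier-squid, so the pattern would have to be adapted to whatever log format is really configured. The function name summarize is also just made up for illustration, and the header value is treated as an opaque string.

```python
import re
from datetime import datetime

# ASSUMPTION: a simplified, hypothetical log layout such as
#   [09/May/2022:18:33:00 +0000] "GET /cvmfs/<repo>/data/ab/cdef HTTP/1.1" 200 "<cvmfs-info value>"
# Adapt this pattern to the log format actually configured on your server.
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "GET (?P<url>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) "(?P<info>[^"]*)"'
)


def summarize(lines):
    """Return {cvmfs-info value: (access count, last access time)}."""
    stats = {}
    for line in lines:
        m = LOG_LINE.search(line)
        # Skip lines that do not match, failed requests, and requests
        # that arrived without the info header.
        if not m or m.group("status") != "200" or not m.group("info"):
            continue
        when = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        key = m.group("info")
        count, last = stats.get(key, (0, when))
        stats[key] = (count + 1, max(last, when))
    return stats
```

Feeding it the lines of a log file, e.g. summarize(open("access.log")), gives one entry per distinct header value. The counts only cover requests that actually reached the server whose log is being read.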

Secondly, keep in mind that, due to the CVMFS caching design, server side statistics can never provide full usage statistics. You could count the number of times a URL shows up in the stratum 1 logs, but that’s not going to count the number of accesses to the site squids nor the number of times the same file is accessed on each client’s cache. Aggregating client statistics is the only way to add up all accesses.

The hash of the root catalog in each repository is specified in the .cvmfspublished file, in the line beginning with ‘C’. The path to the catalog is based on that hash: it starts with ‘data/’, followed by the first 2 characters of the hash, followed by ‘/’ and the rest of the hash. If you want to do any exploring inside the catalog I recommend the python-cvmfsutils package.
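To make the path rule concrete, here is a minimal Python sketch that derives the catalog path from the contents of .cvmfspublished. The function name root_catalog_path is invented for this example. Two things in it are assumptions to verify against your own repository: that the text part of the file ends at a line containing only ‘--’ (with a binary signature after it, which is why the input is handled as bytes), and the optional suffix parameter, included because stored catalog objects may carry a trailing type letter such as ‘C’ after the hash.

```python
def root_catalog_path(manifest: bytes, suffix: str = "") -> str:
    """Build 'data/<first 2 chars>/<rest of hash>' from .cvmfspublished bytes.

    `suffix` is an ASSUMPTION: pass e.g. "C" if the catalog objects in your
    repository storage are named with a trailing type letter.
    """
    for raw in manifest.split(b"\n"):
        # ASSUMPTION: a line holding only '--' ends the text fields;
        # whatever follows is signature data, not key/value lines.
        if raw == b"--":
            break
        if raw.startswith(b"C"):
            digest = raw[1:].decode("ascii").strip()
            return f"data/{digest[:2]}/{digest[2:]}{suffix}"
    raise ValueError("no line beginning with 'C' found")
```

For example, a file whose ‘C’ line is Cabcdef0123 yields data/ab/cdef0123, relative to the repository’s top-level directory on the server.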

Dave

fayora · May 10, 2022, 3:41am

Thanks @dwd! Brilliant info. I will look into that client-side setting, and might come back with more questions depending on how it goes.