SingCVMFS: Transport endpoint is not connected

Hello,

I am trying to use cvmfsexec, and more specifically singcvmfs, on a cluster where CVMFS is not mounted on the nodes.
I installed the package using the following commands:

git clone https://github.com/cvmfs/cvmfsexec.git

# The /tmp space is not available
export TMPDIR=/path/to/a/tmp/directory

cvmfsexec/makedist -s default
cvmfsexec/makedist -s -o /path/to/singcvmfs

Then I used the package as follows:

export SINGCVMFS_REPOSITORIES="<repo1>,<repo2>"
export SINGCVMFS_IMAGE="docker://centos:7"
export SINGCVMFS_LOGLEVEL="debug"
./singcvmfs ls /cvmfs/<repo1>

The command fails with the following error message: cannot access /cvmfs/<repo1>: Transport endpoint is not connected.

In the logs (/var/log/cvmfs/<repo1>.log), I can see the following message: (<repo1>) could not acquire workspace lock (9) (9 - cache directory/plugin problem)
I have also tried using a dedicated cache image, as follows:

truncate -s 6G scratch.img
mkfs.ext3 -F -O ^has_journal scratch.img

export SINGCVMFS_CACHEIMAGE=scratch.img
./singcvmfs ls /cvmfs/<repo1>

In this case <repo1>.log showed a different message: cannot create workspace directory /var/lib/cvmfs/shared.
Setting SINGCVMFS_CACHEDIR to a value such as /test does not help either: the container does not start, and Singularity reports: FATAL: container creation failed: hook function for tag prelayer returns error: /test doesn't exist in image /path/to/scratch.img

Here are some details about the environment:

  • OS: CentOS Linux release 7.9.2009 (Core)
  • Singularity version: 3.8.1 (allow setuid = yes)
  • user.max_user_namespaces = 0

Have you encountered this issue before? Any pointers that could help?
Should you need any further details, please let me know.

Thanks

That was quite insightful of you to discover the SINGCVMFS_CACHEIMAGE option since it was added less than a month ago and I see it is not mentioned at all in the documentation. I think you’re the first one to try it besides me. I don’t think I tried it with setuid mode.

I see that the problem is that the root of the filesystem is owned by the root user and you’re not allowed to create a directory there. The singularity overlay create command gets around a similar problem by using the -d option to mkfs.ext3, which is unfortunately not yet supported in the el7 version. You can however compile it from source or run it on a newer machine. Then create a scratch directory containing a “shared” subdirectory and pass it to the mkfs.ext3 command:

$ mkdir -p tmp/shared
$ truncate -s 6G scratch.img
$ mkfs.ext3 -F -O ^has_journal -d tmp scratch.img

On a machine with unprivileged user namespaces this is not necessary.

I will update the singcvmfs documentation.

Dave

Thanks a lot for the details.
I followed your instructions: I created the scratch.img on a newer machine and copied it over to the cluster.
Unfortunately, I still get this error in /var/log/cvmfs/<repo1>.log (the same one I got when I was not using SINGCVMFS_CACHEIMAGE):

(<repo1>) could not acquire workspace lock (13) (9 - cache directory/plugin problem)

The debug mode does not provide more details. Any idea how I could debug this further?

Thanks

I am assuming the underlying problem is that the filesystem you’re using does not support locking. Where do you have the cvmfsexec files, given that there’s no /tmp? On a networked filesystem? If so, what type? Is there any local disk you could use?

Otherwise you could also put /var/log/cvmfs onto a bind-mounted ext3 filesystem. I can’t offhand think of a way to use the same .img file, but you could make another one called log.img with a writable “log” directory and set SINGULARITY_BINDPATH=log.img:/var/log/cvmfs:image-src=log. That results in a warning because singcvmfs tries to separately bind that directory, so if that’s what you want to do permanently, the singcvmfs script should probably be modified to explicitly support a different bind for the logs.
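
For example, something along these lines (an untested sketch following the same recipe as the cache image above; the 1G size and the tmplog directory name are arbitrary placeholders):

$ mkdir -p tmplog/log
$ truncate -s 1G log.img
$ mkfs.ext3 -F -O ^has_journal -d tmplog log.img
$ export SINGULARITY_BINDPATH=log.img:/var/log/cvmfs:image-src=log
$ ./singcvmfs ls /cvmfs/<repo1>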

Dave

Indeed, it seems like there is an issue with the shared file system.
I copied the singcvmfs executable and scratch.img to my local machine so that I could test on a local disk:

  • it turns out that I have the same issue on the local disk when I use SINGCVMFS_CACHEIMAGE=scratch.img, namely:

    cannot access /cvmfs/<repo1>: Transport endpoint is not connected.
    

    And in /var/log/cvmfs/<repo1>.log I have:

    (<repo1>) could not acquire workspace lock (9) (9 - cache directory/plugin problem)
    
  • If I remove SINGCVMFS_CACHEIMAGE, there is no issue; it works well.

I managed to get access to /tmp, a local disk, on the HPC cluster.

  • I have the same issue if I use SINGCVMFS_CACHEIMAGE
  • If I unset the variable, then I have a different issue:
    In /var/log/cvmfs/<repo1>.log I can find the following messages:
    Fri Nov 25 03:07:46 2022 (<repo1>) GeoAPI request http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/@proxy@/cernvmfs.gridpp.rl.ac.uk,cvmfs-s1.hpc.swin.edu.au,cvmfs-s1bnl.opensciencegrid.org,cvmfs-s1fnal.opensciencegrid.org,cvmfs-stratum-one.cern.ch,cvmfs-stratum-one.ihep.ac.cn,cvmfsrep.grid.sinica.edu.tw failed with error 15 [host serving data too slowly]
    Fri Nov 25 03:08:07 2022 (<repo1>) GeoAPI request http://cvmfs-s1.hpc.swin.edu.au:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/@proxy@/cernvmfs.gridpp.rl.ac.uk,cvmfs-s1.hpc.swin.edu.au,cvmfs-s1bnl.opensciencegrid.org,cvmfs-s1fnal.opensciencegrid.org,cvmfs-stratum-one.cern.ch,cvmfs-stratum-one.ihep.ac.cn,cvmfsrep.grid.sinica.edu.tw failed with error 15 [host serving data too slowly]
    Fri Nov 25 03:08:22 2022 (<repo1>) GeoAPI request http://cvmfs-s1bnl.opensciencegrid.org:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/@proxy@/cernvmfs.gridpp.rl.ac.uk,cvmfs-s1.hpc.swin.edu.au,cvmfs-s1bnl.opensciencegrid.org,cvmfs-s1fnal.opensciencegrid.org,cvmfs-stratum-one.cern.ch,cvmfs-stratum-one.ihep.ac.cn,cvmfsrep.grid.sinica.edu.tw failed with error 15 [host serving data too slowly]
    Fri Nov 25 03:08:22 2022 (<repo1>) failed to retrieve geographic order from stratum 1 servers
    

Hence, I have a few more questions:

  • Even if I don’t end up using this option, have you seen the same issue with the scratch.img approach yourself?
  • Is there a way to fix, or at least bypass, the host serving data too slowly error? A timeout option, for instance?

Thanks for your help

I’m sorry, I was misinterpreting the “could not acquire workspace lock” message. I was thinking it was complaining about the log directory, but it’s actually the cache directory. The only thing I can think of is that maybe the image was improperly unmounted? It works for me. You can try running e2fsck on scratch.img.
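
For example (the -f option just forces a full check even if the image is marked clean):

$ e2fsck -f scratch.img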

Regarding the GeoAPI request failures, can you do those requests with curl or wget? Replace @proxy@ with the name or IP address of a local machine.

So, regarding the problem with the cache directory: e2fsck did not report any errors, but I created a new scratch.img following your instructions and it worked this time. I guess this can be considered solved.

Regarding the GeoAPI request failures, I am not sure I understand how I should replace @proxy@. What do you mean by “the name or IP address of a local machine”? Do you have an example?

I’ve tried the following (wrong) request on my local machine and then on the cluster:

curl http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/test/cernvmfs.gridpp.rl.ac.uk

# on my local machine
> 1
# on the cluster
> curl: (7) Failed to connect to 2001:630:58:1800::82f6:b52b: Network is unreachable

You might think that the cluster has no external connectivity, but I can reach addresses such as https://www.google.com.

(Just to be clear, the main issue is still the same, namely: Transport endpoint is not connected.)

You can actually put any string in there; it only makes a difference for caching of the result. Usually cvmfs clients insert the name of their caching web proxy so that all users of the same proxy get the same result, but “test” should work too.
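
For example, if your site’s web proxy were called squid.example.com (a made-up name, just for illustration), the request would look something like:

curl http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/squid.example.com/cernvmfs.gridpp.rl.ac.uk,cvmfs-stratum-one.cern.ch,cvmfs-s1fnal.opensciencegrid.org

and the reply should be a comma-separated ordering of the listed stratum 1 servers.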

I would say there’s definitely a networking issue there; talk to your local system administrator. If you use the curl “-4” or “-6” options, you can force it to try IPv4 or IPv6 to isolate the problem. I can successfully use both from an IPv6-capable machine, and with the test URL you used I get the expected answer of “1”.
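
For example (same test URL as before, just forcing one protocol at a time):

curl -4 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/test/cernvmfs.gridpp.rl.ac.uk
curl -6 http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/lhcb.cern.ch/api/v1.0/geo/test/cernvmfs.gridpp.rl.ac.uk

If only the -6 request fails on the cluster, that points to broken IPv6 routing there.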

The “Transport endpoint is not connected” symptom can be caused by a lot of different failures that cause cvmfs to exit, so by itself it’s not very informative. The logs give you the details.