dCache Problem descriptions
SRM problem
Description: An SRMv1 read from the SE fails, even though a globus-url-copy of the same file works fine, both through the SE and directly from the pool node.
Other symptoms: The most useful information is found in the log of the pool cell on the affected pool node. For the failing SRM-mediated copy:
srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:////tmp/derek-srm1-1
10.43.10 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.43.10 PoolIoFileMessage delivered to ioFile (method)
10.43.10 Dup Request : none
10.43.10 JOB queued 000200000000000000001778
10.43.10 JOB prepare 000200000000000000001778
10.43.10 Entry for 000200000000000000001778 found and is 000200000000000000001778 51200 si={cms:cms}
10.43.10 JOB run 000200000000000000001778
10.43.10 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.43.10 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=E, host=127.0.0.1:34210, buffer=0, passive=false, parallelism=10
10.43.10 (GFtp_2_nio/000200000000000000001778) Connecting to /127.0.0.1:34210
10.43.10 (GFtp_2_nio/000200000000000000001778) java.net.ConnectException: Connection refused
A direct copy from the pool that works looks like this:
globus-url-copy -dbg gsiftp://t3fs01.psi.ch/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:/tmp/mytest.tst
10.51.12 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.51.12 PoolIoFileMessage delivered to ioFile (method)
10.51.12 Dup Request : none
10.51.12 JOB queued 000200000000000000001778
10.51.12 JOB prepare 000200000000000000001778
10.51.12 Entry for 000200000000000000001778 found and is 000200000000000000001778 51200 si={cms:cms}
10.51.12 JOB run 000200000000000000001778
10.51.12 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.51.12 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=S, host=t3fs01:32916, buffer=0, passive=false, parallelism=5
10.51.12 (GFtp_2_nio/000200000000000000001778) Connecting to t3fs01/192.33.123.41:32916
10.51.12 (GFtp_2_nio/000200000000000000001778) Opened Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12 (GFtp_2_nio/000200000000000000001778) Entering event loop
10.51.12 (GFtp_2_nio/000200000000000000001778) Closing Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12 (GFtp_2_nio/000200000000000000001778) Multiplexer shutting down
10.51.12 (GFtp_2_nio/000200000000000000001778) Left event loop and closing channels
10.51.12 (GFtp_2_nio/000200000000000000001778) Transfer finished: 51200 bytes transferred in 0.09 seconds = 0.568 MB/s
10.51.12 000200000000000000001778;length=51200;timer=Null Time elapsed
10.51.13 IO thread finished : Thread[Scheduler-IO-1-worker,5,t3fs01_cms_1-threads]
In the SRM case the pool tries to connect to "/127.0.0.1:[port]" instead of "t3fs01/192.33.123.41:[port]". The missing name before the slash suggests the hostname resolved to an empty string, and it is unclear which address is used in the actual connect. Another possibility is that the listening socket is bound not to all addresses of the host (including localhost) but only to the named interface.
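If the listening socket really is bound only to the named interface, a connect to 127.0.0.1 will be refused even though the same port accepts connections on 192.33.123.41. One way to spot such listeners is to filter the netstat listing for sockets not bound to the wildcard address. A minimal sketch; the lines below are an illustrative Linux-style sample, not output captured on the SE:

```shell
# Hypothetical Linux-style `netstat -an` lines (illustrative only). A
# wildcard listener shows 0.0.0.0:PORT and accepts connects on every
# address, including 127.0.0.1; an interface-bound listener shows the
# node's own IP and refuses localhost connects.
sample='tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 192.33.123.41:32916 0.0.0.0:* LISTEN'

# Print only listeners NOT bound to the wildcard address:
printf '%s\n' "$sample" | awk '$6 == "LISTEN" && $4 !~ /^0\.0\.0\.0:/ {print $4}'
```

On a real node one would pipe the live netstat output into the same filter instead of using the sample variable.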
To identify the issue, we attach truss to the pool process on the pool node:
truss -v connect -f -p 2954 &> /tmp/truss
Looking through the output for both cases confirms that the SRM-mediated copy indeed attempts the connect to the localhost address, while the direct GridFTP copy uses the named interface:
# SRM mediated copy
2954/82: connect(25, 0xD10B0850, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
2954/82: AF_INET name = 127.0.0.1 port = 34235
# standalone gridftp copy
2954/82: connect(24, 0xD10B0850, 16, SOV_DEFAULT) = 0
2954/82: AF_INET name = 192.33.123.41 port = 32930
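Since the truss capture interleaves records from many threads, it helps to pull out just the failing connect() calls together with the address detail line that follows each of them. A minimal sketch over the sample records above (stored in a variable here for illustration; on the real node one would read /tmp/truss instead):

```shell
# Sample two-line truss records, copied from the capture above.
trace='2954/82: connect(25, 0xD10B0850, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
2954/82:     AF_INET  name = 127.0.0.1  port = 34235
2954/82: connect(24, 0xD10B0850, 16, SOV_DEFAULT) = 0
2954/82:     AF_INET  name = 192.33.123.41  port = 32930'

# Show each failing connect() together with the address it targeted;
# -A1 prints the line after the match, which carries the AF_INET detail.
printf '%s\n' "$trace" | grep -A1 'ECONNREFUSED'
```

This immediately singles out 127.0.0.1 as the target of the refused connect, without scrolling through the full trace.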
For comparison, on a reference installation at the Tier-2, the netstat output of the Solaris-1 node suggests that it also listens only on the named interface:
se05# netstat
UDP: IPv4
Local Address Remote Address State
-------------------- -------------------- ----------
se05.32772 mon.8671 Connected
TCP: IPv4
Local Address Remote Address Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -----------
se05.41053 se09.60559 49640 0 49640 0 TIME_WAIT
se05.24411 wn08.52485 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52486 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52487 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52488 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52489 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52490 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52491 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52492 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52493 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52494 5840 0 49232 0 TIME_WAIT
se05.ssh ui.59621 9056 0 49232 0 ESTABLISHED
se05.65360 storage01.52323 131072 0 49640 0 ESTABLISHED
se05.65361 storage01.52323 131072 0 49640 0 ESTABLISHED
se05.65362 storage01.52323 130944 0 49640 0 ESTABLISHED
se05.65363 storage01.52323 131072 0 49640 0 ESTABLISHED
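Note that the listing above contains no LISTEN entries, because plain netstat shows only connected sockets. To confirm which address a server socket is actually bound to, one would run netstat with -a and filter for the LISTEN state. A sketch over illustrative Solaris-style lines (not taken from se05):

```shell
# Hypothetical Solaris-style `netstat -a` lines. In the local-address
# column, "*.port" means bound to all interfaces (localhost included),
# while "host.port" means bound to the named interface only.
listing='      *.22                 *.*                0      0  49152      0 LISTEN
se05.24411           *.*                0      0  49152      0 LISTEN'

# Interface-bound listeners would reproduce the connection-refused symptom:
printf '%s\n' "$listing" | awk '$NF == "LISTEN" && $1 !~ /^\*/ {print $1}'
```

A listener bound to the named interface only would refuse connects to 127.0.0.1, matching the behaviour seen in the SRM-mediated transfer.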
--
DerekFeichtinger - 28 Aug 2008