dCache Problem descriptions
Problem Description
SRMv1 read from the SE fails, even though globus-url-copy of the same file from the SE as well as from the pool node directly works fine.
Example command
srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:////tmp/derek-srm1-1
Show srmcp debug output Hide
[feichtinger@t3ui01 ~]$ srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file://t3ui01.psi.ch//tmp/derek-srm1-1
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
Storage Resource Manager (SRM) CP Client version 2.0
Copyright (c) 2002-2006 Fermi National Accelerator Laboratory
SRM Configuration:
debug=true
gsissl=true
help=false
pushmode=false
userproxy=true
buffer_size=131072
tcp_buffer_size=0
streams_num=10
config_file=config.xml
glue_mapfile=conf/SRMServerV1.map
webservice_path=srm/managerv1
webservice_protocol=https
gsiftpclinet=globus-url-copy
protocols_list=http,gsiftp
save_config_file=null
srmcphome=..
urlcopy=sbin/urlcopy.sh
x509_user_cert=/home/timur/k5-ca-proxy.pem
x509_user_key=/home/timur/k5-ca-proxy.pem
x509_user_proxy=/tmp/x509up_u3896
x509_user_trusted_certificates=/etc/grid-security/certificates
globus_tcp_port_range=null
gss_expected_name=null
storagetype=permanent
retry_num=20
retry_timeout=10000
wsdl_url=null
use_urlcopy_script=false
connect_to_wsdl=false
delegate=true
full_delegation=true
server_mode=passive
srm_protocol_version=1
request_lifetime=86400
access latency=null
overwrite mode=null
priority=0
from[0]=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
to=file://t3ui01.psi.ch//tmp/derek-srm1-1
Thu Aug 28 14:49:57 CEST 2008: starting SRMGetClient
Thu Aug 28 14:49:57 CEST 2008: In SRMClient ExpectedName: host
Thu Aug 28 14:49:57 CEST 2008: SRMClient(https,srm/managerv1,true)
SRMClientV1 : user credentials are: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dfeich/CN=613756/CN=Derek Feichtinger
SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport()
SRMClientV1 : connecting to srm at httpg://t3se01.psi.ch:8443/srm/managerv1
Thu Aug 28 14:49:58 CEST 2008: connected to server, obtaining proxy
Thu Aug 28 14:49:58 CEST 2008: got proxy of type class org.dcache.srm.client.SRMClientV1
SRMClientV1 : get: surls[0]="srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1"
SRMClientV1 : get: protocols[0]="gsiftp"
SRMClientV1 : get: protocols[1]="dcap"
SRMClientV1 : get: protocols[2]="http"
copy_jobs is empty
Thu Aug 28 14:49:59 CEST 2008: srm returned requestId = -2147470644
Thu Aug 28 14:49:59 CEST 2008: sleeping 4 seconds ...
Thu Aug 28 14:50:03 CEST 2008: FileRequestStatus with SURL=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 is Ready
Thu Aug 28 14:50:03 CEST 2008: received TURL=gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
copy_jobs is not empty
Thu Aug 28 14:50:03 CEST 2008: fileIDs is empty, breaking the loop
copying CopyJob, source = gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 destination = file://t3ui01.psi.ch//tmp/derek-srm1-1
GridftpClient: memory buffer size is set to 131072
GridftpClient: connecting to t3se01.psi.ch on port 2811
GridftpClient: gridFTPClient tcp buffer size is set to 0
GridftpClient: gridFTPRead started
GridftpClient: set local data channel authentication mode to None
GridftpClient: parallelism: 10
GridftpClient: waiting for completion of transfer
GridftpClient: starting a transfer from /pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
at java.lang.Thread.run(Thread.java:595)
GridftpClient: transfer exception
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
at java.lang.Thread.run(Thread.java:595)
GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@1b45e2d5
GridftpClient: closed client
copy failed with the error
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
at java.lang.Thread.run(Thread.java:595)
try again
sleeping for 10000 before retrying
Setup information
Setup:
- t3se01.psi.ch: The dcache head node running the SRM service and also a gridftp door.
- t3fs01.psi.ch: The pool node hosting the file and also running a gridftp door.
- t3ui01.psi.ch: The client machine invoking srmcp.
Observations
Pool cell pinboard
The best information can be obtained from the pool cell of the affected pool:
10.43.10 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.43.10 PoolIoFileMessage delivered to ioFile (method)
10.43.10 Dup Request : none
10.43.10 JOB queued 000200000000000000001778
10.43.10 JOB prepare 000200000000000000001778
10.43.10 Entry for 000200000000000000001778 found and is 000200000000000000001778 51200 si={cms:cms}
10.43.10 JOB run 000200000000000000001778
10.43.10 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.43.10 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=E, host=127.0.0.1:34210, buffer=0, passive=false, parallelism=10
10.43.10 (GFtp_2_nio/000200000000000000001778) Connecting to /127.0.0.1:34210
10.43.10 (GFtp_2_nio/000200000000000000001778) java.net.ConnectException: Connection refused
A copy directly from the pool that works, looks like this
globus-url-copy -dbg gsiftp://t3fs01.psi.ch/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:/tmp/mytest.tst
10.51.12 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.51.12 PoolIoFileMessage delivered to ioFile (method)
10.51.12 Dup Request : none
10.51.12 JOB queued 000200000000000000001778
10.51.12 JOB prepare 000200000000000000001778
10.51.12 Entry for 000200000000000000001778 found and is 000200000000000000001778 51200 si={cms:cms}
10.51.12 JOB run 000200000000000000001778
10.51.12 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.51.12 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=S, host=t3fs01:32916, buffer=0, passive=false, parallelism=5
10.51.12 (GFtp_2_nio/000200000000000000001778) Connecting to t3fs01/192.33.123.41:32916
10.51.12 (GFtp_2_nio/000200000000000000001778) Opened Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12 (GFtp_2_nio/000200000000000000001778) Entering event loop
10.51.12 (GFtp_2_nio/000200000000000000001778) Closing Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12 (GFtp_2_nio/000200000000000000001778) Multiplexer shutting down
10.51.12 (GFtp_2_nio/000200000000000000001778) Left event loop and closing channels
10.51.12 (GFtp_2_nio/000200000000000000001778) Transfer finished: 51200 bytes transferred in 0.09 seconds = 0.568 MB/s
10.51.12 000200000000000000001778;length=51200;timer=Null Time elapsed
10.51.13 IO thread finished : Thread[Scheduler-IO-1-worker,5,t3fs01_cms_1-threads]
In the SRM case, somehow the pool tries to connect to "/127.0.0.1:[port]" instead of to "t3fs01/192.33.123.41:[port]". The hostname is an empty string, it seems. Unclear what is used in the real connect. May also be that the listening socket is not for ALL addresses of the host (including localhost) but only for the named interface.
System call tracing of the pool process
To identify the issue, we attach a truss to the pool process on the pool node:
truss -v connect -f -p 2954 &> /tmp/truss
Looking through the output for both cases, we find that in the case of the SRM mediated copy, the connect indeed is tried to the localhost address, while in the direct gridftp case, the named interface was used:
# SRM mediated copy
2954/82: connect(25, 0xD10B0850, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
2954/82: AF_INET name = 127.0.0.1 port = 34235
# standalone gridftp copy
2954/82: connect(24, 0xD10B0850, 16, SOV_DEFAULT) = 0
2954/82: AF_INET name = 192.33.123.41 port = 32930
A reference installation on the Tier-2 shows that a Solaris-1 node also only listens to the named interface based on the netstat output:
se05# netstat
UDP: IPv4
Local Address Remote Address State
-------------------- -------------------- ----------
se05.32772 mon.8671 Connected
TCP: IPv4
Local Address Remote Address Swind Send-Q Rwind Recv-Q State
-------------------- -------------------- ----- ------ ----- ------ -----------
se05.41053 se09.60559 49640 0 49640 0 TIME_WAIT
se05.24411 wn08.52485 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52486 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52487 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52488 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52489 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52490 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52491 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52492 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52493 5840 0 49232 0 TIME_WAIT
se05.24411 wn08.52494 5840 0 49232 0 TIME_WAIT
se05.ssh ui.59621 9056 0 49232 0 ESTABLISHED
se05.65360 storage01.52323 131072 0 49640 0 ESTABLISHED
se05.65361 storage01.52323 131072 0 49640 0 ESTABLISHED
se05.65362 storage01.52323 130944 0 49640 0 ESTABLISHED
se05.65363 storage01.52323 131072 0 49640 0 ESTABLISHED
Solution
For some stupid reason I had on this client host put the public hostname on the 127.0.0.1 line, instead of the name of the public interface.
So, seemingly the SRM client sends its own resolution for its hostname to the SRM service, which then sends the wrongly resolved address (127.0.0.1) onwards to the pool node, which ends up connecting to itself instead of connecting to the opened data stream socket on the client node.
--
DerekFeichtinger - 28 Aug 2008