Symptoms
Summary: SRM server returns wrong TURL to client
Case 1
Observations
srmcp fails because the client receives a wrong TURL:
srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1 file:////tmp/dcachetest-20080828-2242-8051/test-srmcp
TEST: SRMv1-read
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
Storage Resource Manager (SRM) CP Client version 2.0
Copyright (c) 2002-2006 Fermi National Accelerator Laboratory
SRM Configuration:
debug=true
gsissl=true
help=false
pushmode=false
userproxy=true
buffer_size=131072
tcp_buffer_size=0
streams_num=10
config_file=config.xml
glue_mapfile=conf/SRMServerV1.map
webservice_path=srm/managerv1
webservice_protocol=https
gsiftpclinet=globus-url-copy
protocols_list=http,gsiftp
save_config_file=null
srmcphome=..
urlcopy=sbin/urlcopy.sh
x509_user_cert=/home/timur/k5-ca-proxy.pem
x509_user_key=/home/timur/k5-ca-proxy.pem
x509_user_proxy=/tmp/x509up_u3896
x509_user_trusted_certificates=/etc/grid-security/certificates
globus_tcp_port_range=null
gss_expected_name=null
storagetype=permanent
retry_num=20
retry_timeout=10000
wsdl_url=null
use_urlcopy_script=false
connect_to_wsdl=false
delegate=true
full_delegation=true
server_mode=passive
srm_protocol_version=1
request_lifetime=86400
access latency=null
overwrite mode=null
priority=0
from[0]=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1
to=file:////tmp/dcachetest-20080828-2242-8051/test-srmcp
Thu Aug 28 22:43:13 CEST 2008: starting SRMGetClient
Thu Aug 28 22:43:13 CEST 2008: In SRMClient ExpectedName: host
Thu Aug 28 22:43:13 CEST 2008: SRMClient(https,srm/managerv1,true)
SRMClientV1 : user credentials are: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dfeich/CN=613756/CN=Derek Feichtinger
SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport()
SRMClientV1 : connecting to srm at httpg://t3se01.psi.ch:8443/srm/managerv1
Thu Aug 28 22:43:14 CEST 2008: connected to server, obtaining proxy
Thu Aug 28 22:43:14 CEST 2008: got proxy of type class org.dcache.srm.client.SRMClientV1
SRMClientV1 : get: surls[0]="srm://t3se01.psi.ch:8443/srm/managerv1?SFN=//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1"
SRMClientV1 : get: protocols[0]="gsiftp"
SRMClientV1 : get: protocols[1]="dcap"
SRMClientV1 : get: protocols[2]="http"
copy_jobs is empty
Thu Aug 28 22:43:15 CEST 2008: srm returned requestId = -2147470564
Thu Aug 28 22:43:15 CEST 2008: sleeping 4 seconds ...
Thu Aug 28 22:43:19 CEST 2008: FileRequestStatus with SURL=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1 is Ready
Thu Aug 28 22:43:19 CEST 2008: received TURL=gsiftp://0.0.0.0:2811//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1
Thu Aug 28 22:43:19 CEST 2008: fileIDs is empty, breaking the loop
copy_jobs is not empty
copying CopyJob, source = gsiftp://0.0.0.0:2811//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1 destination = file:////tmp/dcachetest-20080828-2242-8051/test-srmcp
GridftpClient: memory buffer size is set to 131072
GridftpClient: connecting to 0.0.0.0 on port 2811
copy failed with the error
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:193)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:520)
at java.net.Socket.connect(Socket.java:470)
at java.net.Socket.<init>(Socket.java:367)
at java.net.Socket.<init>(Socket.java:267)
at org.globus.net.SocketFactory.createSocket(SocketFactory.java:74)
at org.globus.net.SocketFactory.createSocket(SocketFactory.java:53)
at org.globus.ftp.vanilla.FTPControlChannel.open(FTPControlChannel.java:135)
at org.globus.ftp.GridFTPClient.<init>(GridFTPClient.java:74)
at org.dcache.srm.util.GridftpClient$FnalGridFTPClient.<init>(GridftpClient.java:1080)
at org.dcache.srm.util.GridftpClient.<init>(GridftpClient.java:212)
at gov.fnal.srm.util.Copier.javaGridFtpCopy(Copier.java:595)
at gov.fnal.srm.util.Copier.copy(Copier.java:495)
at gov.fnal.srm.util.Copier.run(Copier.java:321)
at java.lang.Thread.run(Thread.java:595)
try again
sleeping for 10000 before retrying
Reason
The OpenSolaris pool node still had its network configured automatically by the nwam service (svc:/network/physical:nwam). e1000g0 had been configured correctly as the public interface by DHCP, but the other interfaces had been assigned 0.0.0.0 addresses. The dCache pool apparently registered with the head node using one of these other addresses, and the SRM server returned a TURL for a nonexistent gridftp door.
There should be a way of explicitly specifying the interface with which a pool registers, or the bloody thing should make a more intelligent guess as to the default (for example, use the interface it uses to contact the head node).
-bash-3.2# ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
e1000g0: flags=201004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,CoS> mtu 1500 index 2
inet 192.33.123.42 netmask ffffff00 broadcast 192.33.123.255
ether 0:14:4f:a6:d1:f0
e1000g1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
ether 0:14:4f:a6:d1:f1
e1000g2: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 4
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
ether 0:14:4f:a6:d1:f2
e1000g3: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 5
inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
ether 0:14:4f:a6:d1:f3
lo0: flags=2002000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv6,VIRTUAL> mtu 8252 index 1
inet6 ::1/128
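The "more intelligent guess" suggested above can be sketched in a few lines of Python. This is only an illustration of the idea, not how dCache selects its address: the function name source_address_for and the default port are assumptions made for this example. It asks the kernel which local address it would use to reach a given peer, for instance the head node.

```python
import socket


def source_address_for(peer_host: str, peer_port: int = 9) -> str:
    """Return the local IPv4 address the kernel would use to reach peer_host.

    Hypothetical helper for illustration; the port number is arbitrary.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel pick a route and therefore a local source address.
        s.connect((peer_host, peer_port))
        return s.getsockname()[0]


if __name__ == "__main__":
    # The loopback address is used here so the example runs anywhere.
    print(source_address_for("127.0.0.1"))
```

Run on the pool node with the head node's name as peer_host, this would be expected to report the address of the interface that actually carries traffic to the head node rather than one of the 0.0.0.0 interfaces.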
Solution
Remove the unwanted interfaces or, as in our case, aggregate them into one single interface.
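To spot such interfaces before they cause trouble, the output of ifconfig -a can be scanned for interfaces carrying a 0.0.0.0 address. The following is a minimal sketch; the function name unconfigured_interfaces is an assumption for this example, and the parsing only covers the output layout shown above.

```python
import re
from typing import List

# Shortened sample in the layout of the ifconfig -a output above.
SAMPLE = """\
e1000g0: flags=201004843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,IPv4,CoS> mtu 1500 index 2
        inet 192.33.123.42 netmask ffffff00 broadcast 192.33.123.255
e1000g1: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 3
        inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
"""


def unconfigured_interfaces(ifconfig_output: str) -> List[str]:
    """Return the names of interfaces whose IPv4 address is 0.0.0.0."""
    flagged = []
    current = None
    for line in ifconfig_output.splitlines():
        header = re.match(r"^(\S+): flags=", line)
        if header:
            # A new interface stanza starts here.
            current = header.group(1)
            continue
        # "inet6" lines do not match because \s+ must follow "inet".
        addr = re.match(r"^\s*inet\s+(\S+)", line)
        if addr and current and addr.group(1) == "0.0.0.0":
            flagged.append(current)
    return flagged


if __name__ == "__main__":
    print(unconfigured_interfaces(SAMPLE))
```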
Case 2
Observations
The client receives a TURL that contains only a short local host name instead of a fully qualified one. In the lcg-cp output below, the TURL is shown as the Destination URL:
lcg-cp --connect-timeout 10 --sendreceive-timeout 120 --srm-timeout 180 -b --vo ops -D srmv2 -U srmv2 -v file:/home/samops/.same/SRMv2/testFile.txt 'srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/ops/testfile-cp-20091119-134821-b532e7e7c782.txt'
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: ops
Checksum type: None
Destination SE type: SRMv2
Destination SRM Request Token: -2146844058
Source URL: file:/home/samops/.same/SRMv2/testFile.txt
File size: 41472
Source URL for copy: file:/home/samops/.same/SRMv2/testFile.txt
Destination URL: gsiftp://t3fs05:2811//pnfs/psi.ch/ops/testfile-cp-20091119-134821-b532e7e7c782.txt
# streams: 1
file:/home/samops/.same/SRMv2/testFile.txt: globus_xio: Unable to connect to t3fs05:2811
globus_xio: globus_libc_getaddrinfo failed.
globus_common: Name or service not known
lcg_cp: Communication error on send
+ retcode=1
+ set +x
Thu Nov 19 12:48:28 UTC 2009 [1258634908]
Reason
The hostname on the Solaris file server was taken from the DHCP-configured /etc/hosts file, which only contained the short local name.
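A quick way to find such entries is to scan /etc/hosts for lines that list no fully qualified name. The sketch below is illustrative only: the function name short_name_only_entries is made up for this example, and the addresses in the sample text are placeholders from the documentation range, not the real addresses of these hosts.

```python
from typing import List

# Placeholder addresses; only the host names come from this page.
SAMPLE_HOSTS = """\
127.0.0.1 localhost
192.0.2.10 t3fs05   # short name only
192.0.2.11 t3se01.psi.ch t3se01
"""


def short_name_only_entries(hosts_text: str) -> List[str]:
    """Return hosts-file entries whose names contain no dot (no FQDN)."""
    flagged = []
    for raw in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        # Loopback entries legitimately carry short names.
        if address.startswith("127.") or address == "::1":
            continue
        if names and not any("." in name for name in names):
            flagged.append(line)
    return flagged


if __name__ == "__main__":
    print(short_name_only_entries(SAMPLE_HOSTS))
```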
Solution
Either delete the line in /etc/hosts, put the FQDN there, or reconfigure DHCP to correctly deliver the full name.
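Both cases on this page come down to a TURL whose host part is unusable for a remote client. A small check like the following could be run against TURLs in test output to catch this early. It is a sketch under the assumption that an unspecified address (0.0.0.0) or a host name without a dot is always an error; the function name turl_host_problem and the returned messages are invented for this example.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def turl_host_problem(turl: str) -> Optional[str]:
    """Return a short description of the problem with the TURL's host,
    or None if the host looks usable for a remote client."""
    host = urlsplit(turl).hostname
    if not host:
        return "no host in TURL"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, so treat it as a host name
    if ip is not None:
        # Case 1: the door was advertised with 0.0.0.0.
        return "unspecified address" if ip.is_unspecified else None
    if "." not in host:
        # Case 2: short local name such as t3fs05.
        return "host name is not fully qualified"
    return None


if __name__ == "__main__":
    for turl in (
        "gsiftp://0.0.0.0:2811//pnfs/psi.ch/cms/automatic_test-20080828-2242-8051-srm1",
        "gsiftp://t3fs05:2811//pnfs/psi.ch/ops/testfile-cp-20091119-134821-b532e7e7c782.txt",
    ):
        print(turl, "->", turl_host_problem(turl))
```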
--
DerekFeichtinger - 29 Aug 2008