Tags:
create new tag
view all tags

IssueDcacheSrmRead

Problem Description

SRMv1 read from the SE fails, even though globus-url-copy of the same file from the SE as well as from the pool node directly works fine.

Reason

The client host wrongly resolves its own IP address.

Example command

srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:////tmp/derek-srm1-1 

Show srmcp debug output Hide

[feichtinger@t3ui01 ~]$ srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file://t3ui01.psi.ch//tmp/derek-srm1-1
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
Storage Resource Manager (SRM) CP Client version 2.0
Copyright (c) 2002-2006 Fermi National Accelerator Laboratory

SRM Configuration:
        debug=true
        gsissl=true
        help=false
        pushmode=false
        userproxy=true
        buffer_size=131072
        tcp_buffer_size=0
        streams_num=10
        config_file=config.xml
        glue_mapfile=conf/SRMServerV1.map
        webservice_path=srm/managerv1
        webservice_protocol=https
        gsiftpclinet=globus-url-copy
        protocols_list=http,gsiftp
        save_config_file=null
        srmcphome=..
        urlcopy=sbin/urlcopy.sh
        x509_user_cert=/home/timur/k5-ca-proxy.pem
        x509_user_key=/home/timur/k5-ca-proxy.pem
        x509_user_proxy=/tmp/x509up_u3896
        x509_user_trusted_certificates=/etc/grid-security/certificates
        globus_tcp_port_range=null
        gss_expected_name=null
        storagetype=permanent
        retry_num=20
        retry_timeout=10000
        wsdl_url=null
        use_urlcopy_script=false
        connect_to_wsdl=false
        delegate=true
        full_delegation=true
        server_mode=passive
        srm_protocol_version=1
        request_lifetime=86400
        access latency=null
        overwrite mode=null
        priority=0
        from[0]=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
        to=file://t3ui01.psi.ch//tmp/derek-srm1-1

Thu Aug 28 14:49:57 CEST 2008: starting SRMGetClient
Thu Aug 28 14:49:57 CEST 2008: In SRMClient ExpectedName: host
Thu Aug 28 14:49:57 CEST 2008: SRMClient(https,srm/managerv1,true)
SRMClientV1 : user credentials are: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dfeich/CN=613756/CN=Derek Feichtinger
SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport()
SRMClientV1 : connecting to srm at httpg://t3se01.psi.ch:8443/srm/managerv1
Thu Aug 28 14:49:58 CEST 2008: connected to server, obtaining proxy
Thu Aug 28 14:49:58 CEST 2008: got proxy of type class org.dcache.srm.client.SRMClientV1
SRMClientV1 :   get: surls[0]="srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1"
SRMClientV1 :   get: protocols[0]="gsiftp"
SRMClientV1 :   get: protocols[1]="dcap"
SRMClientV1 :   get: protocols[2]="http"
copy_jobs is empty
Thu Aug 28 14:49:59 CEST 2008:  srm returned requestId = -2147470644
Thu Aug 28 14:49:59 CEST 2008: sleeping 4 seconds ...
Thu Aug 28 14:50:03 CEST 2008: FileRequestStatus with SURL=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 is Ready
Thu Aug 28 14:50:03 CEST 2008:        received TURL=gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
copy_jobs is not empty
Thu Aug 28 14:50:03 CEST 2008: fileIDs is empty, breaking the loop
copying CopyJob, source = gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 destination = file://t3ui01.psi.ch//tmp/derek-srm1-1
GridftpClient: memory buffer size is set to 131072
GridftpClient: connecting to t3se01.psi.ch on port 2811
GridftpClient: gridFTPClient tcp buffer size is set to 0
GridftpClient: gridFTPRead started
GridftpClient: set local data channel authentication mode to None
GridftpClient: parallelism: 10
GridftpClient: waiting for completion of transfer
GridftpClient: starting a transfer from /pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
        at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
        at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
        at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
        at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
        at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
        at java.lang.Thread.run(Thread.java:595)
GridftpClient:  transfer exception
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
        at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
        at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
        at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
        at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
        at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
        at java.lang.Thread.run(Thread.java:595)
GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@1b45e2d5
GridftpClient: closed client
copy failed with the error
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)
        at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182)
        at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109)
        at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457)
        at org.globus.ftp.FTPClient.get2(FTPClient.java:1724)
        at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927)
        at java.lang.Thread.run(Thread.java:595)
 try again
sleeping for 10000 before retrying

Setup information

Setup:

  • t3se01.psi.ch: The dcache head node running the SRM service and also a gridftp door.
  • t3fs01.psi.ch: The pool node hosting the file and also running a gridftp door.
  • t3ui01.psi.ch: The client machine invoking srmcp.

Observations

Pool cell pinboard

The best information can be obtained from the pool cell of the affected pool:

10.43.10  Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.43.10  PoolIoFileMessage delivered to ioFile (method)
10.43.10  Dup Request : none 
10.43.10  JOB queued 000200000000000000001778
10.43.10  JOB prepare 000200000000000000001778
10.43.10  Entry for 000200000000000000001778 found and is 000200000000000000001778  51200 si={cms:cms}
10.43.10  JOB run 000200000000000000001778
10.43.10  Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.43.10  (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=E, host=127.0.0.1:34210, buffer=0, passive=false, parallelism=10
10.43.10  (GFtp_2_nio/000200000000000000001778) Connecting to /127.0.0.1:34210
10.43.10  (GFtp_2_nio/000200000000000000001778) java.net.ConnectException: Connection refused

A copy directly from the pool that works, looks like this

globus-url-copy -dbg gsiftp://t3fs01.psi.ch/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:/tmp/mytest.tst

10.51.12  Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false
10.51.12  PoolIoFileMessage delivered to ioFile (method)
10.51.12  Dup Request : none 
10.51.12  JOB queued 000200000000000000001778
10.51.12  JOB prepare 000200000000000000001778
10.51.12  Entry for 000200000000000000001778 found and is 000200000000000000001778  51200 si={cms:cms}
10.51.12  JOB run 000200000000000000001778
10.51.12  Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778
10.51.12  (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=S, host=t3fs01:32916, buffer=0, passive=false, parallelism=5
10.51.12  (GFtp_2_nio/000200000000000000001778) Connecting to t3fs01/192.33.123.41:32916
10.51.12  (GFtp_2_nio/000200000000000000001778) Opened Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12  (GFtp_2_nio/000200000000000000001778) Entering event loop
10.51.12  (GFtp_2_nio/000200000000000000001778) Closing Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917]
10.51.12  (GFtp_2_nio/000200000000000000001778) Multiplexer shutting down
10.51.12  (GFtp_2_nio/000200000000000000001778) Left event loop and closing channels
10.51.12  (GFtp_2_nio/000200000000000000001778) Transfer finished: 51200 bytes transferred in 0.09 seconds = 0.568 MB/s
10.51.12  000200000000000000001778;length=51200;timer=Null Time elapsed
10.51.13  IO thread finished : Thread[Scheduler-IO-1-worker,5,t3fs01_cms_1-threads]

In the SRM case, somehow the pool tries to connect to "/127.0.0.1:[port]" instead of to "t3fs01/192.33.123.41:[port]". The hostname is an empty string, it seems. Unclear what is used in the real connect. May also be that the listening socket is not for ALL addresses of the host (including localhost) but only for the named interface.

System call tracing of the pool process

To identify the issue, we attach a truss to the pool process on the pool node:

truss -v connect -f -p 2954 &> /tmp/truss

Looking through the output for both cases, we find that in the case of the SRM mediated copy, the connect indeed is tried to the localhost address, while in the direct gridftp case, the named interface was used:

# SRM mediated copy
2954/82:        connect(25, 0xD10B0850, 16, SOV_DEFAULT)        Err#146 ECONNREFUSED
2954/82:                AF_INET  name = 127.0.0.1  port = 34235

# standalone gridftp copy
2954/82:        connect(24, 0xD10B0850, 16, SOV_DEFAULT)        = 0
2954/82:                AF_INET  name = 192.33.123.41  port = 32930

Solution

For some stupid reason I had on this client host (t3ui01) put the public hostname on the 127.0.0.1 line, instead of the name of the public interface. So, the client resolved its own public address as 127.0.0.1. The failing connection in the SRM mediated case should have gone from the pool to the client, not to the pool itself. For some weird reason, the SRM client sends its own resolution for its hostname to the SRM service, which then sends this wrongly resolved address (127.0.0.1) onwards to the pool node, which ends up connecting to itself instead of connecting to the opened data stream socket on the client node.

-- DerekFeichtinger - 28 Aug 2008

IssueForm
Affected Service dcache SRM read
Symptom summary SRM client receives no data channel gridftp connection
Reason Understood yes
Solution Exists yes
Obsolete no
Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r4 - 2008-08-29 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback