Tags:
tag this topic
create new tag
view all tags
<!-- keep this as a security measure: #uncomment if the subject should only be modifiable by the listed groups # * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup # * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup #uncomment this if you want the page only be viewable by the listed groups # * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup --> ---+ !!%TOPIC% %TOC% ---++ Problem Description SRMv1 read from the SE fails, even though globus-url-copy of the same file from the SE as well as from the pool node directly works fine. ---++ Reason The client host wrongly resolves its own IP address. ---++ Example command <pre> srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:////tmp/derek-srm1-1 </pre> %TWISTY{showlink="Show srmcp debug output" hidelink="Hide" showimgleft="%ICONURLPATH{toggleopen-small}%" hideimgleft="%ICONURLPATH{toggleclose-small}%"}%<verbatim> [feichtinger@t3ui01 ~]$ srmcp -debug srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file://t3ui01.psi.ch//tmp/derek-srm1-1 WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed WARNING: SRM_PATH=/opt/d-cache/srm Storage Resource Manager (SRM) CP Client version 2.0 Copyright (c) 2002-2006 Fermi National Accelerator Laboratory SRM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=131072 tcp_buffer_size=0 streams_num=10 config_file=config.xml glue_mapfile=conf/SRMServerV1.map webservice_path=srm/managerv1 webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=.. urlcopy=sbin/urlcopy.sh x509_user_cert=/home/timur/k5-ca-proxy.pem x509_user_key=/home/timur/k5-ca-proxy.pem x509_user_proxy=/tmp/x509up_u3896 x509_user_trusted_certificates=/etc/grid-security/certificates globus_tcp_port_range=null gss_expected_name=null storagetype=permanent retry_num=20 retry_timeout=10000 wsdl_url=null use_urlcopy_script=false connect_to_wsdl=false delegate=true full_delegation=true server_mode=passive srm_protocol_version=1 request_lifetime=86400 access latency=null overwrite mode=null priority=0 from[0]=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 to=file://t3ui01.psi.ch//tmp/derek-srm1-1 Thu Aug 28 14:49:57 CEST 2008: starting SRMGetClient Thu Aug 28 14:49:57 CEST 2008: In SRMClient ExpectedName: host Thu Aug 28 14:49:57 CEST 2008: SRMClient(https,srm/managerv1,true) SRMClientV1 : user credentials are: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=dfeich/CN=613756/CN=Derek Feichtinger SRMClientV1 : SRMClientV1 calling org.globus.axis.util.Util.registerTransport() SRMClientV1 : connecting to srm at httpg://t3se01.psi.ch:8443/srm/managerv1 Thu Aug 28 14:49:58 CEST 2008: connected to server, obtaining proxy Thu Aug 28 14:49:58 CEST 2008: got proxy of type class org.dcache.srm.client.SRMClientV1 SRMClientV1 : get: surls[0]="srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1" SRMClientV1 : get: protocols[0]="gsiftp" SRMClientV1 : get: protocols[1]="dcap" SRMClientV1 : get: protocols[2]="http" copy_jobs is empty Thu Aug 28 14:49:59 CEST 2008: srm returned requestId = -2147470644 Thu Aug 28 14:49:59 CEST 2008: sleeping 4 seconds ... Thu Aug 28 14:50:03 CEST 2008: FileRequestStatus with SURL=srm://t3se01.psi.ch:8443/srm/managerv1?SFN=/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 is Ready Thu Aug 28 14:50:03 CEST 2008: received TURL=gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 copy_jobs is not empty Thu Aug 28 14:50:03 CEST 2008: fileIDs is empty, breaking the loop copying CopyJob, source = gsiftp://t3se01.psi.ch:2811//pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 destination = file://t3ui01.psi.ch//tmp/derek-srm1-1 GridftpClient: memory buffer size is set to 131072 GridftpClient: connecting to t3se01.psi.ch on port 2811 GridftpClient: gridFTPClient tcp buffer size is set to 0 GridftpClient: gridFTPRead started GridftpClient: set local data channel authentication mode to None GridftpClient: parallelism: 10 GridftpClient: waiting for completion of transfer GridftpClient: starting a transfer from /pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused) at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182) at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457) at org.globus.ftp.FTPClient.get2(FTPClient.java:1724) at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927) at java.lang.Thread.run(Thread.java:595) GridftpClient: transfer exception org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused) at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182) at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457) at org.globus.ftp.FTPClient.get2(FTPClient.java:1724) at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927) at java.lang.Thread.run(Thread.java:595) GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@1b45e2d5 GridftpClient: closed client copy failed with the error org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted (Unexpected Exception : java.net.ConnectException: Connection refused) at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182) at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1457) at org.globus.ftp.FTPClient.get2(FTPClient.java:1724) at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:927) at java.lang.Thread.run(Thread.java:595) try again sleeping for 10000 before retrying </verbatim> %ENDTWISTY% ---++ Setup information *Setup*: * t3se01.psi.ch: The dcache head node running the SRM service and also a gridftp door. * t3fs01.psi.ch: The pool node hosting the file and also running a gridftp door. * t3ui01.psi.ch: The client machine invoking srmcp. ---++ Observations ---+++ Pool cell pinboard The best information can be obtained from the pool cell of the affected pool: <pre> 10.43.10 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false 10.43.10 PoolIoFileMessage delivered to ioFile (method) 10.43.10 Dup Request : none <GFTP-t3fs01-Unknown-105@gridftp-t3fs01Domain:10002> 10.43.10 JOB queued 000200000000000000001778 10.43.10 JOB prepare 000200000000000000001778 10.43.10 Entry for 000200000000000000001778 found and is 000200000000000000001778 <C-------X--(0)[0]> 51200 si={cms:cms} 10.43.10 JOB run 000200000000000000001778 10.43.10 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778 10.43.10 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=E, host=127.0.0.1:34210, buffer=0, passive=false, parallelism=10 10.43.10 (GFtp_2_nio/000200000000000000001778) Connecting to /127.0.0.1:34210 10.43.10 (GFtp_2_nio/000200000000000000001778) java.net.ConnectException: Connection refused </pre> A copy directly from the pool that works, looks like this <pre> globus-url-copy -dbg gsiftp://t3fs01.psi.ch/pnfs/psi.ch/cms/automatic_test-20080823-0015-2165-srm1 file:/tmp/mytest.tst 10.51.12 Sending reply Pool=t3fs01_cms_1;NOTAGS;;PnfsId=000200000000000000001778;have=true;waiting=false 10.51.12 PoolIoFileMessage delivered to ioFile (method) 10.51.12 Dup Request : none <GFTP-t3fs01-Unknown-106@gridftp-t3fs01Domain:10003> 10.51.12 JOB queued 000200000000000000001778 10.51.12 JOB prepare 000200000000000000001778 10.51.12 Entry for 000200000000000000001778 found and is 000200000000000000001778 <C-------X--(0)[0]> 51200 si={cms:cms} 10.51.12 JOB run 000200000000000000001778 10.51.12 Trying to open /data1/t3fs01_cms_1/data/000200000000000000001778 10.51.12 (GFtp_2_nio/000200000000000000001778) version=1, role=Sender, mode=S, host=t3fs01:32916, buffer=0, passive=false, parallelism=5 10.51.12 (GFtp_2_nio/000200000000000000001778) Connecting to t3fs01/192.33.123.41:32916 10.51.12 (GFtp_2_nio/000200000000000000001778) Opened Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917] 10.51.12 (GFtp_2_nio/000200000000000000001778) Entering event loop 10.51.12 (GFtp_2_nio/000200000000000000001778) Closing Socket[addr=t3fs01/192.33.123.41,port=32916,localport=32917] 10.51.12 (GFtp_2_nio/000200000000000000001778) Multiplexer shutting down 10.51.12 (GFtp_2_nio/000200000000000000001778) Left event loop and closing channels 10.51.12 (GFtp_2_nio/000200000000000000001778) Transfer finished: 51200 bytes transferred in 0.09 seconds = 0.568 MB/s 10.51.12 000200000000000000001778;length=51200;timer=Null Time elapsed 10.51.13 IO thread finished : Thread[Scheduler-IO-1-worker,5,t3fs01_cms_1-threads] </pre> In the SRM case, somehow the pool tries to connect to "/127.0.0.1:[port]" instead of to "t3fs01/192.33.123.41:[port]". The hostname is an empty string, it seems. Unclear what is used in the real connect. May also be that the listening socket is not for ALL addresses of the host (including localhost) but only for the named interface. ---+++ System call tracing of the pool process To identify the issue, we attach a truss to the pool process on the pool node: <pre> truss -v connect -f -p 2954 &> /tmp/truss </pre> Looking through the output for both cases, we find that in the case of the SRM mediated copy, the connect indeed is tried to the localhost address, while in the direct gridftp case, the named interface was used: <pre> # SRM mediated copy 2954/82: connect(25, 0xD10B0850, 16, SOV_DEFAULT) Err#146 ECONNREFUSED 2954/82: AF_INET name = 127.0.0.1 port = 34235 # standalone gridftp copy 2954/82: connect(24, 0xD10B0850, 16, SOV_DEFAULT) = 0 2954/82: AF_INET name = 192.33.123.41 port = 32930 </pre> ---++ Solution For some stupid reason I had on this client host (t3ui01) put the public hostname on the 127.0.0.1 line, instead of the name of the public interface. So, the client resolved its own public address as 127.0.0.1. The failing connection in the SRM mediated case should have gone from the pool to the client, not to the pool itself. For some weird reason, the SRM client sends its own resolution for its hostname to the SRM service, which then sends this wrongly resolved address (127.0.0.1) onwards to the pool node, which ends up connecting to itself instead of connecting to the opened data stream socket on the client node. -- Main.DerekFeichtinger - 28 Aug 2008
IssueForm
Affected Service
dcache SRM read
Symptom summary
SRM client receives no data channel gridftp connection
Reason Understood
yes
Solution Exists
yes
Obsolete
no
E
dit
|
A
ttach
|
Watch
|
P
rint version
|
H
istory
: r4
<
r3
<
r2
<
r1
|
B
acklinks
|
V
iew topic
|
Ra
w
edit
|
M
ore topic actions
Topic revision: r4 - 2008-08-29
-
DerekFeichtinger
CmsTier3
Log In
CmsTier3 Web
Create New Topic
Index
Search
Changes
Notifications
Statistics
Preferences
User Pages
Main Page
Policies
Monitoring Storage Space
Monitoring Slurm Usage
Physics Groups
Steering Board Meetings
Admin Pages
AdminArea
Cluster Specs
Home
Site map
CmsTier3 web
LCGTier2 web
PhaseC web
Main web
Sandbox web
TWiki web
CmsTier3 Web
Create New Topic
Index
Search
Changes
Notifications
RSS Feed
Statistics
Preferences
P
View
Raw View
Print version
Find backlinks
History
More topic actions
Edit
Raw edit
Attach file or image
Edit topic preference settings
Set new parent
More topic actions
Account
Log In
E
dit
A
ttach
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback