FTS (STAR-CSCS) Channel debugging
In August the channel worked fine for some interactive tries. I configured PhEDEx to use it and let it run a few days. The amount of errors was rather
large, so I switched back to srmcp after three days. Statistics and details about these tries (error messages according to sites) can be found here:
An intermittent error condition observed sometimes is the FTS error reason "Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind()". Usually this works with the next try.
Initial FTS channel settings
The initial settings of the channel were:
glite-transfer-channel-list -s $FTSEP/ChannelManagement STAR-CSCS
Channel: STAR-CSCS
Between: * and CSCS-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 10, streams: 10
Number of VO shares: 4
VO 'dteam' share is: 25
VO 'atlas' share is: 25
VO 'cms' share is: 25
VO 'lhcb' share is: 25
Test Cases
CSCS to CSCS copy
copyjob file contains:
srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/heartbeat/test.transfer
srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-local-copyjob.tst
This transfer cannot be done with
srmcp
because DPM cannot act as a SRM source (see
this link for terminology). This should probably also prevent FTS from working, if it
needs to go via the SRM protocol.
FTS commands return the following diagnostics:
glite-transfer-status -l -s $FTSEP/FileTransfer 16ca2477-4337-11db-a8b9-b90c0f5467b9
Active
Source: srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/heartbeat/test.transfer
Destination: srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-local-copyjob.tst
State: Waiting
Retries: 2
Reason: Getting filesize failed. a system call failed (Connection timed out)
Duration: 0
In the
/var/log/srmv1/log on our SE I see
...
09/13 16:49:21 1460,8 getRequestStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek
Feichtinger from fts-fzk.gridka.de
09/13 16:49:21 1460,8 getRequestStatus: SRM98 - getRequestStatus 594707
09/13 16:49:21 1460,8 getRequestStatus: returns 0
09/13 16:49:21 1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:49:21 1460,8 setFileStatus: SRM98 - setFileStatus 594707 0 Running
09/13 16:49:21 1460,8 setFileStatus: returns 0
09/13 16:49:21 1460,8 ping: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
from fts-fzk.gridka.de
09/13 16:49:21 1460,8 ping: returns 0
09/13 16:49:21 1460,8 put: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
from fts-fzk.gridka.de
09/13 16:49:21 1460,8 put: SRM98 - put 594711 594711
09/13 16:49:21 1460,8 put: SRM98 - put 0 srm://se01-lcg.projects.cscs.ch/dpm/projects.cscs.c
h/home/cms/local_tests/derek-local-copyjob.tst
09/13 16:49:22 1460,8 put: returns 0
09/13 16:49:22 1460,8 getRequestStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek
Feichtinger from fts-fzk.gridka.de
09/13 16:49:22 1460,8 getRequestStatus: SRM98 - getRequestStatus 594711
09/13 16:49:22 1460,8 getRequestStatus: returns 0
...
...
09/13 16:52:35 1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:52:35 1460,8 setFileStatus: SRM98 - setFileStatus 594707 0 Done
09/13 16:52:35 1460,8 setFileStatus: returns 0
09/13 16:52:35 1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:52:35 1460,8 setFileStatus: SRM98 - setFileStatus 594711 0 Done
09/13 16:52:35 1460,8 setFileStatus: dpm_putdone failed: No such file or directory
09/13 16:52:35 1460,8 setFileStatus: returns 12
...
(repeats many times)
FZK to CSCS copy
copyjob file contains:
srm://gridka-dCache.fzk.de:8443/srm/managerv1?SFN=/pnfs/gridka.de/cms/disk-only/heartbeat/test.transfer
srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-fts-FZK-CSCS.tst
Made sure that a copy using
srmcp -pushmode
is reproducibly successful.
Displays exactly same behavior as CSCS-CSCS copy. Same messages from
glite-transfer-status
and in
/var/log/srmv1/log
CSCS to CSCS copy with FTS number of streams changed to 1
This worked at once with several tries by hand. It seems that the number of parallel streams is a parameter with lots of impact.
A lot of thanks to Jiri Chuboda for his ALICE/ATLAS
SC4 pages for the Prague T2, which I found googling around.
So, FTS is somehow able to handle the fact that DPM can not act as a SRM source.
FZK to CSCS copy with FTS number of streams changed to 1
Also this transfer worked at once with the new setting for the number of parallel streams.
PhEDEx test with this setup
In transfers from
IN2P3 I see about 50% failiures with these error messages (leading number indicates number of failures with this specific error):
2 T1_IN2P3_Load: Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
1 T1_IN2P3_Load: Failed Transfer failed due to possible network problem - timed out.;
1 T1_IN2P3_Load: Failed Transfer failed. ERROR an end-of-file was reached;
11 T1_IN2P3_Load: Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
At least the errors seem to be now more uniform and not as in the first PhEDEx test last month.
As can be seen from the following Ganglia graphs, the throuput for the downloads is at a maximum of ca. 20 MB/s. There is a very high degree of symmetric packet exchange involved based on the
packets graph. It's a bit disturbing that we already see quite a bit of CPU_WAIT at this degree of load.
After 1 day (Sep 16):
SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 14 / FAILED: 5) success rate: 73.6842105263158%
site: T1_IN2P3_Load (OK: 311 / FAILED: 155) success rate: 66.7381974248927%
1 T1_IN2P3_Load: Failed Failed on SRM put: SRM getRequestStatus timed out on put;
4 T1_IN2P3_Load: Failed Transfer failed. ERROR an end-of-file was reached;
5 T1_IN2P3_Load: Failed Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind();
6 T1_IN2P3_Load: Failed Transfer failed due to possible network problem - timed out.;
7 T1_IN2P3_Load: Failed Failed on SRM put: SRM getRequestStatus timed out on put; also failing to do 'advisoryDelete' on target.;
28 T1_IN2P3_Load: Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
104 T1_IN2P3_Load: Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
5 T1_CERN_Load: Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
For comparison: PhEDEx via srmcp from 2006-08-21 08:54:09 to 2006-09-12 08:48:24
SITE STATISTICS:
==================
site: T1_RAL_Load (OK: 108 / FAILED: 50) success rate: 68.3544303797468%
site: T1_CERN_Load (OK: 75 / FAILED: 8) success rate: 90.3614457831325%
site: T1_FZK_Load (OK: 187 / FAILED: 53) success rate: 77.9166666666667%
site: T1_CNAF_Load (OK: / FAILED: 5) success rate: 0%
site: T1_IN2P3_Load (OK: 1877 / FAILED: 100) success rate: 94.9418310571573%
Ganglia Plots for the 1 day of FTS testing:
Sep 19: Something went wrong with the myproxy certificate (I was sure that I had prolonged it on Friday...). All transfers till this morning had failed because of it.
After correcting this we were able to get some new statistics, this time including transfers from FZK (2006-09-19 09:45:23 to 2006-09-20 10:09:48)
SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 36 / FAILED: 6) success rate: 85.7142857142857%
site: T1_FZK_Load (OK: 61 / FAILED: 152) success rate: 28.6384976525822%
Errors:
3 T1_CERN_Load: Canceled (null);
3 T1_CERN_Load: Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
1 T1_FZK_Load: Failed Failed on SRM get: Failed SRM get on httpg://gridka-dcache.fzk.de:8443/srm/managerv1 ; Error is;
1 T1_FZK_Load: Failed Failed on SRM get: SRM getRequestStatus timed out on get;
1 T1_FZK_Load: Failed Failed on SRM put: SRM getRequestStatus timed out on put;
1 T1_FZK_Load: Failed Transfer failed. ERROR an end-of-file was reached;
3 T1_FZK_Load: Failed Transfer failed due to possible network problem - timed out.;
6 T1_FZK_Load: Failed SRMGETDONE Operation Timed out.;
10 T1_FZK_Load: Failed Failed on SRM get: Failed To Get SURL. Error in srm__get: SOAP-ENV:Client - CGSI-gSOAP:
Error reading token data: Success;
12 T1_FZK_Load: Failed Failed on SRM put: SRM getRequestStatus timed out on put; also failing to do 'advisoryDelete' on target.;
13 T1_FZK_Load: Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
30 T1_FZK_Load: Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
Error reading token data: Success;
74 T1_FZK_Load: Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
Settings changed to concurrent downloads=2, streams=5
We got some input from Gavin McCance at the SC4 workshop in Munich. We will try some of his suggestions together with FZK in a debugging session tomorrow (Sep 21).
For now, I set the channel to files=2 (concurrentd downloads) and number of parrallel streams per file = 5. Just to test influence on error message frequencies (changed settings Sep 20, 14:45).
Statistics using these settings from 2006-09-20 13:52:16 to 2006-09-21 07:13:04:
SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 15 / FAILED: 2) success rate: 88.2352941176471%
site: T1_FZK_Load (OK: 25 / FAILED: 132) success rate: 15.9235668789809%
Errors:
1 T1_FZK_Load: Canceled Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
Error reading token data: Success;
1 T1_FZK_Load: Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
Error reading token data: Connection reset by peer;
1 T1_FZK_Load: Failed Failed on SRM get: SRM getRequestStatus timed out on get;
1 T1_FZK_Load: Failed SRMGETDONE Operation Timed out.;
1 T1_FZK_Load: Failed Transfer failed. ERROR the server sent an error response: 500 500 'rest 0-209285120': command not understood;
1 T1_FZK_Load: Waiting Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
2 T1_FZK_Load: Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !;
2 T1_FZK_Load: Failed Failed on SRM get: Failed SRM get on httpg://gridka-dcache.fzk.de:8443/srm/managerv1 ; Error is;
2 T1_FZK_Load: Failed Transfer failed. ERROR an end-of-file was reached;
3 T1_FZK_Load: Failed Operation was aborted (the gridFTP transfer timed out).;
4 T1_FZK_Load: Canceled Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
4 T1_FZK_Load: Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ; Error is File exists;
6 T1_FZK_Load: Canceled (null);
7 T1_FZK_Load: Failed Getting filesize failed. a system call failed (Connection refused);
17 T1_FZK_Load: Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success;
75 T1_FZK_Load: Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
--
DerekFeichtinger - 16 Sep 2006