Tags:
create new tag
view all tags

FTS (STAR-CSCS) Channel debugging

In August the channel worked fine for some interactive tries. I configured PhEDEx to use it and let it run a few days. The amount of errors was rather large, so I switched back to srmcp after three days. Statistics and details about these tries (error messages according to sites) can be found here:

An intermittent error condition observed sometimes is the FTS error reason "Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind()". Usually this works with the next try.

Initial FTS channel settings

The initial settings of the channel were:

glite-transfer-channel-list -s $FTSEP/ChannelManagement STAR-CSCS

Channel: STAR-CSCS
Between: * and CSCS-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 10, streams: 10
Number of VO shares: 4
VO 'dteam' share is: 25
VO 'atlas' share is: 25
VO 'cms' share is: 25
VO 'lhcb' share is: 25

Test Cases

CSCS to CSCS copy No

copyjob file contains:

srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/heartbeat/test.transfer
srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-local-copyjob.tst

This transfer cannot be done with srmcp because DPM cannot act as a SRM source (see this link for terminology). This should probably also prevent FTS from working, if it needs to go via the SRM protocol.

FTS commands return the following diagnostics:

glite-transfer-status -l -s $FTSEP/FileTransfer 16ca2477-4337-11db-a8b9-b90c0f5467b9

Active
  Source:      srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/heartbeat/test.transfer
  Destination: srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-local-copyjob.tst
  State:       Waiting
  Retries:     2
  Reason:       Getting filesize failed. a system call failed (Connection timed out)
  Duration:    0

In the /var/log/srmv1/log on our SE I see

...
09/13 16:49:21  1460,8 getRequestStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek
 Feichtinger from fts-fzk.gridka.de
09/13 16:49:21  1460,8 getRequestStatus: SRM98 - getRequestStatus 594707
09/13 16:49:21  1460,8 getRequestStatus: returns 0
09/13 16:49:21  1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:49:21  1460,8 setFileStatus: SRM98 - setFileStatus 594707 0 Running
09/13 16:49:21  1460,8 setFileStatus: returns 0
09/13 16:49:21  1460,8 ping: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
 from fts-fzk.gridka.de
09/13 16:49:21  1460,8 ping: returns 0
09/13 16:49:21  1460,8 put: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Feichtinger
from fts-fzk.gridka.de
09/13 16:49:21  1460,8 put: SRM98 - put 594711 594711
09/13 16:49:21  1460,8 put: SRM98 - put 0 srm://se01-lcg.projects.cscs.ch/dpm/projects.cscs.c
h/home/cms/local_tests/derek-local-copyjob.tst
09/13 16:49:22  1460,8 put: returns 0
09/13 16:49:22  1460,8 getRequestStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek
 Feichtinger from fts-fzk.gridka.de
09/13 16:49:22  1460,8 getRequestStatus: SRM98 - getRequestStatus 594711
09/13 16:49:22  1460,8 getRequestStatus: returns 0
...
...
09/13 16:52:35  1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:52:35  1460,8 setFileStatus: SRM98 - setFileStatus 594707 0 Done
09/13 16:52:35  1460,8 setFileStatus: returns 0
09/13 16:52:35  1460,8 setFileStatus: request by /O=GRID-FR/C=CH/O=CSCS/OU=CC-LCG/CN=Derek Fe
ichtinger from fts-fzk.gridka.de
09/13 16:52:35  1460,8 setFileStatus: SRM98 - setFileStatus 594711 0 Done
09/13 16:52:35  1460,8 setFileStatus: dpm_putdone failed: No such file or directory
09/13 16:52:35  1460,8 setFileStatus: returns 12
...
(repeats many times)

FZK to CSCS copy No

copyjob file contains:

srm://gridka-dCache.fzk.de:8443/srm/managerv1?SFN=/pnfs/gridka.de/cms/disk-only/heartbeat/test.transfer
srm://se01-lcg.projects.cscs.ch:8443/srm/managerv1?SFN=/dpm/projects.cscs.ch/home/cms/local_tests/derek-fts-FZK-CSCS.tst

Made sure that a copy using srmcp -pushmode is reproducibly successful.

Displays exactly same behavior as CSCS-CSCS copy. Same messages from glite-transfer-status and in /var/log/srmv1/log

CSCS to CSCS copy with FTS number of streams changed to 1 Yes / Done

This worked at once with several tries by hand. It seems that the number of parallel streams is a parameter with lots of impact.

A lot of thanks to Jiri Chuboda for his ALICE/ATLAS SC4 pages for the Prague T2, which I found googling around.

So, FTS is somehow able to handle the fact that DPM can not act as a SRM source.

FZK to CSCS copy with FTS number of streams changed to 1 Yes / Done

Also this transfer worked at once with the new setting for the number of parallel streams.

PhEDEx test with this setup

In transfers from IN2P3 I see about 50% failiures with these error messages (leading number indicates number of failures with this specific error):

      2 T1_IN2P3_Load:  Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ;  Error is File exists;
      1 T1_IN2P3_Load:  Failed Transfer failed due to possible network problem - timed out.;
      1 T1_IN2P3_Load:  Failed Transfer failed. ERROR an end-of-file was reached;
     11 T1_IN2P3_Load:  Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
At least the errors seem to be now more uniform and not as in the first PhEDEx test last month.

As can be seen from the following Ganglia graphs, the throuput for the downloads is at a maximum of ca. 20 MB/s. There is a very high degree of symmetric packet exchange involved based on the packets graph. It's a bit disturbing that we already see quite a bit of CPU_WAIT at this degree of load.


network-20060915.gif
packets-20060915.gif
cpu-20060915.gif

After 1 day (Sep 16):

SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 14 / FAILED: 5)   success rate: 73.6842105263158%
site: T1_IN2P3_Load (OK: 311 / FAILED: 155)   success rate: 66.7381974248927%

      1 T1_IN2P3_Load:  Failed Failed on SRM put: SRM getRequestStatus timed out on put;
      4 T1_IN2P3_Load:  Failed Transfer failed. ERROR an end-of-file was reached;
      5 T1_IN2P3_Load:  Failed Failed to get proxy certificate from myproxy-fts.cern.ch . Reason is Error in bind();
      6 T1_IN2P3_Load:  Failed Transfer failed due to possible network problem - timed out.;
      7 T1_IN2P3_Load:  Failed Failed on SRM put: SRM getRequestStatus timed out on put; also failing to do 'advisoryDelete' on target.;
     28 T1_IN2P3_Load:  Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ;  Error is File exists;
    104 T1_IN2P3_Load:  Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
      5 T1_CERN_Load:  Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;

For comparison: PhEDEx via srmcp from 2006-08-21 08:54:09 to 2006-09-12 08:48:24

SITE STATISTICS:
==================
site: T1_RAL_Load (OK: 108 / FAILED: 50)   success rate: 68.3544303797468%
site: T1_CERN_Load (OK: 75 / FAILED: 8)   success rate: 90.3614457831325%
site: T1_FZK_Load (OK: 187 / FAILED: 53)   success rate: 77.9166666666667%
site: T1_CNAF_Load (OK:  / FAILED: 5)   success rate: 0%
site: T1_IN2P3_Load (OK: 1877 / FAILED: 100)   success rate: 94.9418310571573%

Ganglia Plots for the 1 day of FTS testing:


cpu-20060916.gif
network-20060916.gif
packets-20060916.gif

Sep 19: Something went wrong with the myproxy certificate (I was sure that I had prolonged it on Friday...). All transfers till this morning had failed because of it. After correcting this we were able to get some new statistics, this time including transfers from FZK (2006-09-19 09:45:23 to 2006-09-20 10:09:48)

SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 36 / FAILED: 6)   success rate: 85.7142857142857%
site: T1_FZK_Load (OK: 61 / FAILED: 152)   success rate: 28.6384976525822%

Errors:
      3 T1_CERN_Load:  Canceled (null);
      3 T1_CERN_Load:  Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ;  Error is File exists;

      1 T1_FZK_Load:  Failed Failed on SRM get: Failed SRM get on httpg://gridka-dcache.fzk.de:8443/srm/managerv1 ;  Error is;
      1 T1_FZK_Load:  Failed Failed on SRM get: SRM getRequestStatus timed out on get;
      1 T1_FZK_Load:  Failed Failed on SRM put: SRM getRequestStatus timed out on put;
      1 T1_FZK_Load:  Failed Transfer failed. ERROR an end-of-file was reached;
      3 T1_FZK_Load:  Failed Transfer failed due to possible network problem - timed out.;
      6 T1_FZK_Load:  Failed SRMGETDONE Operation Timed out.;
     10 T1_FZK_Load:  Failed Failed on SRM get: Failed To Get SURL. Error in srm__get: SOAP-ENV:Client - CGSI-gSOAP: 
            Error reading token data: Success;
     12 T1_FZK_Load:  Failed Failed on SRM put: SRM getRequestStatus timed out on put; also failing to do 'advisoryDelete' on target.;
     13 T1_FZK_Load:  Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ;  Error is File exists;
     30 T1_FZK_Load:  Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
            Error reading token data: Success;
     74 T1_FZK_Load:  Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;

Settings changed to concurrent downloads=2, streams=5

We got some input from Gavin McCance at the SC4 workshop in Munich. We will try some of his suggestions together with FZK in a debugging session tomorrow (Sep 21). For now, I set the channel to files=2 (concurrentd downloads) and number of parrallel streams per file = 5. Just to test influence on error message frequencies (changed settings Sep 20, 14:45).

Statistics using these settings from 2006-09-20 13:52:16 to 2006-09-21 07:13:04:

SITE STATISTICS:
==================
site: T1_CERN_Load (OK: 15 / FAILED: 2)   success rate: 88.2352941176471%
site: T1_FZK_Load (OK: 25 / FAILED: 132)   success rate: 15.9235668789809%

Errors:
      1 T1_FZK_Load:  Canceled Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
            Error reading token data: Success;
      1 T1_FZK_Load:  Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP:
            Error reading token data: Connection reset by peer;
      1 T1_FZK_Load:  Failed Failed on SRM get: SRM getRequestStatus timed out on get;
      1 T1_FZK_Load:  Failed SRMGETDONE Operation Timed out.;
      1 T1_FZK_Load:  Failed Transfer failed. ERROR the server sent an error response: 500 500 'rest 0-209285120': command not understood;
      1 T1_FZK_Load:  Waiting Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
      2 T1_FZK_Load:  Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !;
      2 T1_FZK_Load:  Failed Failed on SRM get: Failed SRM get on httpg://gridka-dcache.fzk.de:8443/srm/managerv1 ;  Error is;
      2 T1_FZK_Load:  Failed Transfer failed. ERROR an end-of-file was reached;
      3 T1_FZK_Load:  Failed Operation was aborted (the gridFTP transfer timed out).;
      4 T1_FZK_Load:  Canceled Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;
      4 T1_FZK_Load:  Failed Failed on SRM put: Failed SRM put on httpg://se01-lcg.projects.cscs.ch:8443/srm/managerv1 ;  Error is File exists;
      6 T1_FZK_Load:  Canceled (null);
      7 T1_FZK_Load:  Failed Getting filesize failed. a system call failed (Connection refused);
     17 T1_FZK_Load:  Failed Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success;
     75 T1_FZK_Load:  Failed Transfer failed. ERROR the server sent an error response: 425 425 Can't open data connection. timed out() failed.;

-- DerekFeichtinger - 16 Sep 2006

Topic attachments
I Attachment History Action Size Date Who Comment
GIFgif cpu-20060915.gif r2 r1 manage 13.5 K 2006-09-15 - 15:23 DerekFeichtinger  
GIFgif cpu-20060916.gif r1 manage 15.1 K 2006-09-16 - 14:58 DerekFeichtinger  
GIFgif network-20060915.gif r2 r1 manage 12.1 K 2006-09-15 - 15:23 DerekFeichtinger  
GIFgif network-20060916.gif r1 manage 14.3 K 2006-09-16 - 14:58 DerekFeichtinger  
GIFgif packets-20060915.gif r2 r1 manage 11.2 K 2006-09-15 - 15:22 DerekFeichtinger  
GIFgif packets-20060916.gif r1 manage 14.3 K 2006-09-16 - 14:59 DerekFeichtinger  
Edit | Attach | Watch | Print version | History: r10 < r9 < r8 < r7 < r6 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r10 - 2007-02-08 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback