
04. 09. 2008 PhEDEx problem with exports of a dataset to FZK

For several days, the dataset /QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639/GEN-SIM-DIGI-RECO could not be copied from CSCS, DESY and RWTH to FZK.

Original mail from Armin Scheurer (translated from German):

Hi all,

I found your e-mail addresses in SiteDB listed as PhEDEx contacts. I hope that is all still up to date.

For several days there seem to have been problems with transfers to FZK, mainly of one particular dataset, and from all sites. The dataset is routed from DESY, CSCS and RWTH to FZK and then either expires or dies with an "agent lost the transfer". Unfortunately there are no logs on the PhEDEx page for the expired transfers, so I would like to ask you to look through your agent logs for these transfers. Maybe there is a hint there as to what is going wrong.

We have been trying to copy the dataset for days now. It is only 24 files, but nothing happens. Other transfers from your sites, however, went through quite well, although some of them also aborted with "agent lost the transfer".

The dataset is the following:

/QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639/GEN-SIM-DIGI-RECO

It was "accidentally" deleted at all T1s during a central CERN deletion campaign. Unfortunately it is used by the SAM/JobRobot tests.

In any case, thank you in advance for your help.



Collecting information


find  dataset where dataset like %QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639%

Then use the plain link to get the list of files.



Are the files ok on dCache?

The Trivial File Catalog rule for our dCache simply prepends /pnfs/lcg.cscs.ch/cms/trivcat to the logical file name, so mapping the file list to local filenames is a one-liner:

# prepend the TFC prefix to every logical file name in files.lst
while read -r n; do echo "/pnfs/lcg.cscs.ch/cms/trivcat$n"; done < files.lst > pnfs.lst

Using our DcacheShellutils, I can see that the files look ok (will add this part later).
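
Before reaching for dCache-specific tools, a first sanity pass over pnfs.lst can be done with plain shell. The sketch below assumes that the pnfs namespace is mounted on the node where it runs. The function name check_pfns is made up for illustration; it is not part of PhEDEx or DcacheShellutils. It only tells whether each path exists as a regular file in the namespace, not whether the replica in the pool is healthy.

```shell
#!/bin/sh
# check_pfns LIST: print every path in LIST that is not a regular file,
# then a one-line summary. Returns non-zero if anything is missing.
# (Hypothetical helper, not part of PhEDEx or DcacheShellutils.)
check_pfns() {
    list=$1
    total=0
    missing=0
    while IFS= read -r pfn; do
        [ -n "$pfn" ] || continue          # skip empty lines
        total=$((total + 1))
        if [ ! -f "$pfn" ]; then
            echo "MISSING $pfn"
            missing=$((missing + 1))
        fi
    done < "$list"
    echo "checked $total, missing $missing"
    [ "$missing" -eq 0 ]
}
```

It would be called as check_pfns pnfs.lst; the exit status can then be used in a script to stop early if files are missing from the namespace.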

Using the PhEDEx error query tools

Let's look at all transfer errors saved in the central DB for transfers originating from CSCS in the last 48 hours (all of them turn out to be transfers to FZK):

/home/phedex/PHEDEX/Utilities/ErrorSiteQuery --db /home/phedex/config/DBParam.CSCS:Prod/CSCS  --src "%CSCS%" -m 1000 -s "-48 hours"
2008-09-04 09:46:07: ErrorSiteQuery[923]: (re)connecting to database
2008-09-04 09:46:09: ErrorSiteQuery[923]: disconnected from database
Results starting from date 1220348767  Tue Sep  2 11:46:07 2008
Number of results: 100 (of max 1000)

**** from T2_CH_CSCS to T1_DE_FZK_Buffer:
    100   agent lost the transfer

Now we look at how many centers suffered from this error mode in the last 48 hours to get a more general picture:

 /home/phedex/PHEDEX/Utilities/ErrorQuery --db ~/config/DBParam.CSCS:Prod/CSCS -s "-48 hours" -e "%agent lost the transfer%" -x -m 1000 --sort dst
2008-09-04 10:07:13: ErrorQuery[2063]: (re)connecting to database
2008-09-04 10:07:19: ErrorQuery[2063]: disconnected from database
#Number of results: 448 (of max 1000. Primary search retrieved 448)
#count  src                  dst                  backend    stech  dtech  fts                      channel      nfiles
    17  T1_US_FNAL_Buffer    T1_CH_CERN_Buffer    n.a.       11     castor n.a.                     n.a.         n.a.
     1  T1_FR_CCIN2P3_Buffer T1_DE_FZK_Buffer     n.a.       pnfs   pnfs   n.a.                     n.a.         n.a.
   100  T2_CH_CSCS           T1_DE_FZK_Buffer     n.a.       pnfs   pnfs   n.a.                     n.a.         n.a.
   100  T2_DE_DESY           T1_DE_FZK_Buffer     n.a.       pnfs   pnfs   n.a.                     n.a.         n.a.
     1  T2_CN_Beijing        T1_DE_FZK_Buffer     n.a.       pnfs   pnfs   n.a.                     n.a.         n.a.
     2  T1_US_FNAL_Buffer    T1_IT_CNAF_Buffer    n.a.       11     castor n.a.                     n.a.         n.a.
    10  T1_CH_CERN_Buffer    T1_US_FNAL_Buffer    n.a.       castor 11     n.a.                     n.a.         n.a.
    27  T2_DE_RWTH           T1_US_FNAL_Buffer    n.a.       pnfs   11     n.a.                     n.a.         n.a.
     2  T0_CH_CERN_Export    T1_US_FNAL_Buffer    n.a.       castor 11     n.a.                     n.a.         n.a.
    25  T1_US_FNAL_Buffer    T2_CH_CAF            n.a.       11     castor n.a.                     n.a.         n.a.
    59  T1_US_FNAL_Buffer    T2_DE_DESY           n.a.       11     pnfs   n.a.                     n.a.         n.a.
     4  T1_CH_CERN_Buffer    T2_US_Nebraska       n.a.       castor pnfs   n.a.                     n.a.         n.a.
   100  T1_DE_FZK_Buffer     T2_US_Nebraska       n.a.       pnfs   pnfs   n.a.                     n.a.         n.a.

The "n.a." entries in many columns arise because no FTS log information was found in the central transfer logs, so the FTS status was not queried in these failure modes. Alternatively, it could mean that FZK is using an old PhEDEx version which does not store this information correctly, but this is not the case, as can be seen from the PhEDEx components page with the right "show options".
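
Summing the ErrorQuery table per destination makes the pattern easier to see. The following is a minimal sketch with awk, assuming the column layout shown above (count in the first field, destination in the third) and that the ErrorQuery output was saved to a file. The function name sum_by_dst and the file name errors.txt are illustrative only.

```shell
#!/bin/sh
# sum_by_dst FILE: add up the "count" column per destination node.
# Skips blank lines, '#' header lines and ErrorQuery log lines, because
# their first field is not a plain integer.
sum_by_dst() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 3 { total[$3] += $1 }
         END { for (d in total) printf "%6d  %s\n", total[d], d }' "$1" |
        sort -rn
}
```

Called as sum_by_dst errors.txt on the table above, this puts T1_DE_FZK_Buffer at the top with 202 errors, followed by T2_US_Nebraska with 104, so FZK is the destination most affected by "agent lost the transfer" in this time window.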


Based on what we get from the Error Query Tools (missing FTS information), it may be a problem with the FTS channels or the FTS server. A person with rights to the relevant channels should investigate whether the service is functioning correctly.

There is also an interesting HyperNews discussion on these errors.

-- DerekFeichtinger - 04 Sep 2008


Topic revision: r6 - 2009-09-17 - DerekFeichtinger