04. 09. 2008 PhEDEx problem with exports of a dataset to FZK
Summary:
For several days, transfers of the dataset /QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639/GEN-SIM-DIGI-RECO from CSCS, DESY and RWTH to FZK failed.
Original Mail from Armin Scheurer:
Hi all,
I found your email addresses in SiteDB, listed as PhEDEx contacts.
I hope this is all still up to date.
For a few days there seem to have been problems with transfers to FZK,
mainly of one particular dataset, and from all sites. The dataset is
routed from DESY, CSCS and RWTH to FZK and then either expires or dies
with an "agent lost the transfer". For the expired transfers there are
unfortunately no logs on the PhEDEx page, so I would like to ask you to
look through your agent logs for these transfers. Perhaps there is a
hint there as to what is going wrong.
We have been trying to copy the dataset for days now. It is only 24
files, but nothing happens. Other transfers from your side, however,
actually went through quite well, although some of those also ended with
"agent lost the transfer" aborts.
The dataset is the following:
/QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639/GEN-SIM-DIGI-RECO
It was "accidentally" deleted at all T1s during a central CERN deletion
campaign. Unfortunately, it is used by the SAM/JobRobot tests.
In any case, thank you in advance for your help.
Regards,
Armin
Collecting information
DBS
find dataset where dataset like %QCD_Pt_80_120/CMSSW_1_6_7-JobRobot-1201523639%
Then use the plain link to get the list of files.
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/027ED940-C8CD-DC11-A65C-000423D94D68.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/1872ED1D-C8CD-DC11-ACC1-000423D999AA.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/2A16A572-6DCF-DC11-BBB9-001617C3B6EC.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/2A1BB053-6DCF-DC11-A0CF-000423D655A2.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/3ED30F5F-6DCF-DC11-A28C-001617C3B5F6.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/4264615F-6DCF-DC11-83EC-000423D98658.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/44566191-CBCD-DC11-95A1-001617C3B71A.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/4ADE1A5C-C8CD-DC11-B683-001617C3B708.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/526B2C6A-C8CD-DC11-8E2A-000423D6A77C.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/5AD6F488-C8CD-DC11-8F65-000423D64922.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/703EFD48-C8CD-DC11-9E29-000423DCF0D8.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/7441D9A6-C8CD-DC11-839D-000423D6B1CC.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/864C305F-6DCF-DC11-B854-000423D30AF2.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/887A7DCE-C8CD-DC11-A7F5-003048563216.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/924E1A4B-C7CD-DC11-9289-000423D94E48.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/A8AF3E13-C8CD-DC11-A639-000423D6B1CC.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/AA0D6424-C8CD-DC11-B049-001617DBCF46.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/AA25E1AC-C8CD-DC11-9C0F-000423D998E6.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/C6F319C6-C8CD-DC11-967A-000423D94D68.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/CC5ABBA8-C8CD-DC11-9311-000423DCF0D8.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/D2DDC75E-C8CD-DC11-A9E2-001617DBCF94.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/D80D8D5E-C8CD-DC11-B558-000423D986B0.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/E8C4470A-C8CD-DC11-BA7C-001617E30D4C.root
/store/mc/2008/1/28/JobRobot-QCD_Pt_80_120-1201523639/0034/EAC224A4-18CF-DC11-9445-0030487CF434.root
Are the files ok on dCache?
The Trivial File Catalog rule for our dCache is just a prefix of /pnfs/lcg.cscs.ch/cms/trivcat, so mapping this to local filenames is just:
for n in `cat files.lst`; do echo /pnfs/lcg.cscs.ch/cms/trivcat$n; done > pnfs.lst
Using our DcacheShellutils, I can see that the files look ok (will add this part later).
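As a rough stand-in until that part is added, the prefix mapping above can be wrapped in two small shell functions. This is only a sketch: the function names `lfn_to_pfn` and `report_missing` and the `PNFS_PREFIX` variable are made up for illustration, it assumes the PNFS namespace is mounted on the host where it runs, and it checks only that a namespace entry exists, not that the file is healthy on a dCache pool. It does not use the DcacheShellutils tools.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PhEDEx or DcacheShellutils).
# PNFS_PREFIX is the TFC prefix quoted above; override it for other sites.
PNFS_PREFIX="${PNFS_PREFIX:-/pnfs/lcg.cscs.ch/cms/trivcat}"

lfn_to_pfn() {
    # Read logical file names on stdin, one per line, and prepend the prefix.
    # Blank lines are skipped.
    while IFS= read -r lfn; do
        [ -n "$lfn" ] && printf '%s%s\n' "$PNFS_PREFIX" "$lfn"
    done
    return 0
}

report_missing() {
    # Read physical file names on stdin and print those that have no entry
    # under the mounted namespace.
    while IFS= read -r pfn; do
        [ -e "$pfn" ] || printf 'MISSING %s\n' "$pfn"
    done
    return 0
}
```

With the file list from DBS saved as files.lst, `lfn_to_pfn < files.lst > pnfs.lst` followed by `report_missing < pnfs.lst` would print nothing if all 24 entries are present in the namespace.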
Using the PhEDEx error query tools
Let's look at all transfer errors saved in the central DB for transfers from CSCS in the last 48 hours; all of them turn out to be transfers to FZK:
/home/phedex/PHEDEX/Utilities/ErrorSiteQuery --db /home/phedex/config/DBParam.CSCS:Prod/CSCS --src "%CSCS%" -m 1000 -s "-48 hours"
2008-09-04 09:46:07: ErrorSiteQuery[923]: (re)connecting to database
2008-09-04 09:46:09: ErrorSiteQuery[923]: disconnected from database
Results starting from date 1220348767 Tue Sep 2 11:46:07 2008
Number of results: 100 (of max 1000)
**** from T2_CH_CSCS to T1_DE_FZK_Buffer:
100 agent lost the transfer
Now we look at how many centers suffered from this error mode in the last 48 hours to get a more general picture:
/home/phedex/PHEDEX/Utilities/ErrorQuery --db ~/config/DBParam.CSCS:Prod/CSCS -s "-48 hours" -e "%agent lost the transfer%" -x -m 1000 --sort dst
2008-09-04 10:07:13: ErrorQuery[2063]: (re)connecting to database
2008-09-04 10:07:19: ErrorQuery[2063]: disconnected from database
#Number of results: 448 (of max 1000. Primary search retrieved 448)
#
#count  src                   dst                backend  stech   dtech   fts   channel  nfiles
17      T1_US_FNAL_Buffer     T1_CH_CERN_Buffer  n.a.     11      castor  n.a.  n.a.     n.a.
1       T1_FR_CCIN2P3_Buffer  T1_DE_FZK_Buffer   n.a.     pnfs    pnfs    n.a.  n.a.     n.a.
100     T2_CH_CSCS            T1_DE_FZK_Buffer   n.a.     pnfs    pnfs    n.a.  n.a.     n.a.
100     T2_DE_DESY            T1_DE_FZK_Buffer   n.a.     pnfs    pnfs    n.a.  n.a.     n.a.
1       T2_CN_Beijing         T1_DE_FZK_Buffer   n.a.     pnfs    pnfs    n.a.  n.a.     n.a.
2       T1_US_FNAL_Buffer     T1_IT_CNAF_Buffer  n.a.     11      castor  n.a.  n.a.     n.a.
10      T1_CH_CERN_Buffer     T1_US_FNAL_Buffer  n.a.     castor  11      n.a.  n.a.     n.a.
27      T2_DE_RWTH            T1_US_FNAL_Buffer  n.a.     pnfs    11      n.a.  n.a.     n.a.
2       T0_CH_CERN_Export     T1_US_FNAL_Buffer  n.a.     castor  11      n.a.  n.a.     n.a.
25      T1_US_FNAL_Buffer     T2_CH_CAF          n.a.     11      castor  n.a.  n.a.     n.a.
59      T1_US_FNAL_Buffer     T2_DE_DESY         n.a.     11      pnfs    n.a.  n.a.     n.a.
4       T1_CH_CERN_Buffer     T2_US_Nebraska     n.a.     castor  pnfs    n.a.  n.a.     n.a.
100     T1_DE_FZK_Buffer      T2_US_Nebraska     n.a.     pnfs    pnfs    n.a.  n.a.     n.a.
The "n.a." in many columns means that no FTS log information was found in the central transfer logs, so the FTS status was not queried in these failure modes. It could also mean that FZK is using an old PhEDEx version that does not store this information correctly, but this is not the case, as can be seen from the PhEDEx components page with the right "show options".
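To see at a glance which destinations are hit hardest, the ErrorQuery output above can be summed per destination. The following is a minimal sketch, assuming the column order shown in the header line (count, src, dst, ...); the function name `sum_by_dst` and the file name errors.txt are made up for illustration and are not part of the PhEDEx utilities.

```shell
#!/bin/sh
# Hypothetical helper: sum the "count" column of ErrorQuery output
# per destination node (third column).
sum_by_dst() {
    awk '
        /^#/ { next }                     # skip header and comment lines
        $1 ~ /^[0-9]+$/ && NF >= 3 {      # data rows start with an integer count
            total[$3] += $1
        }
        END { for (dst in total) print total[dst], dst }
    ' "$@" | sort -k1,1nr -k2,2           # largest totals first
}
```

With the output saved to errors.txt, `sum_by_dst errors.txt` lists one total per destination. For the table above this attributes 202 of the 448 errors to T1_DE_FZK_Buffer as destination and 104 to T2_US_Nebraska. The timestamped log lines are ignored because their first field is not a plain integer.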
Conclusions
Based on what we get from the Error Query Tools (missing FTS information), it may be a problem with the FTS channels or the FTS server. A person with rights to the relevant channels should investigate whether the service is functioning correctly.
Here is also an interesting HyperNews discussion on these errors.
--
DerekFeichtinger - 04 Sep 2008