Symptoms
Summary: User cannot retrieve certain files which clearly exist on our fileservers and also are correctly listed in the pnfs namespace
Occurrences
At what times did this problem occur (used to estimate frequency):
Observations
NOTE: We are running dcache golden release 1.9.5-16 with pnfs-postgresql-3.1.10-7 and postgresql-8.3.3
srmcp and dccp failures
Failure of an srmcp points to underlying reason
srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root file:////tmp/derek-srmcp1
...
Fri Jun 17 12:08:47 CEST 2011: srmPrepareToGet update failed, status : SRM_FAILURE explanation= at Fri Jun 17 12:08:47 CEST 2011 state Failed : at Fri Jun 17 12:08:42 CEST 2011 state Pending : created
RequestFileStatus#-2141693797 failed with error:[ at Fri Jun 17 12:08:43 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)]
Fri Jun 17 12:08:47 CEST 2011: GetFileRequest[srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root] status=SRM_INVALID_PATH explanation=Failed
Fri Jun 17 12:08:47 CEST 2011: java.io.IOException: srmPrepareToGet update failed, status : SRM_FAILURE explanation= at Fri Jun 17 12:08:47 CEST 2011 state Failed : at Fri Jun 17 12:08:42 CEST 2011 state Pending : created
RequestFileStatus#-2141693797 failed with error:[ at Fri Jun 17 12:08:43 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)]
Failure of dccp contains less useful output
dccp -d 15 -p 20000:25000 dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root /tmp/derek-dccp
...
Sending control message: 2 0 client open "dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root" r t3ui03.psi.ch 20000 -timeout=-1 -onerror=default -uid=3896
Server error message for [2]: "No such file or directory" (err code: 2, errno: (null)).
dcap_pool: POLLIN on control line [3] id=1
Polling data for destination[4] queueID[2].
[-1] unpluging node
Removing unneeded queue [2]
[-1] destroing node
Failed open file in the dCache.
Can't open source file : "No such file or directory"
System error: Input/output error
Files have missing OSM info. Storageinfoof command fails
Listing of the storageinfo in the adminshell fails.
echo /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root|dc_get_ID_from_pnfsnamelist.sh |dc_get_storageinfo_from_IDlist.sh
0002000000000000027647F0 storageinfoof failed : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)
Note, that we can get other info on the file and that we can also confirm that the local file on the pool exists!
echo /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root|dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh
0002000000000000027647F0 t3fs09_cms_1
Missing pnfs tags for directories
All files in the directory
/pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6
I tested, failed to copy with the identical error. It turns out that the directory seems to have a corrupted tag for the OSMTemplate value. All other tags which should have been inherited also cannot be read:
[root@t3se01 store]# dir=/pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(OSMTemplate): No such file or directory
.(tag)(OSMTemplate)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(sGroup): No such file or directory
.(tag)(sGroup)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(AccessLatency): No such file or directory
.(tag)(AccessLatency)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(RetentionPolicy): No such file or directory
.(tag)(RetentionPolicy)=
Testing the directory above which is ok shows the correct values:
dir=/pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
.(tag)(OSMTemplate)=StoreName cms
.(tag)(sGroup)=cms
.(tag)(AccessLatency)=ONLINE
.(tag)(RetentionPolicy)=OUTPUT
I ran a find over the complete namespace to locate all corrupted directories
for dir in $(find . -type d); do info=$(ls -ld $dir); echo -n "$info "; cat $dir/'.(tag)(OSMTemplate)'; done &> ~/OSMTemplate-corrupt-dirs-20110617-lst
Out of 13802 directories, the following 40 were found with erroneous OSMTemplate tag:
grep -v StoreName ~/OSMTemplate-corrupt-dirs-20110617-lst | sed -e 's#cat: \(.*\)/\..*#\1#' | xargs -i ls -ld {}
drwxrwxr-x 1 cmsuser cms 512 Apr 14 16:04 ./user/pablom/TnP/ElectronsNew/data/mt2/JOB_PD_EG_PromptReco_Run2010B_TnP_ELE_JZB_ANTIPANIC_ANTIMULTITRG_DATA
drwxrwxr-x 1 cmsuser cms 512 Apr 14 15:51 ./user/pablom/TnP/ElectronsNew/data/mt2/JOB_PD_EG_17thSepReReco_Run2010A_TnP_ELE_JZB_ANTIPANIC_ANTIMULTITRG_DATA
drwxrwxr-x 1 cmsuser cms 512 Dec 16 2010 ./user/bortigno/WJets/FASTSIM_bug
drwxrwxr-x 1 cmsuser cms 512 Apr 21 20:22 ./user/susy/trash-me
drwxrwxr-x 1 cmsuser cms 512 Apr 15 15:02 ./user/susy/trash-me/QCD_TuneD6T_HT-500To1000_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:46 ./user/susy/trash-me/GJets_TuneD6T_HT-200_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:45 ./user/susy/trash-me/TToBLNu_TuneZ2_tW-channel_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:44 ./user/susy/trash-me/GJets_TuneD6T_HT-100To200_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:42 ./user/susy/trash-me/DYJetsToLL_TuneZ2_M-50_7TeV-madgraph-tauola
drwxrwxr-x 1 cmsuser cms 512 Apr 19 07:26 ./user/susy/trash-me/VVJetsTo4L_TuneD6T_7TeV-madgraph-tauola
drwxrwxr-x 1 cmsuser cms 512 Apr 15 15:19 ./user/susy/trash-me/LM12_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM13_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 19 07:24 ./user/susy/trash-me/TToBLNu_TuneZ2_s-channel_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:37 ./user/susy/trash-me/GJets_TuneD6T_HT-40To100_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM3_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 20 00:36 ./user/susy/trash-me/TToBLNu_TuneZ2_t-channel_7TeV-madgraph
drwxrwxr-x 1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM4_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 Mar 22 16:45 ./user/susy/ntuples/mc/V01-11-02/GMSBLongLivedNeutralino_CTau-2000_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 Mar 19 17:46 ./user/susy/ntuples/mc/V01-11-02/WWtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x 1 cmsuser cms 512 Mar 19 23:51 ./user/susy/ntuples/mc/V01-11-02/WZtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x 1 cmsuser cms 512 Mar 20 00:30 ./user/susy/ntuples/mc/V01-11-02/ZZtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x 1 cmsuser cms 512 Mar 19 19:44 ./user/susy/ntuples/mc/V01-11-02/WWTo2L2Nu_TuneZ2_7TeV-pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 12:07 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_5to15_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 20:13 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_120to170_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 12:06 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 15:13 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_470to600_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 02:17 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_30to50_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 15:27 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_80to120_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 23:45 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_600to800_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 2 02:57 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_800to1000_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 05:10 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_50to80_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 01:18 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1400to1800_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 30 22:42 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1000to1400_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 30 23:12 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_170to300_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 30 15:28 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_15to30_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 May 1 03:32 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1800_TuneZ2_7TeV_pythia6
drwxrwxr-x 1 cmsuser cms 512 Apr 30 13:39 ./user/susy/ntuples/data/V02-01-01/DoubleMu-Run2011A-PromptReco-v2-AOD
drwxrwxr-x 1 cmsuser cms 512 Apr 30 15:38 ./user/susy/ntuples/data/V02-01-01/MuEG-Run2011A-PromptReco-v1-AOD
drwxrwxr-x 1 cmsuser cms 512 Apr 30 15:15 ./user/susy/ntuples/data/V02-01-01/MuEG-Run2011A-PromptReco-v2-AOD
drwxrwxr-x 1 cmsuser cms 512 Apr 30 14:20 ./user/susy/ntuples/data/V02-01-01/DoubleMu-Run2011A-PromptReco-v1-AOD
Solution or Workaround
We got a fast reply from Patrick Fuhrmann through dcache support. He proposed the following way to fix the problem, which turned out to work fine with the one file I tested with. Great!
- please create a ./user/susy/trash-me-new
cd /pnfs/psi.ch/cms/trivcat/store/user/susy
mkdir trash-me-ne
- make sure the new directory resides on the same database as the old (first 4 digits of the pnfs id must be identical. This is trivially true for our installation here at PSI, since we only use one DB for all CMS)
cat ".(id)(trash-me)"
00020000000000000244AAA0
cat ".(id)(trash-me-new)"
000200000000000002F47FE0
- if not already the case, adjust the tags in trash-me-new so that they are identical to trash-me and adjust the permissions of trash-me-new.
- try to move (mv) a file from trash-me to trash-me-new.
- confirming first that I indeed fail to copy the old file
srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me/NTupleProducer_41X_MC_39_1_rMW.root file:////tmp/derek-srmcp1
...
RequestFileStatus#-2141685188 failed with error:[ at Fri Jun 17 15:50:41 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000244AAA0)(type=--I--d-----)]
Fri Jun 17 15:50:45 CEST 2011: stopping copier
going to stop....
Exception in thread "main" java.lang.Exception: stopped
at gov.fnal.srm.util.Copier.run(Copier.java:244)
at java.lang.Thread.run(Thread.java:636)
Fri Jun 17 15:50:45 CEST 2011: stopping copier
- moving the file in pnfs
mv trash-me/NTupleProducer_41X_MC_39_1_rMW.root trash-me-new/
- trying to read the storageinfo of the moved file from the admin shell
echo /pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me-new/NTupleProducer_41X_MC_39_1_rMW.root | dc_get_ID_from_pnfsnamelist.sh | dc_get_storageinfo_from_IDlist.sh
00020000000000000244AB88 size=83155077;new=false;stored=false;sClass=cms:cms;cClass=-;hsm=osm;accessLatency=ONLINE;retentionPolicy=OUTPUT;flag-h=no;flag-rp=OUTPUT;flag-c=1:c8c5f6eb;StoreName=cms;flag-al=ONLINE;flag-l=83155077;store=cms;group=cms;bfid=;
- try to read the file by srmcp
srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me-new/NTupleProducer_41X_MC_39_1_rMW.root file:////tmp/derek-srmcp1
WORKS!!!!!
Now we need to do this in an automated way for all the problematic files/directories.
Repair script
You can find an example repair script which also does a fair amount of sanity checks here:
repair-dcache-dir-tags-by-copying.sh
Monitoring for this condition
As described above, you can run a find over the complete namespace and try to read out the tags. The damaged directories will throw an error.
for dir in $(find . -type d); do info=$(ls -ld $dir); echo -n "$info "; cat $dir/'.(tag)(OSMTemplate)'; done
--
DerekFeichtinger - 2011-06-17