Tags:
create new tag
view all tags

IssueDcacheCorruptOSM

Symptoms

Summary: User cannot retrieve certain files which clearly exist on our fileservers and also are correctly listed in the pnfs namespace

Occurrences

At what times did this problem occur (used to estimate frequency):

2011-06-16

Observations

NOTE: We are running dcache golden release 1.9.5-16 with pnfs-postgresql-3.1.10-7 and postgresql-8.3.3

srmcp and dccp failures

Failure of an srmcp points to underlying reason
srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root file:////tmp/derek-srmcp1

...
Fri Jun 17 12:08:47 CEST 2011: srmPrepareToGet update failed, status : SRM_FAILURE explanation= at Fri Jun 17 12:08:47 CEST 2011 state Failed :  at Fri Jun 17 12:08:42 CEST 2011 state Pending : created
RequestFileStatus#-2141693797 failed with error:[  at Fri Jun 17 12:08:43 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)]

Fri Jun 17 12:08:47 CEST 2011: GetFileRequest[srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root] status=SRM_INVALID_PATH explanation=Failed
Fri Jun 17 12:08:47 CEST 2011: java.io.IOException: srmPrepareToGet update failed, status : SRM_FAILURE explanation= at Fri Jun 17 12:08:47 CEST 2011 state Failed :  at Fri Jun 17 12:08:42 CEST 2011 state Pending : created
RequestFileStatus#-2141693797 failed with error:[  at Fri Jun 17 12:08:43 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)]

Failure of dccp contains less useful output

dccp -d 15 -p 20000:25000 dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root /tmp/derek-dccp

...
Sending control message: 2 0 client open "dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root" r t3ui03.psi.ch 20000 -timeout=-1 -onerror=default -uid=3896
Server error message for [2]: "No such file or directory" (err code: 2, errno: (null)).
dcap_pool:  POLLIN on control line [3] id=1
Polling data for destination[4] queueID[2].
[-1] unpluging node
Removing unneeded queue [2]
[-1] destroing node
Failed open file in the dCache.
Can't open source file : "No such file or directory"
System error: Input/output error

Files have missing OSM info. Storageinfoof command fails

Listing of the storageinfo in the adminshell fails.

echo /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root|dc_get_ID_from_pnfsnamelist.sh |dc_get_storageinfo_from_IDlist.sh 

0002000000000000027647F0 storageinfoof failed : OSM info not found in /pnfs/fs/.(access)(00020000000000000273F0C0)(type=--I--d-----)

Note, that we can get other info on the file and that we can also confirm that the local file on the pool exists!

echo /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/NTupleProducer_41X_MC_97_1_iAx.root|dc_get_ID_from_pnfsnamelist.sh |dc_get_cacheinfo_from_IDlist.sh 

0002000000000000027647F0 t3fs09_cms_1

Missing pnfs tags for directories

All files in the directory /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6 I tested, failed to copy with the identical error. It turns out that the directory seems to have a corrupted tag for the OSMTemplate value. All other tags which should have been inherited also cannot be read:

[root@t3se01 store]# dir=/pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(OSMTemplate): No such file or directory
.(tag)(OSMTemplate)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(sGroup): No such file or directory
.(tag)(sGroup)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(AccessLatency): No such file or directory
.(tag)(AccessLatency)=
cat: /pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6/.(tag)(RetentionPolicy): No such file or directory
.(tag)(RetentionPolicy)=

Testing the directory above which is ok shows the correct values:

dir=/pnfs/psi.ch/cms/trivcat/store/user/susy/ntuples/mc/V02-01-01/; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
.(tag)(OSMTemplate)=StoreName cms
.(tag)(sGroup)=cms
.(tag)(AccessLatency)=ONLINE
.(tag)(RetentionPolicy)=OUTPUT

I ran a find over the complete namespace to locate all corrupted directories

 for dir in $(find . -type d); do info=$(ls -ld $dir); echo -n "$info    "; cat $dir/'.(tag)(OSMTemplate)'; done &> ~/OSMTemplate-corrupt-dirs-20110617-lst

Out of 13802 directories, the following 40 were found with erroneous OSMTemplate tag:

 grep -v StoreName ~/OSMTemplate-corrupt-dirs-20110617-lst | sed -e 's#cat: \(.*\)/\..*#\1#' | xargs -i ls -ld {}

drwxrwxr-x  1 cmsuser cms 512 Apr 14 16:04 ./user/pablom/TnP/ElectronsNew/data/mt2/JOB_PD_EG_PromptReco_Run2010B_TnP_ELE_JZB_ANTIPANIC_ANTIMULTITRG_DATA
drwxrwxr-x  1 cmsuser cms 512 Apr 14 15:51 ./user/pablom/TnP/ElectronsNew/data/mt2/JOB_PD_EG_17thSepReReco_Run2010A_TnP_ELE_JZB_ANTIPANIC_ANTIMULTITRG_DATA
drwxrwxr-x  1 cmsuser cms 512 Dec 16  2010 ./user/bortigno/WJets/FASTSIM_bug
drwxrwxr-x  1 cmsuser cms 512 Apr 21 20:22 ./user/susy/trash-me
drwxrwxr-x  1 cmsuser cms 512 Apr 15 15:02 ./user/susy/trash-me/QCD_TuneD6T_HT-500To1000_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:46 ./user/susy/trash-me/GJets_TuneD6T_HT-200_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:45 ./user/susy/trash-me/TToBLNu_TuneZ2_tW-channel_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:44 ./user/susy/trash-me/GJets_TuneD6T_HT-100To200_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:42 ./user/susy/trash-me/DYJetsToLL_TuneZ2_M-50_7TeV-madgraph-tauola
drwxrwxr-x  1 cmsuser cms 512 Apr 19 07:26 ./user/susy/trash-me/VVJetsTo4L_TuneD6T_7TeV-madgraph-tauola
drwxrwxr-x  1 cmsuser cms 512 Apr 15 15:19 ./user/susy/trash-me/LM12_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM13_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 19 07:24 ./user/susy/trash-me/TToBLNu_TuneZ2_s-channel_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:37 ./user/susy/trash-me/GJets_TuneD6T_HT-40To100_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM3_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 20 00:36 ./user/susy/trash-me/TToBLNu_TuneZ2_t-channel_7TeV-madgraph
drwxrwxr-x  1 cmsuser cms 512 Apr 15 15:29 ./user/susy/trash-me/LM4_SUSY_sftsht_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 Mar 22 16:45 ./user/susy/ntuples/mc/V01-11-02/GMSBLongLivedNeutralino_CTau-2000_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 Mar 19 17:46 ./user/susy/ntuples/mc/V01-11-02/WWtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x  1 cmsuser cms 512 Mar 19 23:51 ./user/susy/ntuples/mc/V01-11-02/WZtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x  1 cmsuser cms 512 Mar 20 00:30 ./user/susy/ntuples/mc/V01-11-02/ZZtoAnything_TuneZ2_7TeV-pythia6-tauola
drwxrwxr-x  1 cmsuser cms 512 Mar 19 19:44 ./user/susy/ntuples/mc/V01-11-02/WWTo2L2Nu_TuneZ2_7TeV-pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 12:07 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_5to15_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 20:13 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_120to170_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 12:06 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_300to470_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 15:13 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_470to600_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 02:17 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_30to50_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 15:27 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_80to120_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 23:45 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_600to800_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  2 02:57 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_800to1000_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 05:10 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_50to80_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 01:18 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1400to1800_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 30 22:42 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1000to1400_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 30 23:12 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_170to300_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 30 15:28 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_15to30_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 May  1 03:32 ./user/susy/ntuples/mc/V02-01-01/QCD_Pt_1800_TuneZ2_7TeV_pythia6
drwxrwxr-x  1 cmsuser cms 512 Apr 30 13:39 ./user/susy/ntuples/data/V02-01-01/DoubleMu-Run2011A-PromptReco-v2-AOD
drwxrwxr-x  1 cmsuser cms 512 Apr 30 15:38 ./user/susy/ntuples/data/V02-01-01/MuEG-Run2011A-PromptReco-v1-AOD
drwxrwxr-x  1 cmsuser cms 512 Apr 30 15:15 ./user/susy/ntuples/data/V02-01-01/MuEG-Run2011A-PromptReco-v2-AOD
drwxrwxr-x  1 cmsuser cms 512 Apr 30 14:20 ./user/susy/ntuples/data/V02-01-01/DoubleMu-Run2011A-PromptReco-v1-AOD

Solution or Workaround

We got a fast reply from Patrick Fuhrmann through dcache support. He proposed the following way to fix the problem, which turned out to work fine with the one file I tested with. Great!

  1. please create a ./user/susy/trash-me-new
    cd /pnfs/psi.ch/cms/trivcat/store/user/susy
    mkdir trash-me-ne
    
  2. make sure the new directory resides on the same database as the old (first 4 digits of the pnfs id must be identical. This is trivially true for our installation here at PSI, since we only use one DB for all CMS)
    cat ".(id)(trash-me)"
    00020000000000000244AAA0
    
    cat ".(id)(trash-me-new)"
    000200000000000002F47FE0
    
  3. if not already the case, adjust the tags in trash-me-new so that they are identical to trash-me and adjust the permissions of trash-me-new.
    • I cannot read the tags on the old directory, but the trash-me-new directory has the tags I would expect it to have
       dir=trash-me-new; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
      .(tag)(OSMTemplate)=StoreName cms
      .(tag)(sGroup)=cms
      .(tag)(AccessLatency)=ONLINE
      .(tag)(RetentionPolicy)=OUTPUT
      
      dir=trash-me; for n in $(cat $dir/'.(tags)()'); do r=$(cat $dir/$n); echo "$n=$r"; done
      cat: trash-me/.(tag)(OSMTemplate): No such file or directory
      .(tag)(OSMTemplate)=
      cat: trash-me/.(tag)(sGroup): No such file or directory
      .(tag)(sGroup)=
      cat: trash-me/.(tag)(AccessLatency): No such file or directory
      .(tag)(AccessLatency)=
      cat: trash-me/.(tag)(RetentionPolicy): No such file or directory
      .(tag)(RetentionPolicy)=
      
  4. try to move (mv) a file from trash-me to trash-me-new.
    • confirming first that I indeed fail to copy the old file
      srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me/NTupleProducer_41X_MC_39_1_rMW.root file:////tmp/derek-srmcp1
      
      ...
      RequestFileStatus#-2141685188 failed with error:[  at Fri Jun 17 15:50:41 CEST 2011 state Failed : file not found : OSM info not found in /pnfs/fs/.(access)(00020000000000000244AAA0)(type=--I--d-----)]
      
      Fri Jun 17 15:50:45 CEST 2011: stopping copier
      going to stop....
      Exception in thread "main" java.lang.Exception:  stopped 
      	at gov.fnal.srm.util.Copier.run(Copier.java:244)
      	at java.lang.Thread.run(Thread.java:636)
      Fri Jun 17 15:50:45 CEST 2011: stopping copier
      
    • moving the file in pnfs
      mv trash-me/NTupleProducer_41X_MC_39_1_rMW.root trash-me-new/
      
    • trying to read the storageinfo of the moved file from the admin shell DONE
      echo /pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me-new/NTupleProducer_41X_MC_39_1_rMW.root | dc_get_ID_from_pnfsnamelist.sh | dc_get_storageinfo_from_IDlist.sh
      
      00020000000000000244AB88 size=83155077;new=false;stored=false;sClass=cms:cms;cClass=-;hsm=osm;accessLatency=ONLINE;retentionPolicy=OUTPUT;flag-h=no;flag-rp=OUTPUT;flag-c=1:c8c5f6eb;StoreName=cms;flag-al=ONLINE;flag-l=83155077;store=cms;group=cms;bfid=;
      
  5. try to read the file by srmcp
    srmcp -2 -streams_num=1 -globus_tcp_port_range=20000,25000 -debug srm://t3se01.psi.ch:8443/srm/managerv2?SFN=//pnfs/psi.ch/cms/trivcat/store/user/susy/trash-me-new/NTupleProducer_41X_MC_39_1_rMW.root file:////tmp/derek-srmcp1
    

WORKS!!!!! smile

Now we need to do this in an automated way for all the problematic files/directories.

Repair script

You can find an example repair script which also does a fair amount of sanity checks here: repair-dcache-dir-tags-by-copying.sh

Monitoring for this condition

As described above, you can run a find over the complete namespace and try to read out the tags. The damaged directories will throw an error.

 for dir in $(find . -type d); do info=$(ls -ld $dir); echo -n "$info    "; cat $dir/'.(tag)(OSMTemplate)'; done

-- DerekFeichtinger - 2011-06-17

IssueForm
Affected Service dcache reading
Symptom summary User cannot retrieve certain files which clearly exist on our fileservers and also are correctly listed in the pnfs namespace
Reason Understood yes
Solution Exists yes
Obsolete no
Topic attachments
I Attachment History Action Size Date Who Comment
Unix shell scriptsh repair-dcache-dir-tags-by-copying.sh r1 manage 3.6 K 2011-06-28 - 08:09 DerekFeichtinger Example repair script
Edit | Attach | Watch | Print version | History: r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r3 - 2011-06-28 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback