
17. 09. 2009 lcg-cp stageout problems from CRAB jobs

NOTE: This problem was reported in this hypernews item and is tracked in this Savannah support request. It was also submitted to the dcache support list on 2009-09-18 as tracker item #5109.

Andrea Rizzi and Andreas Schaetti reported stageout failures from their CRAB jobs.

The relevant part of the CRAB log output is

########## contents of SE interaction
2009-09-17 15:15:12.751466:
Executed:       lcg-ls -b -D srmv2  -t 2400 --verbose srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root
Done with exit code:    256
and output:
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root: [SE][Ls][SRM_INVALID_PATH] could not get storage info by path : CacheException(rc=10001;msg=path /pnfs/fs/usr/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root not found ( .(id)(hwbbar115_10TeV_GEN_MC_2.root) ))
SE type: SRMv2

2009-09-17 15:15:13.890772:
Executed:        lcg-cp  --verbose  --vo=cms  -b -D srmv2  -t 2400 --verbose file:///home/egee/cms074/globus-tmp.wn36.5872.0/https_3a_2f_2fwms213.cern.ch_3a9000_2fvgkrUkMs0YPpBCGY4QTPjg/CMSSW_3_1_2/hwbbar115_10TeV_GEN_MC_2.root srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root
Done with exit code:    256
and output:
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
[SE][Mkdir][SRM_INVALID_PATH] srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root: srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b : parent path or a component of the parent path does not exist
lcg_cp: No such file or directory

Andrea Rizzi's user directory exists, but none of the required subdirectories exist. It seems that lcg-cp does not automatically create all the subdirectories needed for a request. The jobs seem to run fine at T2_IT_Pisa.
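
To see at a glance which SE operation failed in logs like the one above, the status triple printed by the lcg tools (for example [SE][Mkdir][SRM_INVALID_PATH]) can be pulled out with a small filter. The following is only a sketch; the function name srm_errors is made up for illustration and is not part of CRAB or lcg-util.

```shell
# Print the operation and SRM status of every SE error line read from stdin,
# e.g. "[SE][Mkdir][SRM_INVALID_PATH] srm://..." becomes "Mkdir SRM_INVALID_PATH".
# Lines without such a triple are dropped.
# The function name is illustrative only.
srm_errors() {
    sed -n 's/.*\[SE\]\[\([A-Za-z]*\)\]\[\(SRM_[A-Z_]*\)\].*/\1 \2/p'
}

# Hypothetical usage on a saved CRAB log file:
#   srm_errors < crab_job.log | sort | uniq -c
```

Applied to the log excerpt above, this would report one Ls and one Mkdir failure, both with status SRM_INVALID_PATH.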

lcg-cp refuses to create more than one subdirectory layer at T2_CH_CSCS - this seems intentional!

lcg-cp (executed from CSCS-UI) works when one subdirectory has to be created implicitly, but fails when two nested directories have to be created. This behavior seems to be intentional: dcache responds with a specific error message saying that it cannot create the nested directory because the parent directory does not exist.

I was able to confirm this path creation behavior in a few tests. Note that our site was running dcache-1.9.3-3 at the time of these tests.

  • DONE First I confirm that the path /pnfs/lcg.cscs.ch/cms/local_tests exists
    lcg-ls -b -D srmv2 --srm-timeout 2400 --verbose srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests
    SE type: SRMv2
    
    /pnfs/lcg.cscs.ch/cms/local_tests/automatic_test-20080904-2021-8387-srm2b
    /pnfs/lcg.cscs.ch/cms/local_tests/automatic_test-20081207-1239-8889-gftp
    ...
    
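  • No Now I try to copy a file into two nested subdirectories that do not yet exist below this directory. This fails at the same Mkdir step as the CRAB job, although the status code here is SRM_FAILURE rather than SRM_INVALID_PATH.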
    lcg-cp  --verbose  --vo=cms  -b -D srmv2  -t 2400 --verbose file:///tmp/dcachetest-20090917-1352-3942/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2/lcg-cp-derek1
    
    Warning: -t,--timeout is deprecated! Use --timeout-* options instead
    Using grid catalog type: UNKNOWN
    Using grid catalog : (null)
    VO name: cms
    Checksum type: None
    Destination SE type: SRMv2
    [SE][Mkdir][SRM_FAILURE] srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2/lcg-cp-derek1: srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/derekdir1/derekdir2 not found ( .(id)(derekdir2) )
    lcg_cp: Invalid argument
    
  • DONE Now I try the same copy, but with only one subdirectory in the request, and this succeeds
    lcg-cp  --verbose  --vo=cms  -b -D srmv2  -t 2400 --verbose file:///tmp/dcachetest-20090917-1352-3942/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/lcg-cp-derek1
    
    Warning: -t,--timeout is deprecated! Use --timeout-* options instead
    Using grid catalog type: UNKNOWN
    Using grid catalog : (null)
    VO name: cms
    Checksum type: None
    Destination SE type: SRMv2
    Destination SRM Request Token: -2136239017
    Source URL: file:/tmp/dcachetest-20090917-1352-3942/srcfile
    File size: 51200
    Source URL for copy: file:/tmp/dcachetest-20090917-1352-3942/srcfile
    Destination URL: gsiftp://se16.lcg.cscs.ch:2811//pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/lcg-cp-derek1
    # streams: 1
            51200 bytes     49.72 KB/sec avg     49.72 KB/sec inst
    Transfer took 2020 ms
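
The difference between the failing and the succeeding request is the number of directories that have to be created implicitly below the existing local_tests directory. As a rough sketch, the helper below (a hypothetical name, not part of any grid tool) lists these directories for a given destination path. It prints two lines for the failing request and one for the succeeding one.

```shell
# List every directory between an existing base directory and a target file,
# shallowest first. Both arguments are absolute paths.
# Returns non-zero if the target does not lie below the base.
# Function name and calling convention are illustrative only.
missing_parents() {
    base=${1%/}
    case $2 in
        "$base"/*) ;;
        *) return 1 ;;
    esac
    rest=${2#"$base"/}
    dir=$base
    # Walk the path components, stopping before the last one (the file name).
    while [ "${rest#*/}" != "$rest" ]; do
        dir=$dir/${rest%%/*}
        rest=${rest#*/}
        printf '%s\n' "$dir"
    done
}

# Hypothetical usage, one srmmkdir call per level:
#   missing_parents /pnfs/lcg.cscs.ch/cms/local_tests \
#       /pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2/lcg-cp-derek1 |
#   while read -r d; do
#       srmmkdir "srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=$d"
#   done
```

Creating the listed directories explicitly with srmmkdir before the copy might serve as a workaround, but this is an assumption that was not part of the tests reported here.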
    

lcg-cp correctly creates multiple subdirectory layers at T2_IT_Pisa

Here I can confirm that the creation of two layers of subdirectories is working at T2_IT_Pisa. The lcg-cp is again executed from CSCS-UI, so any differences observed must be attributed to the SE.

  • DONE Creation of a test user directory for my username
    srmmkdir srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti
    
  • DONE Transfer of a simple file
    lcg-cp --verbose --vo=cms -b -D srmv2  -t 2400 --verbose file:///tmp/dcachetest-20090917-1205-24206/srcfile   srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti/lcg-cp-derek5
    Warning: -t,--timeout is deprecated! Use --timeout-* options instead
    Using grid catalog type: UNKNOWN
    Using grid catalog : (null)
    VO name: cms
    Checksum type: None
    Destination SE type: SRMv2
    Destination SRM Request Token: -2141283420
    Source URL: file:/tmp/dcachetest-20090917-1205-24206/srcfile
    File size: 51200
    Source URL for copy: file:/tmp/dcachetest-20090917-1205-24206/srcfile
    Destination URL: gsiftp://cmsdcache10.pi.infn.it:2811//pnfs/pi.infn.it/data/cms/store/user/dfeichti/lcg-cp-derek5
    # streams: 1
            51200 bytes     42.96 KB/sec avg     42.96 KB/sec inst
    Transfer took 2060 ms
    
  • DONE Transfer of a file with creation of two directory layers
    lcg-cp --verbose --vo=cms -b -D srmv2  -t 2400 --verbose file:///tmp/dcachetest-20090917-1205-24206/srcfile   srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti/subdir1/subdir2/lcg-cp-derek5
    Warning: -t,--timeout is deprecated! Use --timeout-* options instead
    Using grid catalog type: UNKNOWN
    Using grid catalog : (null)
    VO name: cms
    Checksum type: None
    Destination SE type: SRMv2
    Destination SRM Request Token: -2141283364
    Source URL: file:/tmp/dcachetest-20090917-1205-24206/srcfile
    File size: 51200
    Source URL for copy: file:/tmp/dcachetest-20090917-1205-24206/srcfile
    Destination URL: gsiftp://cmsdcache7.pi.infn.it:2811//pnfs/pi.infn.it/data/cms/store/user/dfeichti/subdir1/subdir2/lcg-cp-derek5
    # streams: 1
            51200 bytes     44.34 KB/sec avg     44.34 KB/sec inst
    Transfer took 2060 ms
    

srmcp succeeds in creating nested subdirectories at CSCS

Unlike lcg-cp, srmcp has no problem creating the two implicit subdirectories.

Executing from CSCS UI:

srmcp --debug -2 file:////tmp/dcachetest-20090917-1205-24206/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/dfsub1/dfsub2/df1

WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
Storage Resource Manager (SRM) Client version 2.1.2
Copyright (c) 2002-2008 Fermi National Accelerator Laboratory

SRM Configuration:
	default_port=8443
	debug=true
...
...
execution of CopyJob, source = file:////tmp/dcachetest-20090917-1205-24206/srcfile destination = gsiftp://se25.lcg.cscs.ch:2811//pnfs/lcg.cscs.ch/cms/local_tests/dfsub1/dfsub2/df1 completed
SRMClientV2 : srmPutDone , contacting service httpg://storage01.lcg.cscs.ch:8443/srm/managerv2
srmPutDone status code=SRM_SUCCESS
copy_jobs is empty
stopping copier

Differences between T2_CH_CSCS and T2_IT_PISA

As noted above, the lcg-cp tests were all executed from CSCS-UI, so a difference in lcg-cp version cannot be responsible for the different behavior. My guess is that the cause is either the dcache version or the dcache configuration.

|  | T2_CH_CSCS | T2_IT_PISA |
| Storage Manager | dcache-1.9.3 | dcache-1.8.0-15p5 |
| namespace | pnfs | pnfs |
| lcg-util version | 1.7.6-1 | 1.7.4-1 |
| GFAL-client | 1.11.8-1 | 1.11.6-2 |

dcache configuration?

On the T2_CH_CSCS dcache, the recursive directory creation is correctly enabled:

#  ---- Enable automatic creation of directories.
#
#  Allow automatic creation of directories via SRM
#
#  allow=true, disallow=false
#
RecursiveDirectoryCreation=true

A look at the srm.batch file, which sets the property defaults, confirms this:

set context -c RecursiveDirectoryCreation  true
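
A quick way to double-check such a setting on the command line is to print the uncommented assignments of the key. This is a minimal sketch; the function name is illustrative, and the path /opt/d-cache/config/dCacheSetup in the usage comment is an assumption about where the setup file lives on the SE.

```shell
# Print the value of every uncommented "KEY=value" assignment of KEY in FILE.
# Commented-out lines (starting with "#") never match, so they are ignored.
# Usage: config_values KEY FILE   (name and arguments are illustrative)
config_values() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2"
}

# Hypothetical usage (the file location is an assumption):
#   config_values RecursiveDirectoryCreation /opt/d-cache/config/dCacheSetup
```

If the key is only present in a commented-out line, the function prints nothing.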

The behavior at CSCS is inconsistent for lcg-cp (but not for srmcp)

It turns out that 2-layer directory creation sometimes succeeds at CSCS. I therefore used a small script to run a larger number of tests against each of a few SEs.

All tests ran from the CSCS UI

lcg-cp 2-layer implicit directory creation
| SE | dcache version | namespace | Failures/Total tries |
| CSCS | 1.9.3-3 | pnfs | 9/20 |
| Estonia | 1.9.3-3 | pnfs | 0/20 |
| PSI | 1.9.2-4 | pnfs | 0/20 |
| Pisa | 1.8.0-p15 | pnfs | 0/20 |

Estonia runs the exact same dcache version as we do, and they also still use pnfs. All tests I ran against their site succeeded, so this points to some local problem at CSCS. My suspicion falls mostly on the pnfs namespace... Still, the fact that lcg-cp and srmcp behave so differently on our site is a bit unsettling.
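
The script used for these tests is not reproduced here. A minimal sketch of such a test loop, under the assumption that the copy command signals failure through a non-zero exit code (as in the CRAB log above), could look like this. The names count_failures and try_copy are hypothetical.

```shell
# Run a command a fixed number of times and print "failures/tries",
# the format used in the result tables. Output of the command is discarded.
# Usage: count_failures TRIES COMMAND [ARGS...]
count_failures() {
    tries=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Any non-zero exit status of the command counts as one failure.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    printf '%s/%s\n' "$failures" "$tries"
}

# Hypothetical usage, where try_copy is a user-defined function that runs
# one lcg-cp (or srmcp) transfer into a fresh two-level directory:
#   count_failures 20 try_copy
```

Each try has to target a fresh pair of directories that do not exist yet, otherwise later tries no longer exercise the implicit directory creation.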

1-layer directory creation always succeeds

lcg-cp 1-layer implicit directory creation
| SE | dcache version | Failures |
| CSCS | 1.9.3-3 | 0/20 |

Running the tests with srmcp against CSCS always succeeds

srmcp 2-layer implicit directory creation
| SE | dcache version | Failures |
| CSCS | 1.9.3-3 | 0/20 |

04. 02. 2010 Problem solved after updates to dcache 1.9.x

Running the tests against the newer dcache versions at CSCS now always succeeds.

-- DerekFeichtinger - 2009-09-17


Topic revision: r9 - 2010-02-04 - DerekFeichtinger
 