17. 09. 2009 lcg-cp stageout problems from CRAB jobs
NOTE: This problem was reported on this hypernews item. The problem is tracked on this Savannah support request. It has also been submitted to the dcache support list on 2009-09-18 as tracker item #5109.
Andrea Rizzi and Andreas Schaetti reported stageout failures from their CRAB jobs.
The relevant part of the CRAB log output is:
########## contents of SE interaction
2009-09-17 15:15:12.751466:
Executed: lcg-ls -b -D srmv2 -t 2400 --verbose srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root
Done with exit code: 256
and output:
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root: [SE][Ls][SRM_INVALID_PATH] could not get storage info by path : CacheException(rc=10001;msg=path /pnfs/fs/usr/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root not found ( .(id)(hwbbar115_10TeV_GEN_MC_2.root) ))
SE type: SRMv2
2009-09-17 15:15:13.890772:
Executed: lcg-cp --verbose --vo=cms -b -D srmv2 -t 2400 --verbose file:///home/egee/cms074/globus-tmp.wn36.5872.0/https_3a_2f_2fwms213.cern.ch_3a9000_2fvgkrUkMs0YPpBCGY4QTPjg/CMSSW_3_1_2/hwbbar115_10TeV_GEN_MC_2.root srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root
Done with exit code: 256
and output:
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
[SE][Mkdir][SRM_INVALID_PATH] srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b/hwbbar115_10TeV_GEN_MC_2.root: srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/user/arizzi/WH_HTobb_Pt100_M115_GEN_v2/WH_HTobb_Pt100_M115_GEN_v2/3804f52f25a016d6eb88c4371b906f7b : parent path or a component of the parent path does not exist
lcg_cp: No such file or directory
Andrea Rizzi's user directory exists, but none of the subdirectories exist. It seems that lcg-cp does not automatically create all the subdirectories required for a request. The jobs seem to run fine at T2_IT_Pisa.
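To pick this failure class out of a batch of job logs, one can search for the [SE][Mkdir] error tag that lcg-cp prints in the output above. The following is only a sketch: the function name classify_lcgcp_output is made up for illustration, and it assumes the lcg-cp output is fed on standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRAB or lcg-util):
# reads lcg-cp output on stdin and prints "mkdir-failure" if the SE
# reported an error in the Mkdir step, otherwise "other".
classify_lcgcp_output() {
    if grep -q '\[SE\]\[Mkdir\]\[SRM_[A-Z_]*\]'; then
        echo "mkdir-failure"
    else
        echo "other"
    fi
}

# Example (the log file name is an assumption):
#   classify_lcgcp_output < crab_job.stdout
```

The pattern matches both status codes seen in this entry, SRM_INVALID_PATH and SRM_FAILURE, but not the [SE][Ls] message from the preceding lcg-ls call.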
lcg-cp refuses to create more than one subdirectory layer at T2_CH_CSCS - this seems intentional!
lcg-cp (executed from the CSCS UI) works when one subdirectory has to be created implicitly, but fails when two have to be created. This behavior seems to be intentional: dcache responds with a specific error message saying that it cannot create the nested directory because the parent directory does not exist.
I was able to confirm the path creation behavior in a few tests. Note that our site was running dcache-1.9.3-3 at the time of these tests.
- First I confirm that the path /pnfs/lcg.cscs.ch/cms/local_tests exists
lcg-ls -b -D srmv2 --srm-timeout 2400 --verbose srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests
SE type: SRMv2
/pnfs/lcg.cscs.ch/cms/local_tests/automatic_test-20080904-2021-8387-srm2b
/pnfs/lcg.cscs.ch/cms/local_tests/automatic_test-20081207-1239-8889-gftp
...
- Now I try to copy a file nested in two subdirectories to this directory. This fails in the same Mkdir step as the CRAB jobs (here reported as SRM_FAILURE rather than SRM_INVALID_PATH).
lcg-cp --verbose --vo=cms -b -D srmv2 -t 2400 --verbose file:///tmp/dcachetest-20090917-1352-3942/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2/lcg-cp-derek1
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
[SE][Mkdir][SRM_FAILURE] srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2/lcg-cp-derek1: srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/derekdir2 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/derekdir1/derekdir2 not found ( .(id)(derekdir2) )
lcg_cp: Invalid argument
- Now I try the same copy, but with only one subdirectory in the request, and this succeeds
lcg-cp --verbose --vo=cms -b -D srmv2 -t 2400 --verbose file:///tmp/dcachetest-20090917-1352-3942/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/lcg-cp-derek1
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
Destination SRM Request Token: -2136239017
Source URL: file:/tmp/dcachetest-20090917-1352-3942/srcfile
File size: 51200
Source URL for copy: file:/tmp/dcachetest-20090917-1352-3942/srcfile
Destination URL: gsiftp://se16.lcg.cscs.ch:2811//pnfs/lcg.cscs.ch/cms/local_tests/derekdir1/lcg-cp-derek1
# streams: 1
51200 bytes 49.72 KB/sec avg 49.72 KB/sec inst
Transfer took 2020 ms
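Since creation of a single directory layer works, one possible workaround is to create each missing directory level explicitly before the copy. This was not tried in the tests above, so treat it as an assumption. The sketch below only computes the directory SURLs to create; parent_surls is a hypothetical helper name, and the commented usage assumes that srmmkdir can be called with just a SURL, as it is elsewhere in this entry.

```shell
#!/bin/sh
# Hypothetical helper: print, one per line and outermost first, every
# directory SURL between an existing base directory and the target file.
#   $1: SURL of a directory that already exists (no trailing slash)
#   $2: destination path relative to $1, e.g. derekdir1/derekdir2/lcg-cp-derek1
parent_surls() {
    (
        # Subshell keeps the IFS and globbing changes local to this function.
        set -f
        cur=$1
        dir=$(dirname "$2")           # strip the file name
        [ "$dir" = "." ] && exit 0    # file sits directly in the base directory
        IFS=/
        for part in $dir; do
            cur="$cur/$part"
            printf '%s\n' "$cur"
        done
    )
}

# Possible usage (untested assumption; srmmkdir may report an error for
# directories that already exist):
#   base='srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests'
#   parent_surls "$base" derekdir1/derekdir2/lcg-cp-derek1 | while read -r d; do
#       srmmkdir "$d"
#   done
```

Creating the levels one at a time keeps every request within the case that succeeded in the test above.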
lcg-cp correctly creates multiple subdirectory layers at T2_IT_Pisa
Here I can confirm that the creation of two layers of subdirectories is working at T2_IT_Pisa. The lcg-cp is again executed from CSCS-UI, so any differences observed must be attributed to the SE.
- Creation of a test user directory for my username
srmmkdir srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti
- Transfer of a simple file
lcg-cp --verbose --vo=cms -b -D srmv2 -t 2400 --verbose file:///tmp/dcachetest-20090917-1205-24206/srcfile srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti/lcg-cp-derek5
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
Destination SRM Request Token: -2141283420
Source URL: file:/tmp/dcachetest-20090917-1205-24206/srcfile
File size: 51200
Source URL for copy: file:/tmp/dcachetest-20090917-1205-24206/srcfile
Destination URL: gsiftp://cmsdcache10.pi.infn.it:2811//pnfs/pi.infn.it/data/cms/store/user/dfeichti/lcg-cp-derek5
# streams: 1
51200 bytes 42.96 KB/sec avg 42.96 KB/sec inst
Transfer took 2060 ms
- Transfer of a file with creation of two directory layers
lcg-cp --verbose --vo=cms -b -D srmv2 -t 2400 --verbose file:///tmp/dcachetest-20090917-1205-24206/srcfile srm://cmsdcache.pi.infn.it:8443/srm/managerv2?SFN=/pnfs/pi.infn.it/data/cms/store/user/dfeichti/subdir1/subdir2/lcg-cp-derek5
Warning: -t,--timeout is deprecated! Use --timeout-* options instead
Using grid catalog type: UNKNOWN
Using grid catalog : (null)
VO name: cms
Checksum type: None
Destination SE type: SRMv2
Destination SRM Request Token: -2141283364
Source URL: file:/tmp/dcachetest-20090917-1205-24206/srcfile
File size: 51200
Source URL for copy: file:/tmp/dcachetest-20090917-1205-24206/srcfile
Destination URL: gsiftp://cmsdcache7.pi.infn.it:2811//pnfs/pi.infn.it/data/cms/store/user/dfeichti/subdir1/subdir2/lcg-cp-derek5
# streams: 1
51200 bytes 44.34 KB/sec avg 44.34 KB/sec inst
Transfer took 2060 ms
srmcp succeeds in creating nested subdirectories at CSCS
Contrary to lcg-cp, srmcp has no problem creating the two implicit subdirectories.
Executed from the CSCS UI:
srmcp --debug -2 file:////tmp/dcachetest-20090917-1205-24206/srcfile srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests/dfsub1/dfsub2/df1
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/opt/d-cache/srm
Storage Resource Manager (SRM) Client version 2.1.2
Copyright (c) 2002-2008 Fermi National Accelerator Laboratory
SRM Configuration:
default_port=8443
debug=true
...
...
execution of CopyJob, source = file:////tmp/dcachetest-20090917-1205-24206/srcfile destination = gsiftp://se25.lcg.cscs.ch:2811//pnfs/lcg.cscs.ch/cms/local_tests/dfsub1/dfsub2/df1 completed
SRMClientV2 : srmPutDone , contacting service httpg://storage01.lcg.cscs.ch:8443/srm/managerv2
srmPutDone status code=SRM_SUCCESS
copy_jobs is empty
stopping copier
Differences between T2_CH_CSCS and T2_IT_PISA
As noted above, the lcg-cp tests were all executed from the CSCS UI, so a difference in lcg-cp version cannot be responsible for the different behavior. My guess is that either the dcache version or the dcache configuration is responsible.
| | T2_CH_CSCS | T2_IT_PISA |
| Storage Manager | dcache-1.9.3 | dcache-1.8.0-15p5 |
| namespace | pnfs | pnfs |
| lcg-util version | 1.7.6-1 | 1.7.4-1 |
| GFAL-client | 1.11.8-1 | 1.11.6-2 |
dcache configuration?
On the T2_CH_CSCS dcache, the recursive directory creation is correctly enabled:
# ---- Enable automatic creation of directories.
#
# Allow automatic creation of directories via SRM
#
# allow=true, disallow=false
#
RecursiveDirectoryCreation=true
A look at the srm.batch file, which sets the property defaults, confirms this:
set context -c RecursiveDirectoryCreation true
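A quick way to repeat this check on another dcache installation is to filter the relevant files for the option name while skipping comment lines. This is a minimal sketch: recursive_mkdir_setting is a hypothetical helper, and the file locations are not given in this entry, so the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print the non-comment lines of a config file
# that mention RecursiveDirectoryCreation.
#   $1: path to the file to inspect
recursive_mkdir_setting() {
    # Drop comment lines first, then keep the lines naming the option.
    grep -v '^[[:space:]]*#' "$1" | grep 'RecursiveDirectoryCreation'
}

# Example (placeholder path; the real location depends on the installation):
#   recursive_mkdir_setting /path/to/srm.batch
```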
The behavior at CSCS is inconsistent for lcg-cp (but not for srmcp)
It turns out that 2-layer directory creation sometimes succeeds at CSCS.
Therefore I used a small script to run a larger number of tests each against a few SEs.
All tests ran from the CSCS UI
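The script itself is not reproduced in this log. A minimal sketch of such a repeated test could look like the following, assuming the same lcg-cp options as in the manual tests above; the function name, the COPY_CMD override and the test path layout are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical re-creation of a repeated two-layer copy test; this is
# not the original script. COPY_CMD can be overridden, e.g. with a stub
# for a dry run.
COPY_CMD=${COPY_CMD:-"lcg-cp --vo=cms -b -D srmv2 -t 2400"}

#   $1: number of attempts
#   $2: local source URL, e.g. file:///tmp/srcfile
#   $3: SURL of an existing directory on the SE under test
# Prints "<successes> <failures>".
run_two_layer_tests() {
    n=$1
    src=$2
    base=$3
    ok=0
    fail=0
    i=1
    while [ "$i" -le "$n" ]; do
        # Fresh two-layer path per attempt, so both directories must be created.
        dest="$base/test-$$-$i/sub/file"
        if $COPY_CMD "$src" "$dest" >/dev/null 2>&1; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok $fail"
}

# Example:
#   run_two_layer_tests 20 file:///tmp/srcfile \
#     'srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/local_tests'
```

Each attempt uses a new directory pair, so a run leaves test directories behind on the SE that need to be cleaned up afterwards.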
Estonia runs the exact same dcache version as we do, and they also still have pnfs. All tests I did on their site succeeded, so this points to some local problem at CSCS. My suspicions are mostly targeted at the pnfs namespace... Still: The fact that lcg-cp and srmcp show such different behavior on our site is a bit unsettling.
1-layer directory creation always succeeds
Running the tests with srmcp against CSCS always succeeds
04. 02. 2010 Problem solved after updates to dcache 1.9.x
Running the test against the newer dcache versions at CSCS always shows successful runs.
--
DerekFeichtinger - 2009-09-17