OBSOLETE INFORMATION
Basic understanding of dCache for advanced users (archived pages)
The Storage Element (SE)
t3se01.psi.ch runs dCache, a Grid storage middleware which transparently combines the space made available by tens of fileservers into a single namespace called /pnfs; on top of it, dCache offers the Grid protocols dcap, gsidcap, root, srm and gsiftp, so that Grid tools like lcg-cp, xrdcp, srm-cp, dccp, gfal-copy, ... can upload/download files into/from this single namespace. Whenever a new file gets written into /pnfs, either by a T3 user or by the PhEDEx service, dCache randomly selects a filesystem to host it, regardless of the Grid protocol, the Grid tool, the user or the server used for the upload; typically more than a single file is written into a /pnfs subdir, and accordingly all the files inside that subdir end up randomly spread over all the available filesystems. This file distribution implements a load balancing among the available filesystems and avoids I/O bottlenecks. Over time newer, bigger and faster filesystems and fileservers replace their older peers, but these operations are transparently performed behind the scenes by a dCache administrator; T3 users won't notice such maintenance, although they will be affected by a dCache software upgrade or by a major fault on a fileserver.
dCache is neither the first middleware aggregating heterogeneous fileservers (see for instance Gluster or OrangeFS) nor probably the best one (e.g. it can't split a file into distributed chunks like Gluster does), but it supports the Grid context very well (VOMS authorizations, X509 management, Space Token support, Grid protocols, ...), so it's a good choice for our specific needs.
The Grid protocol/tool versatility offered by a dCache setup often confuses new T3 users: it's not always well integrated with third-party software like ROOT / hadd / CMSSW; it behaves differently depending on whether the file access comes from the T3 LAN or from a remote Internet site (WAN access), whether the CMSSW environment is loaded or not, and whether the CRAB environment is loaded or not; and it's subject to different policies set by the T3 admins. A non-negligible learning period is therefore required to grasp all these protocols, tools and corner cases.
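As an illustration of this versatility, the same /pnfs file can be fetched with several of the tools listed above. A minimal sketch, assuming the standard dCache door layout on t3se01.psi.ch; the <username> placeholder and the test.root file name are hypothetical:
$ # gsiftp door, via gfal-copy
$ gfal-copy gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/<username>/test.root file:///tmp/test.root
$ # xrootd door, via xrdcp
$ xrdcp root://t3se01.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/<username>/test.root /tmp/test.root
$ # dcap door, via dccp ( assumes the dcap door listens on its default port )
$ dccp dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/<username>/test.root /tmp/test.root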
To use the T3 SE service at its best, at least a basic understanding of the dCache internals is needed (a comparable effort is needed to properly use a new batch system). The basic unit of a dCache setup is a single filesystem, necessarily hosted inside a single fileserver; by its nature, each filesystem can sustain a certain number of concurrent streaming operations, like downloading a 1GB .root file, and a higher number of concurrent interactive operations, like opening a .root file from a batch job to read a fraction of it, do some computing, and after a while read another fraction. To differentiate these I/O cases dCache offers a FIFO I/O queue system per filesystem. It's up to the dCache administrator to select a reasonable threshold for both the streaming and the interactive case; at T3 those are max 4 concurrent streaming operations and max ~100 concurrent interactive operations. Further I/O requests get queued in their specific I/O queue and won't start until an I/O slot becomes available. A T3 user will notice these "stuck" cases because his/her file request won't start as usual; if so, write immediately to cms-tier3 AT lists.psi.ch, because 99% of the time that indicates an error.
More than one Grid protocol can be mapped to the same I/O queue by the dCache administrator; for instance at T3 the dcap and gsidcap Grid protocols use the same interactive I/O queue, regular. Such an overlap mitigates the lack of a comprehensive I/O queue limits system, since it implicitly implements the limit "max 100 dcap OR gsidcap connections" in the regular I/O queue associated with a single filesystem; ideally a dCache administrator would instead create an I/O queue for each Grid protocol and define a list of constraints involving more than a single I/O queue.
Presently dCache CAN'T enforce limits like:
- max active I/O slots per filesystem across all the I/O queues using that filesystem; it's only possible to define an isolated max active I/O slots per I/O queue
- max active I/O slots per user for a specific I/O queue
- max active I/O slots per user for all the I/O queues with the same name
- max active I/O slots per user for all the I/O queues
- max space used per user in /pnfs
All these missing limits mean that a single misbehaving user can globally affect the T3 SE service; the second case in particular has occurred more than once in the past. Only the T3 administrators can fix these situations, usually by identifying the culprit, killing his/her computational jobs and explaining what went wrong.
The I/O queue system and the Grid protocol mapping
With the following watch command (to be executed on a t3ui server) we can observe both the filesystems and their several I/O queues; the Movers column reports the sum of the Restores, Stores, P2P-Server, P2P-Client, regular, wan and xrootd Active/Max/Queued counters; each T3 user is affected only by the regular, wan and xrootd traffic:
$ watch --interval=1 --differences 'lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo | grep -v __________ | grep -v ops '
globus-url-copy
Copying a dir between two GridFTP servers - serial method
The globus-url-copy tool can copy a file, several files, and recursively (but serially) a whole dir from one GridFTP server to another; the file transfer occurs directly between the two GridFTP servers; you'll have to know the absolute paths on both the sender and the receiver side. In the next example we're going to copy the dir:
- gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
into:
- gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/
The path prefix /gpfs/ddn/srm/cms/ was discovered by an uberftp session against gsiftp://stormgf2.pi.infn.it (see the sketch below); if you're in doubt, contact the T3 administrators and we'll help you identify this kind of prefix; at T3 and at T2 the absolute path prefixes are always /pnfs/psi.ch/cms and /pnfs/lcg.cscs.ch/cms respectively.
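For instance, the remote layout can be probed with non-interactive uberftp listings like the following; a minimal sketch reusing the source dir of this example (whether the server root is browsable depends on the remote site):
$ # list the remote root to discover the local mount prefix ( here /gpfs/ddn/srm/cms )
$ uberftp -ls gsiftp://stormgf2.pi.infn.it/
$ # then drill down until the usual /store/user/... layout appears
$ uberftp -ls gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/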
The dir copy example:
$ globus-url-copy -continue-on-error -rst -nodcau -fast -vb -v -cd -r gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/ gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/
Source: gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
Dest: gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/
DYJetsToLL_M-50_HT-100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/
Source: gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
Dest: gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/
DYJetsToLL_M-50_HT-200to400_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/
...
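For a single file the same tool works without the recursive option; a minimal sketch reusing the flags of the command above (the somefile.root name is hypothetical):
$ globus-url-copy -nodcau -fast -vb -cd gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/somefile.root gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/somefile.root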
Copying a dir between two GridFTP servers with GNU parallel
The tools globus-url-copy, uberftp and GNU parallel can be used together to copy, in parallel, a dir between two GridFTP servers; in this example a C.Galloni /pnfs dir is copied into a MDefranc /pnfs dir. No files will be routed through the server running the globus-url-copy commands itself (e.g. your UI, or a WN); furthermore, since in a Grid environment each GridFTP server often acts as a transparent proxy to more than one GridFTP server, the copies will occur between a matrix (e.g. 2x2) of GridFTP servers; a bottleneck in the parallelism is more likely to come from the limited bandwidth available between the two data centres than from the total number of GridFTP servers involved. It's not compulsory, but we recommend running all the globus-url-copy commands inside a screen -L session (see the sketch below), so that the copies don't get interrupted just because the connection to the server where you started them drops; in any case it's safe to repeat the same globus-url-copy commands over and over again.
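A minimal sketch of the recommended screen usage (the session name gridcopies is arbitrary):
$ screen -L -S gridcopies   # start a named, logged session ; the log goes to ./screenlog.0
$ # ... run the globus-url-copy / GNU parallel commands of the next sections here ...
$ # detach with Ctrl-a d ; the copies keep running ; re-attach later with :
$ screen -r gridcopies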
Copying a T3 /pnfs dir into another T3 /pnfs dir (a use case requested by users just once)
First of all we generate the globus-url-copy commands to be passed as input to GNU parallel and save them into the file tobecopied; afterwards we start them in parallel. We can arbitrarily choose how many parallel globus-url-copy commands to run via the GNU parallel parameter -j N; each globus-url-copy command will consume a CPU core on the server where you're running it, so don't set a -j value greater than the number of CPU cores available there:
$ uberftp -ls -r gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/cgalloni//RunII/Ntuple_080316/ | grep '\.root$' | awk '{print "globus-url-copy -v -cd gsiftp://t3se01.psi.ch/"$8" gsiftp://t3se01.psi.ch/"$8}' | sed 's/cgalloni/mdefranc/2' > tobecopied
$ # 10 parallel globus-url-copy
$ cat tobecopied | parallel -j 10
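Since it's safe to re-run the same globus-url-copy commands, an optional refinement is to let GNU parallel keep a job log, so that after an interruption only the unfinished or failed commands are executed again; a minimal sketch, assuming a reasonably recent GNU parallel providing --joblog and --resume-failed:
$ cat tobecopied | parallel -j 10 --joblog copies.log
$ # after an interruption, re-run only the jobs that failed or never completed
$ cat tobecopied | parallel -j 10 --joblog copies.log --resume-failed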
Copying a T2 /pnfs dir into a T3 /pnfs dir (a recurring use case)
Because this time the source site is different from the destination site, we can increase the GNU parallel parameter from -j 10 to, for instance, -j 30; for a copy from a T1/T2 to a T2 you might set -j 50; regrettably it's impossible for an ordinary user to compute the optimal -j value. Again, you might want to start the copies inside a screen -L session, but it's not compulsory.
$ uberftp -ls -r gsiftp://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/cgalloni/Ntuple_290216/WJetsToQQ_HT-600ToInf_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/ | grep '\.root$' | awk '{print "globus-url-copy -v -cd gsiftp://storage01.lcg.cscs.ch//"$8" gsiftp://t3se01.psi.ch/"$8}' | sed 's/cgalloni/mdefranc/2' | sed 's/lcg.cscs.ch/psi.ch/3' > tobecopied
$ # 30 parallel globus-url-copy
$ cat tobecopied | parallel -j 30
--
NinaLoktionova - 2018-11-20