Basic understanding of dCache for advanced users (archived pages)

The Storage Element (SE) t3se01.psi.ch runs dCache, a Grid storage middleware that transparently combines the space provided by tens of fileservers into a single namespace called /pnfs. On top of it, dCache offers the Grid protocols dcap, gsidcap, root, srm and gsiftp, so that Grid tools such as lcg-cp, xrdcp, srm-cp, dccp, gfal-copy, ... can upload and download files to and from this single namespace.

Whenever a new file is written to /pnfs, either by a T3 user or by the PhEDEx service, dCache randomly selects a filesystem to host it, regardless of the Grid protocol, the Grid tool, the user or the server used for the upload. Typically more than one file is written into a /pnfs subdir, so the files inside that subdir end up randomly spread over all the available filesystems; this distribution balances the load across the filesystems and avoids I/O bottlenecks. Over time newer, bigger and faster filesystems and fileservers replace their older peers, but the dCache administrator performs these operations transparently behind the scenes. T3 users won't notice this maintenance, but they will be affected by a dCache software upgrade or by a major fault on a fileserver.

dCache is not the first middleware to aggregate heterogeneous fileservers (see e.g. Gluster or OrangeFS), nor is it probably the best one (e.g. it can't split a file into distributed chunks as Gluster does), but it supports the Grid context very well (VOMS authorizations, X509 management, Space Token support, Grid protocols, ...), so it's a good choice for our specific needs.

The versatility of Grid protocols and Grid tools offered by a dCache setup often confuses new T3 users: it is not always well integrated with third-party software like ROOT / hadd / CMSSW; it behaves differently depending on whether the file access comes from the T3 LAN or from a remote Internet site (WAN access), and on whether the CMSSW or CRAB environment is loaded; and requests are processed under different policies set by the T3 admins. A non-negligible learning period is therefore required to grasp all these protocols, tools and corner cases.

To use the T3 SE service at its best, at least a basic understanding of the dCache internals is needed (a comparable effort is needed to use a new batch system properly). The basic unit of a dCache setup is a single filesystem, necessarily hosted inside a single fileserver. By its nature, each filesystem can sustain a certain number of concurrent streaming operations, like downloading a 1GB .root file, and a higher number of concurrent interactive operations, like opening a .root file from a batch job to read a fraction of it, do some computing, and after a while read another fraction. To differentiate these I/O cases dCache offers a FIFO I/O queue system per filesystem. It's up to the dCache administrator to select a reasonable threshold for both the streaming and the interactive cases; at T3 those are max 4 streaming operations and max ~100 concurrent interactive operations. Further I/O requests are queued in their specific I/O queue and won't start until an I/O slot becomes available. A T3 user will notice these "stuck" cases because his/her file request won't start as usual. If so, write immediately to cms-tier3 AT lists.psi.ch, because 99% of the time that will be an error.

More than one Grid protocol can be mapped to the same I/O queue by the dCache administrator; for instance at T3 the dcap and gsidcap Grid protocols use the same interactive I/O queue, regular. Such an overlap is made to mitigate the lack of a comprehensive I/O queue limits system, since it implicitly implements the limit "max 100 dcap OR gsidcap connections" in the I/O queue regular associated with a single filesystem. Ideally a dCache administrator would instead create an I/O queue for each Grid protocol and define a list of constraints involving more than one I/O queue.
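The slot-and-queue behaviour described above can be illustrated with a small arithmetic sketch. The function queue_state is a hypothetical helper written for this page, not a dCache command; it only shows how a number of concurrent requests splits into active and queued ones for a given per-queue limit on one filesystem.

```shell
# Hypothetical helper (not part of dCache): given the max active slots of
# one I/O queue and the number of concurrent requests, report how many
# requests run and how many wait in the FIFO queue.
queue_state() (
    max_active=$1
    requests=$2
    if [ "$requests" -le "$max_active" ]; then
        active=$requests
    else
        active=$max_active
    fi
    queued=$((requests - active))
    echo "active=$active queued=$queued"
)

queue_state 4 7      # 7 streaming requests against the 4-slot limit
queue_state 100 60   # 60 interactive requests against the ~100-slot limit
```

With the T3 thresholds, 7 simultaneous streaming requests on one filesystem leave 3 of them waiting, while 60 interactive requests all start immediately.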

Presently dCache CAN'T enforce limits like :

  1. max active I/O slots per filesystem involving all the several I/O queues using that filesystem ; it's only possible to define an isolated max active I/O slots per I/O queue
  2. max active I/O user slots for a specific I/O queue
  3. max active I/O user slots for all the I/O queues with the same name
  4. max active I/O user slots for all the I/O queues
  5. max active I/O user space in /pnfs
All these unenforceable limits mean that a single misbehaving user can affect the whole T3 SE service; case 2 in particular has occurred more than once in the past. Only the T3 administrators can fix these cases, usually by identifying the culprit, killing his/her computational jobs and explaining what was wrong.

The I/O queues system and the Grid protocols mapping

| Grid protocol | t3server | Filesystem I/O queue | Max active slots for that I/O queue | Grid protocol/t3server endpoint reachable from Internet? |
| dcap | t3se01.psi.ch | regular | 100 | No |
| gsidcap | t3se01.psi.ch | regular | 100 | No |
| root | t3se01.psi.ch | wan | 4 | Yes |
| gsiftp | t3se01.psi.ch | wan | 2 | Yes |
| srm ( i.e. again gsiftp ) | t3se01.psi.ch | wan | 2 | Yes |
| dcap | t3dcachedb03.psi.ch | none | 0 | No |
| gsidcap | t3dcachedb03.psi.ch | none | 0 | No |
| root | t3dcachedb03.psi.ch | regular | 100 | No |
| gsiftp | t3dcachedb03.psi.ch | none | 0 | No |
| srm ( i.e. again gsiftp ) | t3dcachedb03.psi.ch | none | 0 | No |
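
As a quick reference, the mapping in the table above can be written as a lookup function. This is only a sketch: t3_io_queue is a hypothetical name, and the values are copied from the table as published here, so they may not match the current T3 configuration.

```shell
# Hypothetical lookup of the table above: prints "<I/O queue> <max active slots>"
# for a given Grid protocol and t3server, or fails for unknown combinations.
t3_io_queue() (
    protocol=$1
    server=$2
    case "$server:$protocol" in
        t3se01.psi.ch:dcap|t3se01.psi.ch:gsidcap)   echo "regular 100" ;;
        t3se01.psi.ch:root)                         echo "wan 4" ;;
        t3se01.psi.ch:gsiftp|t3se01.psi.ch:srm)     echo "wan 2" ;;
        t3dcachedb03.psi.ch:root)                   echo "regular 100" ;;
        t3dcachedb03.psi.ch:dcap|t3dcachedb03.psi.ch:gsidcap) echo "none 0" ;;
        t3dcachedb03.psi.ch:gsiftp|t3dcachedb03.psi.ch:srm)   echo "none 0" ;;
        *) echo "unknown" >&2; exit 1 ;;
    esac
)

t3_io_queue gsidcap t3se01.psi.ch
t3_io_queue root t3dcachedb03.psi.ch
```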

With the following watch command (to be executed on a t3ui server) we can observe both the filesystems and their several I/O queues. The Movers column reports the sums of the Restores, Stores, P2P-Server, P2P-Client, regular, wan and xrootd Active/Max/Queued counters; T3 users are affected only by the regular, wan and xrootd traffic:

$ watch --interval=1 --differences 'lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo  | grep -v __________ | grep -v ops ' 


Copying a dir between two GridFTP servers - serial method

The globus-url-copy tool can copy a single file, several files, or recursively (but serially) a whole dir from one GridFTP server to another. The file transfer occurs directly between the two GridFTP servers. You have to know the absolute paths on both the sender and the receiver side. In the next example we're going to copy the dir:
  • gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
into:
  • gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/
The path prefix /gpfs/ddn/srm/cms/ was discovered through a uberftp gsiftp://stormgf2.pi.infn.it session; if you're in doubt, contact the T3 administrators and we'll help you identify this kind of prefix. At T3 and T2 the absolute path prefixes are always /pnfs/psi.ch/cms and /pnfs/lcg.cscs.ch/cms respectively.
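
For the two local sites, building the absolute gsiftp URL from the site prefix can be sketched as follows. The function gsiftp_url and its "t3"/"t2" labels are hypothetical; the host names and prefixes are the ones used on this page (storage01.lcg.cscs.ch is taken from the T2 example further down), and the path passed in is assumed to be relative to the site prefix.

```shell
# Hypothetical helper: build the absolute gsiftp URL for a path given
# relative to the site prefix (/pnfs/psi.ch/cms at T3, /pnfs/lcg.cscs.ch/cms at T2).
gsiftp_url() (
    site=$1   # "t3" or "t2" (labels chosen for this sketch)
    rel=$2    # e.g. trivcat/store/user/<username>/<dir>/
    case "$site" in
        t3) host=t3se01.psi.ch;         prefix=/pnfs/psi.ch/cms ;;
        t2) host=storage01.lcg.cscs.ch; prefix=/pnfs/lcg.cscs.ch/cms ;;
        *)  echo "unknown site: $site" >&2; exit 1 ;;
    esac
    # Drop one leading slash from the relative path to avoid a doubled "//".
    echo "gsiftp://$host$prefix/${rel#/}"
)

gsiftp_url t3 trivcat/store/user/martinelli_f/VHBBHeppyV12/
```

For any other site the prefix still has to be discovered, for instance through a uberftp session as described above.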

The dir copy example:

$ globus-url-copy -continue-on-error -rst -nodcau -fast -vb -v -cd -r gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/ gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/

Source: gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
Dest:   gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/

Source: gsiftp://stormgf2.pi.infn.it/gpfs/ddn/srm/cms/store/user/arizzi/VHBBHeppyV12/
Dest:   gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/martinelli_f/VHBBHeppyV12/

Copying a dir between two GridFTP servers by GNU parallel

The tools globus-url-copy, uberftp and GNU parallel can be used together to copy a dir between two GridFTP servers in parallel; in this example a C.Galloni /pnfs dir is copied into a MDefranc /pnfs dir. No files are routed through the server running the globus-url-copy commands (e.g. your UI, or a WN). Furthermore, since in a Grid environment each GridFTP server often acts as a transparent proxy to more than one GridFTP server, the copies occur between a 2x2 matrix of GridFTP servers; a bottleneck in the parallelism is more likely to come from the limited bandwidth available between the 2 data centres than from the total number of GridFTP servers involved. It's not compulsory, but we recommend running all the globus-url-copy commands in a screen -L session, so that the copies aren't interrupted just because the connection to the server where you started them is cut; in any case it's safe to repeat the same globus-url-copy commands over and over again.
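
Since repeating the same command is safe, a failed copy can simply be retried. The following is a minimal sketch of such a wrapper; retry is a hypothetical helper defined here, not an option of globus-url-copy or GNU parallel, and the number of attempts is an arbitrary choice.

```shell
# Hypothetical helper: run a command up to N times, stopping at the first success.
# Note: attempts and n are plain (global) shell variables in this sketch.
retry() {
    attempts=$1
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $n of $attempts failed: $*" >&2
        n=$((n + 1))
    done
    return 1
}

# Usage (placeholders, not real URLs):
# retry 3 globus-url-copy -v -cd <source-url> <destination-url>
```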
Copying a T3 /pnfs dir into another T3 /pnfs dir ( use case requested by the users just once )
First we generate the globus-url-copy commands to be passed as input to GNU parallel and save them into the file tobecopied; afterwards we start them in parallel. The number of parallel globus-url-copy commands can be chosen freely with the GNU parallel parameter -j N; each globus-url-copy command consumes a CPU core on the server where you run it, so don't set -j greater than the number of CPU cores available there:
$ uberftp -ls -r gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/cgalloni//RunII/Ntuple_080316/ | grep '\.root$' | awk '{ print "globus-url-copy -v -cd gsiftp://t3se01.psi.ch/"$8" gsiftp://t3se01.psi.ch/"$8 }' | sed 's/cgalloni/mdefranc/2' > tobecopied
$ # run 10 globus-url-copy commands in parallel
$ parallel -j 10 < tobecopied
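
The advice about not exceeding the number of CPU cores can be sketched as a small guard. cap_jobs is a hypothetical helper written for this page; in the usage line the core count is assumed to come from nproc (GNU coreutils).

```shell
# Hypothetical helper: return the requested -j value, capped at the number of cores.
cap_jobs() (
    requested=$1
    cores=$2
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
)

# Usage on the server where the copies are started:
# parallel -j "$(cap_jobs 10 "$(nproc)")" < tobecopied
```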

Copying a T2 /pnfs dir into a T3 /pnfs dir ( recurring use case )
Because this time the source site differs from the destination site, we can increase the GNU parallel parameter from -j 10 to, for instance, -j 30; for a copy from a T1/T2 to a T2 you might set -j 50. Regrettably an ordinary user can't compute the correct -j value. Again you might want to start the copies in a screen -L session, but it's not compulsory.
$ uberftp -ls -r gsiftp://storage01.lcg.cscs.ch//pnfs/lcg.cscs.ch/cms/trivcat/store/user/cgalloni/Ntuple_290216/WJetsToQQ_HT-600ToInf_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/ | grep '\.root$' | awk '{ print "globus-url-copy -v -cd gsiftp://storage01.lcg.cscs.ch//"$8" gsiftp://t3se01.psi.ch/"$8 }' | sed 's/cgalloni/mdefranc/2' | sed 's/lcg\.cscs\.ch/psi.ch/3' > tobecopied
$ # run 30 globus-url-copy commands in parallel
$ parallel -j 30 < tobecopied
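
The two sed substitutions above rely on occurrence numbers (/2 and /3) so that only the destination URL is rewritten. The same rewrite can be sketched more explicitly by building the destination path on its own. make_copy_cmd is a hypothetical helper; it assumes the listed path is absolute and starts with /pnfs/lcg.cscs.ch/, and that the user directory sits under /store/user/.

```shell
# Hypothetical helper: given an absolute T2 path and the source/destination
# user names, print the globus-url-copy command for a T2 -> T3 copy.
make_copy_cmd() (
    src_path=$1   # e.g. /pnfs/lcg.cscs.ch/cms/trivcat/store/user/<src_user>/...
    src_user=$2
    dst_user=$3
    # Rewrite only the destination: site prefix first, then the user dir.
    dst_path=$(printf '%s\n' "$src_path" |
        sed -e 's#^/pnfs/lcg\.cscs\.ch/#/pnfs/psi.ch/#' \
            -e "s#/store/user/$src_user/#/store/user/$dst_user/#")
    echo "globus-url-copy -v -cd gsiftp://storage01.lcg.cscs.ch$src_path gsiftp://t3se01.psi.ch$dst_path"
)

make_copy_cmd /pnfs/lcg.cscs.ch/cms/trivcat/store/user/cgalloni/a.root cgalloni mdefranc
```

The source URL is left untouched, so a source path that happens to contain the destination user name is not affected.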

-- NinaLoktionova - 2018-11-20

Topic revision: r2 - 2018-11-22 - NinaLoktionova