Basic understanding of dCache for advanced users (page created by Joosep)

The Storage Element (SE) t3se01.psi.ch runs dCache, a Grid storage middleware that transparently combines the space made available by tens of fileservers into a single namespace called /pnfs; on top of it, dCache offers the Grid protocols dcap, gsidcap, root, srm and gsiftp so that Grid tools like lcg-cp, xrdcp, srmcp, dccp, gfal-copy, ... can upload/download files into/from this single namespace.

Whenever a new file gets written into /pnfs, either by a T3 user or by the PhEDEx service, dCache randomly selects a filesystem to host that new file, no matter the Grid protocol, the Grid tool, the user or the server used for the upload; typically more than a single file is written into a /pnfs subdir, and accordingly all the files inside that subdir will be randomly spread over all the available filesystems. This file distribution implements load-balancing among all the available filesystems and avoids I/O bottlenecks.

Over time newer, bigger and faster filesystems and fileservers replace their older peers, but all these operations are transparently performed behind the scenes by a dCache administrator; the T3 users won't notice these maintenances, although they will be affected by a dCache SW upgrade or by a major fault occurring on a fileserver.

dCache is not the first middleware aggregating heterogeneous fileservers together (e.g. look at Gluster or OrangeFS), nor probably the best one (e.g. it can't split a file into distributed chunks like Gluster), but it supports the Grid context very well (VOMS authorizations, X509 management, Space Token support, Grid protocols, ...), so it's a good choice for our specific needs.
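
For instance, all of the following commands write into the same /pnfs namespace, and dCache alone decides which filesystem physically hosts the file. A minimal sketch, assuming a valid Grid proxy, the usual T3 user area under /pnfs/psi.ch/cms/trivcat/store/user/ and a hypothetical file name test.root (the SRM door details, port 8443 and /srm/managerv2, are the standard dCache ones):

$ # upload through the gsiftp door
$ gfal-copy file:///tmp/test.root gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root
$ # the same upload through the srm door; quoted because of the ? in the URL
$ gfal-copy file:///tmp/test.root "srm://t3se01.psi.ch:8443/srm/managerv2?SFN=/pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root"

Either way, the file lands on whichever filesystem dCache picks.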

The Grid protocols/Grid tools versatility offered by a dCache setup often confuses new T3 users: it's not always well integrated with 3rd-party SW like ROOT / hadd / CMSSW, and it behaves differently depending on whether the file access comes from the T3 LAN or from a remote Internet site (WAN access), whether the CMSSW environment is loaded or not, and whether the CRAB environment is loaded or not; on top of that, it's processed with different policies set by the T3 admins. A non-negligible learning period is therefore required to grasp all these protocols, tools and corner cases.
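
As a concrete example of the LAN/WAN asymmetry, a sketch with a hypothetical file path: the dcap door only answers inside the T3 LAN (e.g. from a t3ui machine), while the root (xrootd) door is also exposed to the Internet:

$ # LAN only: dcap works from the t3ui machines but not from outside PSI
$ dccp dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root /tmp/test.root
$ # LAN or WAN: the root door is reachable from remote sites too
$ xrdcp root://t3se01.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root /tmp/test.root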

To use the T3 SE service at its best, at least a basic understanding of the dCache internals is needed (a comparable effort is needed to properly use a new batch system). The basic unit of a dCache setup is a single filesystem, necessarily hosted inside a single fileserver; by its nature, each filesystem can sustain a certain number of concurrent streaming operations, like downloading a 1GB .root file, and a higher number of concurrent interactive operations, like opening a .root file from a batch job to read a fraction of it, do some computing, and after a while read another fraction. To differentiate these I/O cases dCache offers a FIFO I/O queue system per filesystem. It's up to the dCache administrator to select a reasonable threshold for both the streaming and the interactive cases; at T3 those are max 4 concurrent streaming operations and max ~100 concurrent interactive operations. Further I/O requests will get queued in their specific I/O queue and won't start until an I/O slot becomes available. A T3 user will notice these "stuck" cases because his/her file request won't start as usual; if so, write immediately to cms-tier3 AT lists.psi.ch, because 99% of the time that indicates an error.

More than one Grid protocol can be mapped to the same I/O queue by the dCache administrator; for instance, at T3 the dcap and gsidcap Grid protocols use the same interactive I/O queue "regular". Such an overlap mitigates the lack of a comprehensive I/O queue limits system, since it implicitly implements the limit "max 100 dcap OR gsidcap connections" in the I/O queue "regular" associated with a single filesystem; ideally a dCache administrator would instead create an I/O queue for each Grid protocol and define a list of constraints involving more than a single I/O queue.
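
To make the streaming vs. interactive distinction concrete, a sketch with a hypothetical 1GB file big.root (ROOT needs its dCache support loaded, e.g. from a CMSSW environment, for the dcap URL to work): the first command holds one streaming slot until the whole copy finishes, while the second holds one interactive slot and reads small chunks on demand:

$ # streaming: occupies one slot in the "wan" I/O queue for the whole transfer
$ xrdcp root://t3se01.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/$USER/big.root /scratch/$USER/big.root
$ # interactive: occupies one slot in the "regular" I/O queue, data is read on demand
$ root -l "dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/big.root"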

Presently dCache CAN'T enforce limits like:

  1. max active I/O slots per filesystem across all the several I/O queues using that filesystem; it's only possible to define an isolated max active I/O slots limit per I/O queue
  2. max active I/O user slots for a specific I/O queue
  3. max active I/O user slots for all the I/O queues with the same name
  4. max active I/O user slots for all the I/O queues
  5. max active I/O user space in /pnfs
All these missing limits mean that a single misbehaving user can globally affect the T3 SE service; case 2 in particular has occurred more than once in the recent past. Only the T3 administrators are able to fix these cases, usually by identifying the culprit, killing his/her computational jobs and explaining what was wrong; from the user side, the best defence is to self-limit concurrency, as sketched below.
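
Since dCache can't enforce these per-user limits server-side, a user-side precaution is to cap the number of parallel transfers fired at the SE. A minimal sketch, assuming a hypothetical filelist.txt containing one gsiftp:// source URL per line and a hypothetical /scratch destination (modern gfal-copy appends the source basename when the destination ends with /):

$ # run at most 2 gfal-copy processes at a time instead of hundreds at once
$ xargs -P2 -I{} gfal-copy {} file:///scratch/$USER/ < filelist.txt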

The I/O queue system and the Grid protocol mapping

| Grid protocol | t3server | Filesystem I/O queue | Max active slots for that I/O queue | Grid protocol/t3server endpoint reachable from Internet? |
| dcap | t3se01.psi.ch | regular | 100 | No |
| gsidcap | t3se01.psi.ch | regular | 100 | No |
| root | t3se01.psi.ch | wan | 4 | Yes |
| gsiftp | t3se01.psi.ch | wan | 2 | Yes |
| srm (i.e. again gsiftp) | t3se01.psi.ch | wan | 2 | Yes |
| dcap | t3dcachedb03.psi.ch | none | 0 | No |
| gsidcap | t3dcachedb03.psi.ch | none | 0 | No |
| root | t3dcachedb03.psi.ch | regular | 100 | No |
| gsiftp | t3dcachedb03.psi.ch | none | 0 | No |
| srm (i.e. again gsiftp) | t3dcachedb03.psi.ch | none | 0 | No |
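
The last column matters in practice: a door marked Yes can be used with a valid Grid proxy from anywhere, not only from the t3ui machines. For example (hypothetical directory), listing one's own /pnfs area through the gsiftp door also works from a remote site:

$ gfal-ls -l gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/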

With the following watch command (to be executed on a t3ui server) we can observe both the filesystems and their several I/O queues; the Movers column reports the sums of the Restores, Stores, P2P-Server, P2P-Client, regular, wan and xrootd Active/Max/Queued counters; each T3 user is affected only by the regular, wan and xrootd traffic:

$ watch --interval=1 --differences 'lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo  | grep -v __________ | grep -v ops ' 
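
To follow a single pool instead of the whole table, a grep for its name can be appended; the pool name t3fs07_cms below is a hypothetical example, the real names appear in the first column of the queueInfo page:

$ lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo | grep t3fs07_cms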

-- NinaLoktionova - 2018-11-20
