A basic understanding of dCache for advanced users (page created by Joosep)
The Storage Element ( SE ) t3se01.psi.ch runs dCache, a Grid storage middleware which transparently combines the space made available by tens of fileservers into a single namespace called /pnfs ; on top of it dCache offers the Grid protocols dcap gsidcap root srm gsiftp in order to allow the Grid tools lcg-cp xrdcp srm-cp dccp gfal-copy ... to upload/download files into/from this single namespace. Whenever a new file gets written into /pnfs , either by a T3 user or by the PhEDEx service, dCache randomly selects a filesystem to host that new file, no matter which Grid protocol, Grid tool, user or server was used for the upload ; typically more than a single file is written into a /pnfs subdir, and accordingly all the files inside that subdir will be randomly spread over all the available filesystems ; this file distribution implements a load-balancing among all the available filesystems and avoids any I/O bottleneck. Over time newer, bigger and faster filesystems and fileservers replace their older peers, but all these operations are transparently performed behind the scenes by a dCache administrator ; the T3 users won't notice these maintenances, but they will be affected by a dCache SW upgrade or by a major fault occurring on a fileserver.
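For instance, uploading a local file into /pnfs with gfal-copy through the gsiftp door looks like this ( a minimal sketch ; the destination path below is hypothetical and must be adapted to your own /pnfs user area ) :

$ # upload a local file into /pnfs via the gsiftp door ; dCache picks the hosting filesystem by itself
$ gfal-copy file:///tmp/test.root gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root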
dCache is not the first middleware aggregating heterogeneous fileservers together ( see e.g. Gluster or OrangeFS ), nor probably the best one ( e.g. it can't split a file into distributed chunks like Gluster does ), but it supports the Grid context very well ( VOMS authorizations, X509 management, Space Token support, Grid protocols, ... ), so it's a good choice for our specific needs.
The versatility of Grid protocols/tools offered by a dCache setup often confuses new T3 users since it's not always well integrated with 3rd-party SW like ROOT / hadd / CMSSW ; it behaves differently depending on whether the file access comes from the T3 LAN or from a remote Internet site ( WAN access ), whether the CMSSW environment is loaded or not, whether the CRAB environment is loaded or not, and it's processed with different policies set by the T3 admins, so a non-negligible learning period is required in order to grasp all these protocols, tools and corner cases.
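As a sketch of the LAN vs WAN difference ( the file path is hypothetical, and the exact doors and ports should be checked in the T3 documentation ), the same file can be opened in ROOT through different protocols :

$ # from a t3ui server inside the T3 LAN, via the dcap door
$ root -l dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root
$ # from a remote site over the WAN, via the xrootd door
$ root -l root://t3se01.psi.ch//pnfs/psi.ch/cms/trivcat/store/user/$USER/test.root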
In order to use the T3 SE service at its best, at least a basic understanding of the dCache internals is needed ( a comparable effort is needed to properly use a new batch system ) ; the basic unit of a dCache setup is a single filesystem, necessarily hosted inside a single fileserver ; by its nature, each filesystem can sustain a certain amount of concurrent streaming operations, like downloading a 1GB .root file, and a higher amount of concurrent interactive operations, like opening a .root file from a batch job to read a fraction of it, do some computing, and after a while read another fraction. To differentiate these I/O cases dCache offers a FIFO I/O queue system per filesystem. It's up to the dCache administrator to select a reasonable threshold both for the streaming and the interactive case ; at T3 those are max 4 concurrent streaming operations and max ~100 concurrent interactive operations. Further I/O requests get queued in their specific I/O queue and won't start until an I/O slot becomes available. A T3 user will notice these "stuck" cases because his/her file request won't start as usual ; if so, write immediately to cms-tier3 AT lists.psi.ch because 99% of the time that indicates an error.
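To make the streaming vs interactive distinction concrete, here is a minimal sketch ( file name and /pnfs path are hypothetical ) : copying a whole file out of dCache with dccp occupies a streaming slot, while opening it in place from ROOT occupies an interactive slot :

$ # streaming : the whole file leaves dCache in one go
$ dccp dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/big.root /scratch/big.root
$ # interactive : the file stays in dCache and only the needed fractions are read
$ root -l dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user/$USER/big.root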
More than one Grid protocol can be mapped to the same I/O queue by the dCache administrator ; for instance at T3 the dcap gsidcap Grid protocols use the same interactive I/O queue regular ; such an overlap is made to mitigate the lack of a comprehensive I/O queue limits system, since it implicitly implements the limit "max 100 dcap OR gsidcap connections" in the I/O queue regular associated to a single filesystem ; ideally a dCache administrator would instead create an I/O queue for each Grid protocol and define a list of constraints involving more than a single I/O queue.
Presently dCache CAN'T enforce limits like :
1. max active I/O slots per filesystem involving all the several I/O queues using that filesystem ; it's only possible to define an isolated max active I/O slots per I/O queue
2. max active I/O slots per user for a specific I/O queue
3. max active I/O slots per user for all the I/O queues with the same name
4. max active I/O slots per user for all the I/O queues
5. max space used per user in /pnfs
All these missing limits mean that a single misbehaving user can globally affect the T3 SE service ; especially case 2. has occurred more than once in the recent past. Only the T3 administrators are able to fix these cases, usually by identifying the culprit, killing his/her computational jobs and explaining what went wrong.
The I/O queue system and the Grid protocol mapping
With the following watch command ( to be executed on a t3ui server ) we can observe both the filesystems and their several I/O queues ; the Movers column reports the sum of the several Restores Stores P2P-Server P2P-Client regular wan xrootd Active/Max/Queued counters ; each T3 user is affected only by the regular wan xrootd traffic :
$ watch --interval=1 --differences 'lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo | grep -v __________ | grep -v ops '
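For a one-off snapshot instead of a continuously refreshing view, the same lynx dump can also be run directly :

$ lynx -dump -width=800 http://t3dcachedb.psi.ch:2288/queueInfo | grep -v __________ | grep -v ops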
--
NinaLoktionova - 2018-11-20