dCache documentation snippets from mails, etc.
Good external links to documentation and tools
useful Chimera DB commands
amount of used space of a given pool
select sum(isize) from t_inodes where ipnfsid in (select ipnfsid from t_locationinfo where ilocation='POOL-NAME');
check which files have more than one replica:
select count(ipnfsid), ipnfsid from t_locationinfo group by ipnfsid having count(ipnfsid) > 1;
log4j
Links for dcache log4j logging to SYSLOG
From Xavier Mol:
These log4j commands are not defined in the "normal" dCache cells. You have to cd into the System cell of the domain hosting the gridftp door cell. In your case this should look somewhat like this:
[t2-srm-02.lnl.infn.it] (local) admin > cd System@GFTP-t2-gftp-01Domain
# Now all log4j commands should be available:
[t2-srm-02.lnl.infn.it] (System@GFTP-t2-gftp-01Domain) admin > help log4j
log4j appender ls
log4j appender set -|OFF|FATAL|ERROR|WARN|INFO|DEBUG|ALL
log4j logger ls [-a]
log4j logger attach
log4j logger detach
GRIDFTP door dying with Invalid argument exception
Mail exchange between Ivano Talamo and Gerd Behrmann ("gridftp door dying". 2010-06-02):
25 May 2010 13:59:51 (GFTP-cmsrm-st12) [] Got an IO Exception ( closing
server ) : java.net.SocketException: Invalid argument
The nasty thing is that the service appears as running both in 'dcache status' and on the web cell status page, but nothing is listening on port 2811, and the only solution is to restart the service manually.
Both nodes are running dCache 1.9.-15.
* Gerd:
The typical reason for a SocketException with the cryptic "Invalid argument" message is that the process has reached its file descriptor limit. For both pools and doors it is important that you raise the file descriptor limit in the operating system. The default limit is often 1024, and that is simply not enough when you need a file descriptor per socket and per open file. To raise the limit, create the file /opt/d-cache/jobs/dcache.local.sh with the content:
ulimit -n 32000
This raises the limit before dCache is started (i.e. you need to restart the door for the change to take effect).
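As a quick sanity check, the limit actually in effect for a process can be queried programmatically; a minimal Python sketch (the 32000 threshold simply mirrors the ulimit value above):

```python
import resource

# Query the soft and hard limits on open file descriptors for the
# current process. Run this from the same environment that starts
# dCache to see what limit the JVM will inherit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))
if soft < 32000:
    print("warning: soft limit is below the 32000 suggested above")
```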
manually setting a replica location for a file
In the PnfsManager there are the following two commands:
add file cache location
clear file cache location
best practice to cleanly stop a GRIDFTP door
from Gerd Behrmann, 2009-10-22:
If all transfers go through the SRM, then all you need to do is to cd to the LoginBroker and run 'disable'. After that the SRM will stop generating TURLs for that door.
this was my original best guess:
Drain the door by setting login to zero for the door login manager cell in the admin shell:
cd GFTP-t3fs01
set max logins 0
One can see details about all transfers of that door by first listing its children
(GFTP-t3fs01) admin > get children
GFTP-t3fs01-Unknown-5671
GFTP-t3fs01-Unknown-5734
GFTP-t3fs01-Unknown-5676
GFTP-t3fs01-Unknown-5589
GFTP-t3fs01-Unknown-5817
GFTP-t3fs01-Unknown-5666
All these children are also well-known cells. One can enter them and use 'info' to get details:
(local) admin > cd GFTP-t3fs01-Unknown-5671
(GFTP-t3fs01-Unknown-5671) admin > info
FTPDoor
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=fronga/CN=482262/CN=Frederic Ronga
User Host : cithep213.ultralight.org
Local Host : t3fs01.psi.ch
Last Command : STOR //pnfs/psi.ch/cms/trivcat/store/user/fronga/ntuples/data/PhotonJet_Pt300/NTupleProducer_35X_MC35x_RECO_1_2.root
Command Count : 12
I/O Queue : wan
GFTP-t3fs01-Unknown-5671@gridftp-t3fs01Domain;p=GFtp-1;o=501/0;
15539;000200000000000000AF9F10;cithep213.ultralight.org;t3fs07_cms_1;mover 28581: receiving;99573947;
Mover status
A 'mover ls' in a pool cell will print a line like the following:
30670 A R {DCap-storage01-unknow-16060@dcap-storage01Domain:10} 00020000000000000346D4D8 h={SM={a=1950673918;u=1950673918};S=None} bytes=116930640 time/sec=24 LM=0
- A: (A)ctive or (W)aiting transfer
- R: (R)unning or (H)eld
- {DCap-storage01-unknow-16060@dcap-storage01Domain:10}
- 00020000000000000346D4D8: pnfs ID
- h={SM={a=1950673918;u=1950673918};S=None}
- bytes=116930640: Bytes read/sent up to this point
- time/sec=24: time since mover creation(?)
- LM=0
To which queue does a particular active mover belong?
on a pool you can do
mover ls
(note the pnfsid), then you can do:
queue ls queue -l
(and find out in which queue the pnfsid is)
Maybe there is an easier method; this is what comes to mind.
Dmitry Litvintsev
On the difference of setting a pool to readonly in the PoolManager and in a Pool Cell
From a mail by Gerd Behrmann, 2009-08-07:
When the rdonly status is set in the PoolManager, only PoolManager-controlled transfers are prevented from writing to the pools. The pools themselves are still write-enabled, and the migration module happily copies files to them. This is actually a feature: you can stop production transfers from writing to the pools, but still use the migration module to move files onto them.
As Lionel suggested, marking the pool rdonly on the pool itself will prevent all writes. The migration module will still select the pool as a destination, but the write will fail and other pools are selected instead. The source file will not be deleted/modified until the file has been successfully copied to another pool.
On the moving of pinned files and the different meanings of the sticky flag
Please note that the X in rep ls does not mean pinned. It means sticky. Pinned files have a sticky bit, yes, but pinning is not the only reason that a file is sticky. Disk-only files are cached + sticky, with the sticky flag being owned by "system". Pinned files on the other hand have a sticky flag owned by "PinManager-ID", where ID is a pin-specific ID.
What you did was to COPY all sticky files (no matter whether they are pinned or not).
The -pins=move option means that the sticky bits used for pinning (and only those) are moved to the target pool. This involves negotiating the move with the pin manager, as the pin manager database needs to be updated. All the other sticky bits were copied. It is because of the pin manager database that there is a special option for what to do with pins; sticky bits used for pinning are never copied, as that would break the pin manager.
For disk-only files with a sticky bit owned by "system", the source was untouched because you asked for a copy. Had you used a move command, the source files would have been removed.
In general, using 'migration move' when you want to move files is always better. It has safe defaults which guarantee that data has indeed been moved, and it has defaults for dealing with the moving of pins. Don't use 'migration copy' unless you actually want several copies of a file.
If what you really wanted to do was to move all disk only files, then the correct thing to do would be:
migration move -state=cached -sticky=system -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
If you have access latency and retention policy configured on your system you could also have done
migration move -al=ONLINE -rp=REPLICA -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
If on the other hand you only wanted to move cached tape file you could possibly have done
migration move -al=CUSTODIAL -rp=NEARLINE -state=cached -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
Cheers,
/gerd
Removing cached only files from a pool
Fr, 2009-06-26:
To remove cached files on a pool, simply type 'sweeper purge' at the pool's admin prompt. Cached copies will then be removed by dCache when space is needed.
Regards,
Tigran.
File Flags for Pool Files (P, C, X,...)
G. Behrmann:
- Since 1.8.0, PRECIOUS means "must go to tape".
- C is short for CACHED, but nowadays this just means "not precious" hence "this copy doesn't go to tape".
- The X means sticky. REPLICA+ONLINE files are marked CACHED+STICKY with the lifetime on the sticky flag being infinity.
- CUSTODIAL+NEARLINE files are marked PRECIOUS, but they become CACHED once the file has been flushed to tape. If the file is pinned afterwards, it becomes CACHED+STICKY, but with a lifetime on the sticky flag corresponding to the lifetime of the pin.
Even better information from the dcache trac: http://trac.dcache.org/projects/dcache/wiki/manuals/RepLsOutput
<CPCScsRDXEL>
|||||||||||
||||||||||+-- (L) File is locked (currently in use)
|||||||||+--- (E) File is in error state
||||||||+---- (X) File is pinned (aka "sticky")
|||||||+----- (D) File is in process of being destroyed
||||||+------ (R) File is in process of being removed
|||||+------- (s) File sends data to back end store
||||+-------- (c) File sends data to client (dcap,ftp...)
|||+--------- (S) File receives data from back end store
||+---------- (C) File receives data from client (dcap,ftp)
|+----------- (P) File is precious
+------------ (C) File is cached
LOCK-TIME :
The number of milliseconds this file will still be locked. Please note that this is an internal lock and not the pin time (SRM).
OPEN-COUNT :
Number of clients currently reading this file.
SIZE : File size
STORAGE-CLASS : The storage class of this file.
From a mail by Gerd Behrmann (2009-07-06):
- What has changed between 1.7 and 1.8 is that both tape and disk only files previously were marked precious; in 1.8 tape files are precious when they are not on tape yet, and REPLICA+ONLINE files are marked cached+sticky (the C+X combination). If you have not configured space management (or rather, haven't configured the proper tags for this in PNFS), disk only files are still precious. Files that were migrated to tape, recalled from tape, or copied to other pools due to load are cached (i.e. without the X).
- The sticky flag (the X in rep ls) is really a LIST of sticky flags. As soon as the list is non-empty, the file is considered sticky (i.e. the X is set) and will not be garbage collected. Each sticky flag in the list has an ID (aka owner) and a lifetime. The lifetime may be infinite (that's when you don't specify it). For REPLICA+ONLINE files the owner is 'system' and the lifetime is infinite (i.e. -1).
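The list semantics can be sketched in a few lines; a hypothetical Python model (flag entries are (owner, expiry) pairs, with -1 standing for the infinite lifetime described above):

```python
import time

# A replica's sticky state is a LIST of (owner, expiry) flags.
# The X in 'rep ls' is shown as soon as at least one flag is
# present and not yet expired; expiry == -1 means infinite.
def is_sticky(flags, now=None):
    now = time.time() if now is None else now
    return any(expiry == -1 or expiry > now for _owner, expiry in flags)
```

For example, `is_sticky([])` is False, while `is_sticky([("system", -1)])` is True, matching the REPLICA+ONLINE case with owner 'system' and infinite lifetime.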
Extending space token lifetime
This is not yet possible via an admin command; it requires direct modification of the database.
From a mail by Dmitry Litvintsev, 2009-07-24:
I agree that this is a deficiency. I can add this admin function, of course. Meanwhile, the easiest thing to do is to execute something like this in the database:
update srmspace set lifetime=10*lifetime where id=ID;
where ID is space reservation id in question.
Checksums
G. Behrmann: What you can do is to go to the pool containing the file. Then you issue 'csm check ' followed by 'csm info' to check the result.
The 'storageinfoof' command in the PoolManager lists, among other details, the checksum and its type.
flag-c=1:cbda338e
^ ^
| |
| check sum value
check sum type: 1 = adler32, 2 = md5, 3 = md4
dCache authorization
http://www.dcache.org/manuals/workshop-aachen-2009/4_AccessControl_dCache.pdf
explanation of how the SRM cell gathers information for evaluating an srmls request
From Dmitry Litvintsev on a request from Lionel Schwarz (2009-06-11, user-forum):
Hi Lionel,
SRM normally:
1) requests the storageinfo from the PnfsManager using the path,
2) requests the metadata of the parent directory,
3) checks that the srm user can read the file, based on the info from (1) and (2),
4) checks with the PoolManager whether the file is cached (sending storageinfo and pnfsid to it).
Request to PoolManager (4) contains:
DirectionType accessType=READ
String storeUnitName=storage_info.getStorageClass()+"@"+storage_info.getHsm()
String dCacheUnitName=storage_info.getCacheClass()
String protocolUnitName="*/*"
String netUnitName=srm server host
after that it sets the _locality_ following this logic:
if the file is cached and stored, locality is TFileLocality.ONLINE_AND_NEARLINE;
if the file is cached and not stored, locality is TFileLocality.ONLINE;
if the file is not cached and stored, locality is TFileLocality.NEARLINE;
if the file is not cached and not stored, locality is TFileLocality.UNAVAILABLE;
if the file is a directory, locality is TFileLocality.NONE.
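The decision table above can be written out as a small sketch (the string values mirror the TFileLocality constants quoted in the mail; this is an illustration, not SRM code):

```python
# Map a file's cached/stored state to its SRM locality,
# following Dmitry's description above.
def locality(is_directory, cached, stored):
    if is_directory:
        return "NONE"
    if cached and stored:
        return "ONLINE_AND_NEARLINE"
    if cached:
        return "ONLINE"
    if stored:
        return "NEARLINE"
    return "UNAVAILABLE"
```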
SRM configuration variables
From a message from Timur (Feb 2008):
SRM srmPrepareToGet and srmBringOnline requests are executed by threads in a pool; srmGetReqThreadPoolSize specifies the maximum number of such threads. When all threads are busy, the remaining requests are put on a queue. The maximum number of elements in that queue is specified by srmGetReqThreadQueueSize. Once the files are prepared for reading, permissions have been verified and the files have been staged, the file status is changed to Ready and the TURL is given to the user. In order to limit the load on the system, the maximum number of such requests is limited to srmGetReqMaxReadyRequests. The remaining requests that are almost ready, except that all transfer slots (controlled by srmGetReqMaxReadyRequests) are occupied, are put on the ready queue; the maximum length of that queue is controlled by srmGetReqReadyQueueSize. If the execution of a request fails with a non-fatal error, the request is retried after the retry timeout; the timeout in milliseconds is controlled by srmGetReqRetryTimeout. If the request execution has been retried srmGetReqMaxNumberOfRetries times and still fails, the request is failed and the error is propagated to the client. In order to implement fairness, there is the parameter srmGetReqMaxNumOfRunningBySameOwner: if that many requests by the same user are already running, the next request in the queue belongs to the same user again, and there are requests further down the queue that belong to a different user, then those requests will be executed instead.
Almost the same meaning applies to the put and copy variables; they control the execution of srmPrepareToPut and srmCopy requests.
New PnfsManager configuration options
From a user forum mail from Gerd Behrmann 2009-05-26
Gerard Bernabeu wrote:
> Hello,
>
> recently dCacheSetup added a few new parameters and I'd like to know
> about your experience with them and what values you've there.
>
> *pnfsNumberOfThreadGroups *
>
> Thanks to the helpful explanation this is not hard to tune to "ps faux |
> grep postgres | grep -v grep | grep data[0-9] | wc -l".
> Have you tried/benchmarked it?
What happens is that a request for database n is assigned to thread
group n % pnfsNumberOfThreadGroups (i.e. database id modulo number of
thread groups). You want to choose a number such that the busy databases
get assigned to different thread groups.
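The assignment rule can be illustrated with a toy sketch (database ids and group count are made up for the example):

```python
# A request for database n lands in thread group n % number_of_groups,
# per Gerd's explanation above.
def thread_group(db_id, n_groups):
    return db_id % n_groups

# With 16 databases and 10 thread groups, databases 0 and 10 share
# group 0, 1 and 11 share group 1, and so on; two heavily used
# databases whose ids differ by a multiple of the group count would
# end up contending for the same thread group.
```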
>
> *pnfsNumberOfLocationThreads*
>
> If we've ~10 pnfsNumerOfThreadGroups, what do you think the ideal value
> is here? 2? 10?
> Have you tried/benchmarked it?
2-3 should be just fine. The companion lookups are really fast, so there
is likely no need to have a lot of queues for this. The important part
is to get the requests moved away from the regular PNFS queues.
>
> *pnfsNumberOfThreads*
>
> With pnfsNumberOfLocationThreads and pnfsNumberOfThreadGroups disabled
> (by default) we're currently using 10 here, and so far it works OK
> (besides no serious benchmarking done).
> If we set pnfsNumerOfThreadGroups to 16 (current number of DBs at PIC),
> how much would you lower pnfsNumberOfThreads?
I would probably go for 2 to 4 threads per thread group. A PNFS database
is single threaded, so it cannot process more than one request at a time
anyway. Using more than one thread makes sense because you can achieve a
little concurrency in the internal processing in PnfsManager and also in
the pnfsd frontend in front of the PNFS database.
Please make sure you got enough pnfsd instances - otherwise having so
many threads in the PNFS manager doesn't make much sense.
> *pnfsFolding*
>
> This looks like a safe improvement to me, have you tried this new
> option? Any benchmarks?
It is used at NDGF, FZK and Fermi and seems safe.
>
> *log slow threshold*
>
> Any info on what it does? Any experience?
Same as the similarly named setting in postgresql: once a request in
PnfsManager takes more than the time you specified, the request is
logged in the log file.
We don't have any benchmarks at all. If you do benchmarking, then notice
that many of the above changes have no effect until the system gets
under high load.
E.g. pnfs folding will not fold anything unless you got lots of queuing
in PnfsManager and multiple requests for the same meta data (e.g.
multiple uploads to the same directory).
Thread groups have no effect unless you got lots of requests to
different PNFS databases (in that case you should observe increased
concurrency) - and only if you actually got enough pnfsd instances idle
(which may not be the case if you got lots of FTP doors doing listing at
the same time).
Notice that 1.9.0-10, 1.9.1-5 to -7, and all 1.9.2 earlier than -6 have
a bug in the thread group implementation: This bug causes an increased
load on the pnfs backend. You do however suffer this overhead no matter
whether you have a single thread group or several thread groups.
Automatic Replication of hot files
There is a newer article by Paul Millar on this: Improved Pool to pool tuning in 1.9.5
Mail from antonio.delgado.peris , 2010-02-18
We have p2p oncost activated and it seems to work (or it did before our just-finished upgrade). Before getting it to work we had to perform some tests and read the documentation carefully (that part was not trivial...) to understand how it really works. So, if nothing has changed, I think the rules are:
- The p2p=2.0 value means that the pool to pool replication will start when the load factor on the pool is greater than 2.0.
- For a read/write pool (no matter whether you have tape or not), that load factor is calculated as:
[sum of all transfers (active + waiting) in (read, write, restore)] / (3 * max-transfers-allowed)
- For a read-only pool it would be:
[sum of all transfers (active + waiting) in (read, restore)] / (2 * max-transfers-allowed)
So, for p2p=2.0 and a maximum of 100 transfers, a p2p transfer would start once you have more than 600 concurrent transfers of any kind on a given pool, which may well never have occurred so far. As a reference, at our site the p2p factor is set to 0.1.
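Antonio's formulas can be checked with a small sketch (function and parameter names are illustrative, not dCache configuration keys):

```python
# Load factor that is compared against the p2p threshold, per the
# rules above: (active + waiting transfers) divided by
# (number of relevant queues * max transfers allowed).
def p2p_load(reads, writes, restores, max_transfers, read_only=False):
    if read_only:
        # read-only pools: only read and restore queues count
        return (reads + restores) / (2.0 * max_transfers)
    return (reads + writes + restores) / (3.0 * max_transfers)

# Example from the mail: with max 100 transfers, 600 concurrent
# transfers on a read/write pool give a load factor of exactly 2.0.
```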
Cheers,
Antonio.
viewing the pool cost calculation
The PoolManager cell prints out a cost calculation for every pool to its pinboard:
[storage02.lcg.cscs.ch] (PoolManager) admin > show pinboard
10.02.49 [writeHandler] [] queryPoolsForCost : costModule : ibm02_data7_atlas (0) ibm02_data7_atlas={Tag={{hostname=se42}};size=0;SC=2.1787723858082185E-5;CC=0.0050;}
10.02.49 [writeHandler] [] queryPoolsForCost : costModule : ibm01_data2_atlas (0) ibm01_data2_atlas={Tag={{hostname=se41}};size=0;SC=2.2317136073855644E-5;CC=0.0;}
-- DerekFeichtinger - 12 May 2009