dCache documentation snippets from mails, etc.
Good external links to documentation and tools
useful Chimera DB commands
amount of used space of a given pool
select sum(isize) from t_inodes where ipnfsid in (select ipnfsid from t_locationinfo where ilocation='POOL-NAME');
check which files have more than one replica:
select count(ipnfsid), ipnfsid from t_locationinfo group by ipnfsid having count(ipnfsid) > 1;
log4j
Links for dcache log4j logging to SYSLOG
From Xavier Mol:
These log4j commands are not defined in the "normal" dCache cells. You have to cd into the System cell of the domain hosting the gridftp door cell. In your case this should look somewhat like this:
[t2-srm-02.lnl.infn.it] (local) admin > cd System@GFTP-t2-gftp-01Domain
# Now all log4j commands should be available:
[t2-srm-02.lnl.infn.it] (System@GFTP-t2-gftp-01Domain) admin > help log4j
log4j appender ls
log4j appender set -|OFF|FATAL|ERROR|WARN|INFO|DEBUG|ALL
log4j logger ls [-a]
log4j logger attach
log4j logger detach
GRIDFTP door dying with Invalid argument exception
Mail exchange between Ivano Talamo and Gerd Behrmann ("gridftp door dying". 2010-06-02):
25 May 2010 13:59:51 (GFTP-cmsrm-st12) [] Got an IO Exception ( closing
server ) : java.net.SocketException: Invalid argument
The nasty thing is that the service appears as running both in 'dcache status' and on the web cell status page, but nothing is listening on port 2811, and the only solution is to restart the service manually.
Both nodes are running dCache 1.9.-15.
* Gerd:
The typical reason for a SocketException with the cryptic "Invalid argument" message is that the process has reached its file descriptor limit. For both pools and doors it is important that you raise the file descriptor limit in the operating system. The default limit is often 1024, and that is simply not enough when you need a file descriptor per socket and per open file. To raise the limit, create the file /opt/d-cache/jobs/dcache.local.sh with the content:
ulimit -n 32000
This raises the limit before dCache is started (i.e. you need to restart the door for the change to take effect).
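As a quick sanity check, the limit actually in effect for a process can be queried programmatically; a minimal Python sketch (the 32000 threshold simply mirrors the ulimit value above):

```python
import resource

# Query the soft and hard limits on open file descriptors for the
# current process. Run this from the same environment that starts
# dCache to see what limit the JVM will inherit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))
if soft < 32000:
    print("warning: soft limit is below the 32000 suggested above")
```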
manually setting a replica location for a file
In the PnfsManager there are the following two commands:
add file cache location
clear file cache location
best practice to cleanly stop a GRIDFTP door
from Gerd Behrmann, 2009-10-22:
If all transfers go through the SRM, then all you need to do is to cd to the LoginBroker and run 'disable'. After that the SRM will stop generating TURLs for that door.
this was my original best guess:
Drain the door by setting login to zero for the door login manager cell in the admin shell:
cd GFTP-t3fs01
set max logins 0
One can see details about all transfers of that door by first listing its children
(GFTP-t3fs01) admin > get children
GFTP-t3fs01-Unknown-5671
GFTP-t3fs01-Unknown-5734
GFTP-t3fs01-Unknown-5676
GFTP-t3fs01-Unknown-5589
GFTP-t3fs01-Unknown-5817
GFTP-t3fs01-Unknown-5666
All these children are also well-known cells. One can enter them and use 'info' to get details:
(local) admin > cd GFTP-t3fs01-Unknown-5671
(GFTP-t3fs01-Unknown-5671) admin > info
FTPDoor
/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=fronga/CN=482262/CN=Frederic Ronga
User Host : cithep213.ultralight.org
Local Host : t3fs01.psi.ch
Last Command : STOR //pnfs/psi.ch/cms/trivcat/store/user/fronga/ntuples/data/PhotonJet_Pt300/NTupleProducer_35X_MC35x_RECO_1_2.root
Command Count : 12
I/O Queue : wan
GFTP-t3fs01-Unknown-5671@gridftp-t3fs01Domain;p=GFtp-1;o=501/0;
15539;000200000000000000AF9F10;cithep213.ultralight.org;t3fs07_cms_1;mover 28581: receiving;99573947;
Mover status
A 'mover ls' in a pool cell will print a line like the following:
30670 A R {DCap-storage01-unknow-16060@dcap-storage01Domain:10} 00020000000000000346D4D8 h={SM={a=1950673918;u=1950673918};S=None} bytes=116930640 time/sec=24 LM=0
- A: (A)ctive or (W)aiting transfer
- R: (R)unning or (H)eld
- {DCap-storage01-unknow-16060@dcap-storage01Domain:10}
- 00020000000000000346D4D8: pnfs ID
- h={SM={a=1950673918;u=1950673918};S=None}
- bytes=116930640: Bytes read/sent up to this point
- time/sec=24: time since mover creation(?)
- LM=0
To which queue does a particular active mover belong?
on a pool you can do
mover ls
(note the pnfsid), then you can do:
queue ls queue -l
(and find out in which queue the pnfsid is)
Maybe there is an easier method; this is what comes to mind.
Dmitry Litvintsev
On the difference of setting a pool to readonly in the PoolManager and in a Pool Cell
From a mail by Gerd Behrmann, 2009-08-07:
When the rdonly status is set in the PoolManager, only PoolManager-controlled transfers are prevented from writing to the pools. The pools themselves are still write-enabled, and the migration module happily copies files to them. This is actually a feature: you can stop production transfers from writing to the pools, but still use the migration module to move files onto them.
As Lionel suggested, marking the pool rdonly on the pool itself will prevent all writes. The migration module will still select the pool as a destination, but the write will fail and other pools are selected instead. The source file will not be deleted/modified until the file has been successfully copied to another pool.
On the moving of pinned files and the different meanings of the sticky flag
Please note that the X in rep ls does not mean pinned. It means sticky. Pinned files have a sticky bit, yes, but pinning is not the only reason that a file is sticky. Disk-only files are cached + sticky, with the sticky flag being owned by "system". Pinned files on the other hand have a sticky flag owned by "PinManager-ID", where ID is a pin-specific ID.
What you did was to COPY all sticky files (no matter whether they are pinned or not).
The -pins=move option means that the sticky bits used for pinning (and only those) are moved to the target pool. This involves negotiating the move with the pin manager, as the pin manager database needs to be updated. All the other sticky bits were copied. It is because of the pin manager database that there is a special option for what to do with pins; sticky bits used for pinning are never copied, as that would break the pin manager.
For disk-only files with a sticky bit owned by "system", the source was untouched because you asked for a copy. Had you used a move command, the source files would have been removed.
In general, using 'migration move' when you want to move files is always better. It has safe defaults which guarantee that data has indeed been moved, and it has defaults for dealing with the moving of pins. Don't use 'migration copy' unless you actually want several copies of a file.
If what you really wanted to do was to move all disk only files, then the correct thing to do would be:
migration move -state=cached -sticky=system -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
If you have access latency and retention policy configured on your system you could also have done
migration move -al=ONLINE -rp=REPLICA -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
If on the other hand you only wanted to move cached tape file you could possibly have done
migration move -al=CUSTODIAL -rp=NEARLINE -state=cached -concurrency=10 -exclude=heplns194_1 -exclude-when=target.removable<500G -target=pgroup -- cms-pgroup
Cheers,
/gerd
Removing cached only files from a pool
Fr, 2009-06-26:
To remove cached files on a pool, simply type 'sweeper purge' at the pool's admin prompt. Cached copies will then be removed by dCache when space is needed.
Regards,
Tigran.
File Flags for Pool Files (P, C, X,...)
G. Behrmann:
- Since 1.8.0, PRECIOUS means "must go to tape".
- C is short for CACHED, but nowadays this just means "not precious" hence "this copy doesn't go to tape".
- The X means sticky. REPLICA+ONLINE files are marked CACHED+STICKY with the lifetime on the sticky flag being infinity.
- CUSTODIAL+NEARLINE files are marked PRECIOUS, but they become CACHED once the file has been flushed to tape. If the file is pinned afterwards, it becomes CACHED+STICKY, but with a lifetime on the sticky flag corresponding to the lifetime of the pin.
Even better information from the dcache trac: http://trac.dcache.org/projects/dcache/wiki/manuals/RepLsOutput
<CPCScsRDXEL>
|||||||||||
||||||||||+-- (L) File is locked (currently in use)
|||||||||+--- (E) File is in error state
||||||||+---- (X) File is pinned (aka "sticky")
|||||||+----- (D) File is in process of being destroyed
||||||+------ (R) File is in process of being removed
|||||+------- (s) File sends data to back end store
||||+-------- (c) File sends data to client (dcap,ftp...)
|||+--------- (S) File receives data from back end store
||+---------- (C) File receives data from client (dcap,ftp)
|+----------- (P) File is precious
+------------ (C) File is cached
LOCK-TIME :
The number of milliseconds this file will still be locked. Please note that this is an internal lock and not the pin time (SRM).
OPEN-COUNT :
Number of clients currently reading this file.
SIZE : File size
STORAGE-CLASS : The storage class of this file.
From a mail by Gerd Behrmann (2009-07-06):
- What has changed between 1.7 and 1.8 is that both tape and disk only files previously were marked precious; in 1.8 tape files are precious when they are not on tape yet, and REPLICA+ONLINE files are marked cached+sticky (the C+X combination). If you have not configured space management (or rather, haven't configured the proper tags for this in PNFS), disk only files are still precious. Files that were migrated to tape, recalled from tape, or copied to other pools due to load are cached (i.e. without the X).
- The sticky flag (the X in rep ls) is really a LIST of sticky flags. As soon as the list is non-empty, the file is considered sticky (i.e. the X is set) and will not be garbage collected. Each sticky flag in the list has an ID (aka owner) and a lifetime. The lifetime may be infinite (that's when you don't specify it). For REPLICA+ONLINE files the owner is 'system' and the lifetime is infinite (i.e. -1).
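The list semantics can be sketched in a few lines; a hypothetical Python model (flag entries are (owner, expiry) pairs, with -1 standing for the infinite lifetime described above):

```python
import time

# A replica's sticky state is a LIST of (owner, expiry) flags.
# The X in 'rep ls' is shown as soon as at least one flag is
# present and not yet expired; expiry == -1 means infinite.
def is_sticky(flags, now=None):
    now = time.time() if now is None else now
    return any(expiry == -1 or expiry > now for _owner, expiry in flags)
```

For example, `is_sticky([])` is False, while `is_sticky([("system", -1)])` is True, matching the REPLICA+ONLINE case with owner 'system' and infinite lifetime.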
Extending space token lifetime
This is not yet possible via an admin command; it requires direct modification of the database.
From a mail by Dmitry Litvintsev, 2009-07-24:
I agree that this is a deficiency. I can add this admin function, of course. Meanwhile, the easiest thing to do is to execute something like this in the database:
update srmspace set lifetime=10*lifetime where id=ID;
where ID is space reservation id in question.
Checksums
G. Behrmann: What you can do is to go to the pool containing the file. Then you issue 'csm check ' followed by 'csm info' to check the result.
The 'storageinfoof' command in the PoolManager lists, among other details, the checksum and its type.
flag-c=1:cbda338e
^ ^
| |
| check sum value
check sum type: 1 = adler32, 2 = md5, 3 = md4
dCache authorization
http://www.dcache.org/manuals/workshop-aachen-2009/4_AccessControl_dCache.pdf
explanation of how the SRM cell gathers information for evaluating an srmls request
From Dmitry Litvintsev on a request from Lionel Schwarz (2009-06-11, user-forum):
Hi Lionel,
SRM normally:
1) requests the storageinfo from the PnfsManager using the path,
2) requests the metadata of the parent directory,
3) checks that the srm user can read the file, based on the info from (1) and (2),
4) checks with the PoolManager whether the file is cached (sending storageinfo and pnfsid to it).
Request to PoolManager (4) contains:
DirectionType accessType=READ
String storeUnitName=storage_info.getStorageClass()+"@"+storage_info.getHsm()
String dCacheUnitName=storage_info.getCacheClass()
String protocolUnitName="*/*"
String netUnitName=srm server host
after that it sets the _locality_ following this logic:
if the file is cached and stored, locality is TFileLocality.ONLINE_AND_NEARLINE;
if the file is cached and not stored, locality is TFileLocality.ONLINE;
if the file is not cached and stored, locality is TFileLocality.NEARLINE;
if the file is not cached and not stored, locality is TFileLocality.UNAVAILABLE;
if the file is a directory, locality is TFileLocality.NONE.
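The decision table above can be written out as a small sketch (the string values mirror the TFileLocality constants quoted in the mail; this is an illustration, not SRM code):

```python
# Map a file's cached/stored state to its SRM locality,
# following Dmitry's description above.
def locality(is_directory, cached, stored):
    if is_directory:
        return "NONE"
    if cached and stored:
        return "ONLINE_AND_NEARLINE"
    if cached:
        return "ONLINE"
    if stored:
        return "NEARLINE"
    return "UNAVAILABLE"
```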
SRM configuration variables
From a message from Timur (Feb 2008):
SRM srmPrepareToGet and srmBringOnline requests are executed by threads in a pool; srmGetReqThreadPoolSize specifies the maximum number of such threads. When all threads are busy, the remaining requests are put on a queue. The maximum number of elements in that queue is specified by srmGetReqThreadQueueSize. Once the files are prepared for reading, permissions have been verified and the files have been staged, the file status is changed to Ready and the TURL is given to the user. In order to limit the load on the system, the maximum number of such requests is limited to srmGetReqMaxReadyRequests. The remaining requests that are almost ready, except that all transfer slots (controlled by srmGetReqMaxReadyRequests) are occupied, are put on the ready queue; the maximum length of that queue is controlled by srmGetReqReadyQueueSize. If the execution of a request fails with a non-fatal error, the request is retried after the retry timeout; the timeout in milliseconds is controlled by srmGetReqRetryTimeout. If the request execution has been retried srmGetReqMaxNumberOfRetries times and still fails, the request is failed and the error is propagated to the client. In order to implement fairness, there is the parameter srmGetReqMaxNumOfRunningBySameOwner: if that many requests by the same user are already running, the next request in the queue belongs to the same user again, and there are requests further down the queue that belong to a different user, then those requests will be executed instead.
Almost the same meaning applies to the put and copy variables; they control the execution of srmPrepareToPut and srmCopy requests.
New PnfsManager configuration options
From a user forum mail from Gerd Behrmann 2009-05-26
Gerard Bernabeu wrote:
> Hello,
>
> recently dCacheSetup added a few new parameters and I'd like to know
> about your experience with them and what values you've there.
>
> *pnfsNumberOfThreadGroups *
>
> Thanks to the helpful explanation this is not hard to tune to "ps faux |
> grep postgres | grep -v grep | grep data[0-9] | wc -l".
> Have you tried/benchmarked it?
What happens is that a request for database n is assigned to thread
group n % pnfsNumberOfThreadGroups (i.e. database id modulo number of
thread groups). You want to choose a number such that the busy databases
get assigned to different thread groups.
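The assignment rule can be illustrated with a toy sketch (database ids and group count are made up for the example):

```python
# A request for database n lands in thread group n % number_of_groups,
# per Gerd's explanation above.
def thread_group(db_id, n_groups):
    return db_id % n_groups

# With 16 databases and 10 thread groups, databases 0 and 10 share
# group 0, 1 and 11 share group 1, and so on; two heavily used
# databases whose ids differ by a multiple of the group count would
# end up contending for the same thread group.
```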
>
> *pnfsNumberOfLocationThreads*
>
> If we've ~10 pnfsNumerOfThreadGroups, what do you think the ideal value
> is here? 2? 10?
> Have you tried/benchmarked it?
2-3 should be just fine. The companion lookups are really fast, so there
is likely no need to have a lot of queues for this. The important part
is to get the requests moved away from the regular PNFS queues.
>
> *pnfsNumberOfThreads*
>
> With pnfsNumberOfLocationThreads and pnfsNumberOfThreadGroups disabled
> (by default) we're currently using 10 here, and so far it works OK
> (besides no serious benchmarking done).
> If we set pnfsNumerOfThreadGroups to 16 (current number of DBs at PIC),
> how much would you lower pnfsNumberOfThreads?
I would probably go for 2 to 4 threads per thread group. A PNFS database
is single threaded, so it cannot process more than one request at a time
anyway. Using more than one thread makes sense because you can achieve a
little concurrency in the internal processing in PnfsManager and also in
the pnfsd frontend in front of the PNFS database.
Please make sure you got enough pnfsd instances - otherwise having so
many threads in the PNFS manager doesn't make much sense.
> *pnfsFolding*
>
> This looks like a safe improvement to me, have you tried this new
> option? Any benchmarks?
It is used at NDGF, FZK and Fermi and seems safe.
>
> *log slow threshold*
>
> Any info on what it does? Any experience?
Same as the similarly named setting in postgresql: once a request in
PnfsManager takes more than the time you specified, the request is
logged in the log file.
We don't have any benchmarks at all. If you do benchmarking, then notice
that many of the above changes have no effect until the system gets
under high load.
E.g. pnfs folding will not fold anything unless you got lots of queuing
in PnfsManager and multiple requests for the same meta data (e.g.
multiple uploads to the same directory).
Thread groups have no effect unless you got lots of requests to
different PNFS databases (in that case you should observe increased
concurrency) - and only if you actually got enough pnfsd instances idle
(which may not be the case if you got lots of FTP doors doing listing at
the same time).
Notice that 1.9.0-10, 1.9.1-5 to -7, and all 1.9.2 earlier than -6 have
a bug in the thread group implementation: This bug causes an increased
load on the pnfs backend. You do however suffer this overhead no matter
whether you have a single thread group or several thread groups.
Automatic Replication of hot files
There is a newer article by Paul Millar on this: Improved Pool to pool tuning in 1.9.5
Mail from antonio.delgado.peris , 2010-02-18
We have p2p oncost activated and it seems to work (or it did before our just-finished upgrade). Before getting it to work we had to perform some tests and read the documentation carefully (that part was not trivial...) to understand how it really works. So, if nothing has changed, I think the rules are:
- The p2p=2.0 value means that the pool to pool replication will start when the load factor on the pool is greater than 2.0.
- For a read/write pool (no matter whether you have tape or not), that load factor is calculated as:
[sum of all transfers (active + waiting) in (read, write, restore)] / (3 * max-transfers-allowed)
- For a read-only pool it would be:
[sum of all transfers (active + waiting) in (read, restore)] / (2 * max-transfers-allowed)
So, for p2p=2.0 and a maximum of 100 transfers, a p2p transfer would start once you have more than 600 concurrent transfers of any kind on a given pool, which may well never have occurred so far. As a reference, at our site the p2p factor is set to 0.1.
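Antonio's formulas can be checked with a small sketch (function and parameter names are illustrative, not dCache configuration keys):

```python
# Load factor that is compared against the p2p threshold, per the
# rules above: (active + waiting transfers) divided by
# (number of relevant queues * max transfers allowed).
def p2p_load(reads, writes, restores, max_transfers, read_only=False):
    if read_only:
        # read-only pools: only read and restore queues count
        return (reads + restores) / (2.0 * max_transfers)
    return (reads + writes + restores) / (3.0 * max_transfers)

# Example from the mail: with max 100 transfers, 600 concurrent
# transfers on a read/write pool give a load factor of exactly 2.0.
```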
Cheers,
Antonio.
viewing the pool cost calculation
The PoolManager cell prints out a cost calculation for every pool to its pinboard:
[storage02.lcg.cscs.ch] (PoolManager) admin > show pinboard
10.02.49 [writeHandler] [] queryPoolsForCost : costModule : ibm02_data7_atlas (0) ibm02_data7_atlas={Tag={{hostname=se42}};size=0;SC=2.1787723858082185E-5;CC=0.0050;}
10.02.49 [writeHandler] [] queryPoolsForCost : costModule : ibm01_data2_atlas (0) ibm01_data2_atlas={Tag={{hostname=se41}};size=0;SC=2.2317136073855644E-5;CC=0.0;}
-- DerekFeichtinger - 12 May 2009