Documentation of PNFS-related problems on our dCache installation
We began looking in more depth at the problem of root-owned directories in PNFS, i.e. directories that were created during normal user requests but that for some reason ended up with the ownership root.root, causing errors for user requests that try to write to these areas.
The problem became particularly bad shortly after upgrading from dCache 1.8.0-15p to 1.9.3, but we cannot really say whether the worsening has any connection to the new version. When additional service problems became apparent, we decided to rush into a migration from PNFS to Chimera, which revealed a number of other problems. All of this prompted us to examine our PNFS a little closer.
One case where root owned directories appear, and their behavior
seq 9|xargs -P9 -n1 --replace srmmkdir srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-{}
Return code: SRM_FAILURE
Explanation: srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/dircreate-9 not found ( .(id)(dircreate-9) )
$ ls -ld dircreate-9
drwxr-xr-x 1 root root 512 Sep 8 16:49 dircreate-9
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory
$ chown cmsprd.cms /pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9
$ ls -ld dircreate-9
drwxr-xr-x 1 cmsprd cms 512 Sep 8 16:49 dircreate-9
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory
Strangely enough, when I checked the directory 2 hours later, it had obtained a pnfs ID.
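The manual checks above (owner via ls -ld, pnfs ID via the ".(id)(name)" magic file) can be repeated for a whole batch of test directories. The following is only a minimal sketch: the function name check_dirs and its output format are made up for illustration, and it assumes it is run on a host where PNFS is mounted.

```shell
# check_dirs BASE NAME...
# For each directory NAME below BASE, print "NAME owner.group has-id|NO-ID".
# Hypothetical helper: the name and the output format are illustrative only.
check_dirs() {
    base=$1
    shift
    for name in "$@"; do
        # Fields 3 and 4 of "ls -ld" are the owner and the group.
        owner=$(ls -ld "$base/$name" | awk '{print $3 "." $4}')
        # Reading ".(id)(NAME)" in the parent directory succeeds only if
        # the name can be resolved to a pnfs ID.
        if cat "$base/.(id)($name)" >/dev/null 2>&1; then
            idstate=has-id
        else
            idstate=NO-ID
        fi
        echo "$name $owner $idstate"
    done
}
```

For the test above it could be called as: check_dirs /pnfs/lcg.cscs.ch/cms/local_tests $(seq -f 'dircreate-%g' 9)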
The setattr entry probably derives from my manual ownership change.
Let's compare with the entries of one of the successfully created directories: dircreate-1
09/08/09 16:49:43 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreate-1 : 000200000000000002BCD9B8 (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(0)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=0;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000000 (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(1)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=1;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000001 (0) -> 0
The two pset lines are almost exact copies of each other; they differ only in the level (0 vs. 1) and in the last digit of the resulting ID. In the case of the faulty "root owned" directory dircreate-9 above, the pset line is missing completely.
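To look for other directories showing the same symptom, one can scan the log for mkdir records whose new ID never appears in a .(pset) record. This is only a sketch under assumptions: the function name find_missing_pset is made up, the record layout is taken from the excerpts above, and each log record is assumed to sit on a single line.

```shell
# find_missing_pset [LOGFILE...]
# Print "name id" for every mkdir record that has no matching .(pset) record.
# Hypothetical helper; reads standard input if no log file is given.
find_missing_pset() {
    awk '
        # mkdir record: "... - mkdir dir <parent> name <name> : <id> (0) -> 0"
        / - mkdir dir / {
            name = ""; id = ""
            for (i = 1; i < NF; i++) {
                if ($i == "name") name = $(i + 1)
                if ($i == ":")    id = $(i + 1)
            }
            if (id != "") names[id] = name
        }
        # pset record: the name field starts with ".(pset)(<id>)"
        / - create dir / && /\.\(pset\)\(/ {
            s = $0
            sub(/.*\.\(pset\)\(/, "", s)   # drop everything up to the ID
            sub(/\).*/, "", s)             # drop everything after the ID
            seen[s] = 1
        }
        END {
            for (id in names)
                if (!(id in seen)) print names[id], id
        }
    ' "$@"
}
```

A missing pset record is only the symptom observed for dircreate-9; as the later tests show, root-owned directories can also occur with a pnfs ID present, so this check will not catch every faulty directory.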
I ran the whole test sequence a second time. This time two directories ended up root owned, but both of them had pnfs IDs. So it seems that these two symptoms are not necessarily coupled (maybe the creation process dies at different points).
Another test run again produced a single problematic directory with a missing pnfs ID:
$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
cat: /pnfs/fs/usr/cms/local_tests/.(id)(dircreateC-5): No such file or directory
$ date
Tue Sep 8 21:48:09 CEST 2009
Corresponding log entry:
09/08/09 21:47:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreateC-5 : 000200000000000002BCDFE0 (0) -> 0
Even though the name to ID resolution fails, the ID to name resolution (using the ID from the log) works: