KeyWords:
SysAdmin
Documentation of PNFS-related problems on our dCache installation
We began looking in more depth at the problem of root-owned directories in PNFS, i.e. directories that were created during normal user requests but for some reason ended up owned by root.root, which caused errors for user requests trying to write to these areas.
The problem became particularly bad shortly after upgrading from dCache 1.8.0-15p to 1.9.3, but we cannot say whether the worsening is connected to the new version. When additional service problems became apparent, we decided to rush into a migration from PNFS to Chimera, which exposed a number of other problems. All of this prompted us to examine our PNFS a little more closely.
One case where root-owned directories appear, and their behavior
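Creating nine directories in parallel through SRM triggers the problem. In this run, the request for dircreate-9 failed:
$ seq 9|xargs -P9 -n1 --replace srmmkdir srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-{}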
Return code: SRM_FAILURE
Explanation: srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/dircreate-9 not found ( .(id)(dircreate-9) )
$ ls -ld dircreate-9
drwxr-xr-x 1 root root 512 Sep 8 16:49 dircreate-9
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory
$ chown cmsprd.cms /pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9
$ ls -ld dircreate-9
drwxr-xr-x 1 cmsprd cms 512 Sep 8 16:49 dircreate-9
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory
Strangely enough, when I checked the directory two hours later, it had acquired a pnfs ID.
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
000200000000000002BCD9E0
How can that be? I see two possibilities:
- the original creation process had been stuck, and finished at some point
- there is a repair process going over the filesystem in intervals (this would be strange)
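One way to tell these two possibilities apart would be to poll the .(id) lookup and record when it first succeeds. If the timestamps from several broken directories line up with a fixed schedule, that would hint at a periodic process. The following is only a sketch: the function name wait_for_pnfs_id, the default polling interval and the retry limit are made up for illustration and are not part of PNFS or dCache.

```shell
#!/usr/bin/env bash
# wait_for_pnfs_id DIR NAME [INTERVAL_SECONDS] [MAX_TRIES]
# (hypothetical helper) Polls the dot command file .(id)(NAME) in DIR.
# Prints "timestamp NAME ID" once the lookup succeeds; returns 1 if it
# never succeeds within MAX_TRIES attempts.
wait_for_pnfs_id() {
    local dir="$1" name="$2" interval="${3:-60}" max_tries="${4:-1440}"
    local pnfsid try=1
    while [ "$try" -le "$max_tries" ]; do
        if pnfsid=$(cat "$dir/.(id)($name)" 2>/dev/null); then
            printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$name" "$pnfsid"
            return 0
        fi
        sleep "$interval"
        try=$((try + 1))
    done
    return 1
}
```

It could be started right after a failed srmmkdir, for example as wait_for_pnfs_id /pnfs/lcg.cscs.ch/cms/local_tests dircreate-9 60, and left running in the background.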
Checking in the pnfs log, I can identify two entries:
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreate-9 : 000200000000000002BCD9E0 (0) -> 0
09/08/09 16:52:38 127.0.0.1-0-0(0,1,2,3,4,6,10,) - setattr 000200000000000002BCD9E0-0000000000000000 uid=4199;gid=4001;size=-1;mode=37777777777;a=ffffffff;m=ffffffff (0) -> 0
The setattr entry probably derives from my manual ownership change.
Let's compare with the entries of one of the successfully created directories: dircreate-1
09/08/09 16:49:43 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreate-1 : 000200000000000002BCD9B8 (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(0)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=0;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000000 (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(1)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=1;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000001 (0) -> 0
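The two pset lines are copies of each other except for the level index, which shows up in the (attr)(X) part of the name, in the level=X field, and in the last digit of the resulting ID. In the case of the faulty root-owned directory dircreate-9 above, the pset lines are missing completely.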
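The comparison between mkdir and pset entries can be automated for a whole log file. The sketch below lists every directory whose mkdir entry has no .(pset) entry for the same ID. It assumes that the real log holds one entry per physical line and that the field layout matches the excerpts shown here; the function name find_dirs_without_pset is hypothetical. Only the dircreate-9 case is known to lack its pset entries, so a hit is a hint, not proof of a root-owned directory.

```shell
#!/usr/bin/env bash
# find_dirs_without_pset LOGFILE
# (hypothetical helper) Prints "ID NAME" for each mkdir entry whose new ID
# never appears in a .(pset)(ID) entry anywhere in LOGFILE.
find_dirs_without_pset() {
    awk '
        / - mkdir dir / {
            # Assumed layout: ... name <dirname> : <new-id> (0) -> 0
            for (i = 1; i <= NF; i++)
                if ($i == "name") {
                    name[$(i + 3)] = $(i + 1)
                    order[++n] = $(i + 3)
                    break
                }
        }
        match($0, /\.\(pset\)\([0-9A-Fa-f]+\)/) {
            # Skip the 8 characters of ".(pset)(" and drop the closing ")".
            seen[substr($0, RSTART + 8, RLENGTH - 9)] = 1
        }
        END {
            for (k = 1; k <= n; k++)
                if (!(order[k] in seen))
                    print order[k], name[order[k]]
        }
    ' "$1"
}
```

Run against a log containing the entries above, it would report dircreate-9 but not dircreate-1.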
I ran the whole test a second time. This time two directories ended up root-owned, but both of them had pnfs IDs. So these two symptoms are not necessarily coupled (maybe the creation process dies at different points).
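Since wrong ownership and a missing pnfs ID can occur independently, it helps to check both for every directory after a test run. The following sketch lists the directories directly below a test area that belong to a given owner (root by default) and prints either the pnfs ID or NO-ID for each. The function name report_dirs_owned_by and the NO-ID marker are hypothetical, and the find options assume GNU findutils.

```shell
#!/usr/bin/env bash
# report_dirs_owned_by DIR [OWNER]
# (hypothetical helper) For each directory directly below DIR that is owned
# by OWNER (default: root), prints "NAME ID", or "NAME NO-ID" when the
# .(id)(NAME) lookup fails.
report_dirs_owned_by() {
    local dir="$1" owner="${2:-root}" path name pnfsid
    find "$dir" -mindepth 1 -maxdepth 1 -type d -user "$owner" | sort |
    while IFS= read -r path; do
        name=${path##*/}
        if pnfsid=$(cat "$dir/.(id)($name)" 2>/dev/null); then
            printf '%s %s\n' "$name" "$pnfsid"
        else
            printf '%s %s\n' "$name" "NO-ID"
        fi
    done
}
```

For the test area used here, the call would be report_dirs_owned_by /pnfs/lcg.cscs.ch/cms/local_tests.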
Another test again created a single problematic directory with a missing pnfs ID:
$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
cat: /pnfs/fs/usr/cms/local_tests/.(id)(dircreateC-5): No such file or directory
$ date
Tue Sep 8 21:48:09 CEST 2009
Corresponding log entry:
09/08/09 21:47:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreateC-5 : 000200000000000002BCDFE0 (0) -> 0
Even though the name to ID resolution fails, the ID to name resolution (using the ID from the log) works:
$ cat /pnfs/lcg.cscs.ch/cms/".(name)(000200000000000002BCDFE0)"
dircreateC-5
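Both lookup directions can be checked in one step once the ID is known from the log. The sketch below reads .(id)(NAME) and .(name)(ID) and reports whether they agree. The function name check_pnfs_mapping and the `<unresolved>` marker are hypothetical. It also assumes that the .(name) lookup works from the directory passed in, which the example above suggests since it was run from a parent directory.

```shell
#!/usr/bin/env bash
# check_pnfs_mapping DIR NAME ID
# (hypothetical helper) Prints the results of the name-to-ID and ID-to-name
# lookups and returns 0 only if both resolve and match the arguments.
check_pnfs_mapping() {
    local dir="$1" name="$2" id="$3" fwd rev
    fwd=$(cat "$dir/.(id)($name)" 2>/dev/null) || fwd="<unresolved>"
    rev=$(cat "$dir/.(name)($id)" 2>/dev/null) || rev="<unresolved>"
    printf 'name->id: %s\nid->name: %s\n' "$fwd" "$rev"
    [ "$fwd" = "$id" ] && [ "$rev" = "$name" ]
}
```

A non-zero exit status with only the first line unresolved would correspond to the state observed here for dircreateC-5.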
This directory showed the same behavior as the other such cases: on the morning of the next day, the ID was printed correctly when invoking the .(id) command. The directory still belonged to root.root. In the pnfs log file, there was still only the "mkdir" line that I could associate with this directory. No other line contained the matching pnfs ID.
$ date
Wed Sep 9 09:26:44 CEST 2009
$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
000200000000000002BCDFE0
$ ls -ld /pnfs/fs/usr/cms/local_tests/dircreateC-5
drwxr-xr-x 1 root root 512 Sep 8 21:47 /pnfs/fs/usr/cms/local_tests/dircreateC-5