KeyWords: SysAdmin

Documentation of PNFS-related problems on our dCache installation

We began looking in more depth at the problem of root-owned directories in PNFS, i.e. directories that were created during normal user requests but for some reason ended up owned by root.root, so that user requests trying to write to these areas fail. The problem became particularly bad shortly after upgrading from dCache 1.8.0-15p to 1.9.3, but we cannot say whether the worsening has any connection to the new version. When additional service problems became apparent, we decided to rush into a migration from pnfs to chimera, which exposed a number of other problems. All of this prompted us to examine our PNFS a little more closely.

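One case where root-owned directories appear, and their behavior

The following test issues nine srmmkdir requests in parallel (xargs -P9), one per directory dircreate-1 to dircreate-9. In this run the request for dircreate-9 failed with the error shown below the command: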

seq 9|xargs -P9 -n1 --replace srmmkdir srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-{}

Return code: SRM_FAILURE
Explanation: srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/dircreate-9 not found ( .(id)(dircreate-9) )

$ ls -ld dircreate-9
drwxr-xr-x 1 root   root 512 Sep  8 16:49 dircreate-9

$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory

$ chown cmsprd.cms /pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9
$ ls -ld dircreate-9
drwxr-xr-x 1 cmsprd cms 512 Sep  8 16:49 dircreate-9

$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory

Strangely enough, when I checked the directory two hours later, it had acquired a pnfs ID.

$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
000200000000000002BCD9E0

How can that be? I see two possibilities:

  • the original creation process had been stuck, and finished at some point
  • there is a repair process going over the filesystem in intervals (this would be strange)
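To tell these two possibilities apart, it would help to know when exactly the ID becomes resolvable, rather than noticing it hours later. A minimal sketch of a polling helper follows; the function name wait_for_id, the default interval and the default number of tries are assumptions for illustration, only the .(id)(name) lookup is taken from the session above.

```shell
#!/bin/sh
# Poll the .(id)(NAME) entry until it resolves and print a timestamp.
# $1: parent directory   $2: entry name
# $3: seconds between polls (assumed default: 60)
# $4: maximum number of polls (assumed default: 1440)
wait_for_id() {
    dir=$1; name=$2; interval=${3:-60}; tries=${4:-1440}
    n=0
    while [ "$n" -lt "$tries" ]; do
        # cat fails with "No such file or directory" while the ID is missing
        if id=$(cat "$dir/.(id)($name)" 2>/dev/null); then
            printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$name" "$id"
            return 0
        fi
        n=$((n + 1))
        sleep "$interval"
    done
    return 1
}
```

Called as wait_for_id /pnfs/lcg.cscs.ch/cms/local_tests dircreate-9, it prints one line as soon as the lookup succeeds. If the timestamps for several broken directories cluster at regular intervals, that would point towards a periodic repair process.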

Checking the pnfs log, I can identify two entries:

09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir  dir 0002000000000000000010C0-0000000000000000 name dircreate-9  : 000200000000000002BCD9E0 (0)  -> 0

09/08/09 16:52:38 127.0.0.1-0-0(0,1,2,3,4,6,10,) - setattr 000200000000000002BCD9E0-0000000000000000 uid=4199;gid=4001;size=-1;mode=37777777777;a=ffffffff;m=ffffffff  (0)  -> 0

The setattr entry probably derives from my manual ownership change.

Let's compare with the entries of one of the successfully created directories: dircreate-1

09/08/09 16:49:43 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir  dir 0002000000000000000010C0-0000000000000000 name dircreate-1  : 000200000000000002BCD9B8 (0)  -> 0

09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create  dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(0)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08)  uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=0;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000000  (0)  -> 0

09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create  dir 000000000000000000001040-0000000000000000 name .(pset)(000200000000000002BCD9B8)(attr)(1)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08)  uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=000200000000000002BCD9B8;;level=1;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000001  (0)  -> 0

The pset lines seem to be exact copies of each other, except for the level (the (attr)(X) and level=X parts, and the last digit of the resulting ID). In the case of the faulty "root-owned" directory dircreate-9 above, the pset lines are missing completely.
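The manual comparison above can be automated by scanning the log for mkdir entries whose new ID never shows up in a .(pset) entry. The following is only a sketch: the function name missing_pset is made up, and it assumes that each log entry occupies a single physical line in the real log file (the entries quoted on this page were wrapped by the terminal) and that the field layout is the one seen in the excerpts.

```shell
#!/bin/sh
# Print "ID NAME" for every mkdir log entry that has no matching
# .(pset)(ID) entry in the given pnfs log file(s).
missing_pset() {
    awk '
        / - mkdir / {
            name = ""; id = ""
            # directory name follows the word "name", new ID follows ":"
            for (i = 1; i < NF; i++) {
                if ($i == "name") name = $(i + 1)
                if ($i == ":")    id   = $(i + 1)
            }
            if (id != "") made[id] = name
        }
        /\.\(pset\)\(/ {
            # extract ID from .(pset)(ID)(attr)...
            s = $0
            sub(/.*\.\(pset\)\(/, "", s)
            sub(/\).*/, "", s)
            seen[s] = 1
        }
        END {
            for (id in made)
                if (!(id in seen)) print id, made[id]
        }
    ' "$@"
}
```

Run as missing_pset followed by the path of the pnfs log file. Directories listed by it are candidates for the root-owned state, but as long as the cause is unknown this is an indicator, not a proof.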

I ran the whole test sequence a second time. This time two directories ended up root-owned, but both of them had pnfs IDs. So it seems that the two symptoms (root ownership and a missing pnfs ID) are not necessarily coupled; maybe the creation process dies at different points.
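Since the two symptoms can occur independently, a quick survey that lists the directories of a given owner together with the result of the .(id) lookup shows both at a glance. This is a sketch under assumptions: the function name survey_owned and the NO-ID marker are invented here, and it relies on GNU find (-mindepth). Walking a large PNFS tree with find may put noticeable load on the server, so it is better restricted to a test area.

```shell
#!/bin/sh
# For every directory below $1 owned by $2 (default: root), print the
# path and either its pnfs ID or the marker NO-ID, separated by a tab.
survey_owned() {
    top=$1; owner=${2:-root}
    find "$top" -mindepth 1 -type d -user "$owner" -print |
    while IFS= read -r d; do
        parent=$(dirname "$d"); name=$(basename "$d")
        # same lookup as done by hand with cat ".(id)(NAME)"
        if id=$(cat "$parent/.(id)($name)" 2>/dev/null); then
            printf '%s\t%s\n' "$d" "$id"
        else
            printf '%s\t%s\n' "$d" "NO-ID"
        fi
    done
}
```

For example, survey_owned /pnfs/lcg.cscs.ch/cms/local_tests would list the root-owned test directories and mark those whose name to ID resolution currently fails.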

Another test again created a single problematic directory with a missing pnfs ID:

$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
cat: /pnfs/fs/usr/cms/local_tests/.(id)(dircreateC-5): No such file or directory
$ date
Tue Sep  8 21:48:09 CEST 2009

Corresponding log entry:

 09/08/09 21:47:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir  dir 0002000000000000000010C0-0000000000000000 name dircreateC-5  : 000200000000000002BCDFE0 (0)  -> 0

Even though the name to ID resolution fails, the ID to name resolution (using the ID from the log) works:

$ cat /pnfs/lcg.cscs.ch/cms/".(name)(000200000000000002BCDFE0)"
dircreateC-5
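Both lookup directions can be checked in one step with a small helper. This is a sketch only: the function name check_roundtrip is made up, and it reads the .(name)(ID) entry from the same directory as the .(id)(NAME) entry, which is an assumption (the session above read it from a parent directory).

```shell
#!/bin/sh
# Compare name->ID and ID->name resolution for one entry.
# $1: parent directory  $2: entry name  $3: pnfs ID taken from the log
# Returns 0 only if both directions resolve and agree.
check_roundtrip() {
    dir=$1; name=$2; logid=$3
    id=$(cat "$dir/.(id)($name)" 2>/dev/null) || id="(unresolved)"
    back=$(cat "$dir/.(name)($logid)" 2>/dev/null) || back="(unresolved)"
    printf 'name->id: %s\nid->name: %s\n' "$id" "$back"
    [ "$id" = "$logid" ] && [ "$back" = "$name" ]
}
```

For the case above, check_roundtrip /pnfs/fs/usr/cms/local_tests dircreateC-5 000200000000000002BCDFE0 would report the name to ID direction as unresolved while the ID to name direction returns dircreateC-5.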

This directory showed the same behavior as the other such cases: on the morning of the next day, the ID was printed correctly when reading the .(id) entry. The directory still belonged to root.root. In the pnfs log file, the "mkdir" line was still the only one that I could associate with this directory; no other line contained the matching pnfs ID.

$ date
Wed Sep 9 09:26:44 CEST 2009
$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
000200000000000002BCDFE0

$ ls -ld /pnfs/fs/usr/cms/local_tests/dircreateC-5
drwxr-xr-x 1 root root 512 Sep  8 21:47 /pnfs/fs/usr/cms/local_tests/dircreateC-5

Topic revision: r4 - 2009-09-09 - DerekFeichtinger
 