<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

KeyWords: SysAdmin

---+ documentation of PNFS related problems on our dcache installation

We began looking in more depth at the problem of root-owned directories in PNFS, i.e. directories that were created during normal user requests but that for some reason ended up with the ownership root.root, leading to errors for user requests trying to write to these areas. The problem became particularly bad shortly after upgrading from dCache 1.8.0-15p to 1.9.3, but we cannot say whether the worsening has any connection to the new version. When additional service problems became apparent, we decided to rush into a migration from PNFS to Chimera, which revealed a number of other problems. All of this prompted us to examine our PNFS a little closer.

One case where root-owned directories appear, and their behavior:

<pre>
seq 9|xargs -P9 -n1 --replace srmmkdir srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-{}
Return code: SRM_FAILURE
Explanation: srm://storage01.lcg.cscs.ch:8443/pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9 Failed to create, got error return code from pnfs: path /pnfs/fs/usr/cms/local_tests/dircreate-9 not found ( .(id)(dircreate-9) )

$ ls -ld dircreate-9
drwxr-xr-x 1 root root 512 Sep 8 16:49 dircreate-9

$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory

$ chown cmsprd.cms /pnfs/lcg.cscs.ch/cms/local_tests/dircreate-9
$ ls -ld dircreate-9
drwxr-xr-x 1 cmsprd cms 512 Sep 8 16:49 dircreate-9

$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
cat: /pnfs/lcg.cscs.ch/cms/local_tests/.(id)(dircreate-9): No such file or directory
</pre>

Strangely enough, when I checked the directory 2 hours later, it had got a pnfs ID.

<pre>
$ cat /pnfs/lcg.cscs.ch/cms/local_tests/".(id)(dircreate-9)"
000200000000000002BCD9E0
</pre>

How can that be? I see two possibilities:
   * the original creation process had been stuck, and finished at some point
   * there is a repair process going over the filesystem at intervals (this would be strange)

Checking in the pnfs log, I can identify two entries:

<pre>
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreate-9 : %GREEN%000200000000000002BCD9E0%ENDCOLOR% (0) -> 0
09/08/09 16:52:38 127.0.0.1-0-0(0,1,2,3,4,6,10,) - setattr %GREEN%000200000000000002BCD9E0%ENDCOLOR%-0000000000000000 uid=4199;gid=4001;size=-1;mode=37777777777;a=ffffffff;m=ffffffff (0) -> 0
</pre>

The setattr entry probably derives from my manual ownership change.

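The manual checks above (owner via =ls -ld=, pnfs ID via the =.(id)(name)= lookup) can be repeated over a whole test directory. The following is only a minimal sketch: =list_suspect_dirs= is a hypothetical helper name, not part of pnfs or dCache, and it assumes that the third column of =ls -ld= is the owner and that a failing =cat= on =.(id)(name)= means the ID is missing.

```shell
#!/bin/sh
# list_suspect_dirs PARENT [OWNER]
# Hypothetical helper: for every subdirectory of PARENT that is owned by
# OWNER (default: root), print its name and its pnfs ID, or MISSING if the
# ".(id)(name)" lookup in PARENT fails.
list_suspect_dirs() {
    parent=$1
    owner=${2:-root}
    for path in "$parent"/*/; do
        [ -d "$path" ] || continue          # also skips an unmatched glob
        path=${path%/}
        name=${path##*/}
        # Same check as the manual "ls -ld": third column is the owner.
        dir_owner=$(ls -ld "$path" | awk '{print $3}')
        [ "$dir_owner" = "$owner" ] || continue
        # Same check as the manual cat on ".(id)(name)".
        if id=$(cat "$parent/.(id)($name)" 2>/dev/null); then
            printf '%s\t%s\n' "$name" "$id"
        else
            printf '%s\t%s\n' "$name" MISSING
        fi
    done
}

# Example call (path taken from the test above):
# list_suspect_dirs /pnfs/lcg.cscs.ch/cms/local_tests
```

The owner is a parameter only so that the function can be tried out on an ordinary filesystem without root privileges.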
Let's compare with the entries of one of the successfully created directories, dircreate-1:

<pre>
09/08/09 16:49:43 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreate-1 : %GREEN%000200000000000002BCD9B8%ENDCOLOR% (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(%GREEN%000200000000000002BCD9B8%ENDCOLOR%)(attr)(0)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=%GREEN%000200000000000002BCD9B8%ENDCOLOR%;;level=0;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000000 (0) -> 0
09/08/09 16:49:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - create dir 000000000000000000001040-0000000000000000 name .(pset)(%GREEN%000200000000000002BCD9B8%ENDCOLOR%)(attr)(1)(100775:4199:4001:4aa66f08:4aa66f08:4aa66f08) uid=0;gid=-1;size=-1;mode=100644;a=ffffffff;m=ffffffff;id=%GREEN%000200000000000002BCD9B8%ENDCOLOR%;;level=1;;line=100775:4199:4001:4aa66f08:4aa66f08:4aa66f08; : 000200000000000002BCD9B9-0000001B00000001 (0) -> 0
</pre>

The *pset* lines seem to be exact copies of each other, except for the level: the *(attr)(X)* and *level=X* parts and the final digit of the returned ID. In the case of the faulty "root owned" directory dircreate-9 above, the pset lines are missing completely.

I then ran the whole test sequence a second time. This time two directories ended up root owned, but both of them had pnfs IDs. So it seems that these two symptoms (root ownership and a missing pnfs ID) are not necessarily coupled; maybe the creation process dies at different places.

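The comparison above suggests a simple log check: list every =mkdir= entry whose new ID never shows up in a =.(pset)(ID)= entry. The following is a rough sketch under assumptions: =find_mkdirs_without_pset= is a hypothetical name, the log lines are assumed to have exactly the layout of the excerpts above (one entry per line, without the color markup used on this page), and the log file path is whatever the local pnfs installation uses.

```shell
#!/bin/sh
# find_mkdirs_without_pset LOGFILE
# Hypothetical helper: print "ID NAME" for every mkdir entry in LOGFILE
# whose ID does not appear in any ".(pset)(ID)" entry of the same file.
find_mkdirs_without_pset() {
    awk '
        # mkdir entry: "... - mkdir dir PARENT name NAME : ID (0) -> 0"
        / - mkdir dir / {
            for (i = 1; i < NF; i++) {
                if ($i == "name") name = $(i + 1)
                if ($i == ":")    id = $(i + 1)
            }
            names[id] = name
            order[++n] = id
        }
        # attribute entry: the directory ID sits inside ".(pset)(ID)"
        match($0, /\.\(pset\)\([0-9A-F]+\)/) {
            # ".(pset)(" is 8 characters, the closing ")" is 1
            seen[substr($0, RSTART + 8, RLENGTH - 9)] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in seen)) print order[i], names[order[i]]
        }
    ' "$1"
}
```

Since a directory can apparently be root owned and still have a pnfs ID, the output is only a list of candidates to look at, not a definitive list of broken directories.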
Another test again created a single problematic directory with a missing pnfs ID:

<pre>
$ cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
cat: /pnfs/fs/usr/cms/local_tests/.(id)(dircreateC-5): No such file or directory
$ date
Tue Sep 8 21:48:09 CEST 2009
</pre>

Corresponding log entry:

<pre>
09/08/09 21:47:44 127.0.0.1-0-0(0,1,2,3,4,6,10,) - mkdir dir 0002000000000000000010C0-0000000000000000 name dircreateC-5 : 000200000000000002BCDFE0 (0) -> 0
</pre>

Even though the name to ID resolution fails, the ID to name resolution (using the ID from the log) works:

<pre>
$ cat /pnfs/lcg.cscs.ch/cms/".(name)(000200000000000002BCDFE0)"
dircreateC-5
</pre>

This directory showed the same behavior as the other such cases: on the morning of the next day, the ID was printed correctly when invoking the .(id) command. The directory still belonged to root.root. In the pnfs log file, the "mkdir" line was still the only one that I could associate with this directory. No other line contained the matching pnfs ID.

<pre>
Wed Sep 9 09:26:44 CEST 2009
cat /pnfs/fs/usr/cms/local_tests/".(id)(dircreateC-5)"
000200000000000002BCDFE0
ls -ld /pnfs/fs/usr/cms/local_tests/dircreateC-5
drwxr-xr-x 1 root root 512 Sep 8 21:47 /pnfs/fs/usr/cms/local_tests/dircreateC-5
</pre>

<!--
---++ Readers' comments
COMMENT{type="below"}
-->
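So far the time at which the ID becomes visible is only known to within a few hours. To narrow it down, the =.(id)(name)= lookup could be polled at a fixed interval and the first success logged with a timestamp. This is just a sketch: =wait_for_pnfs_id= is a hypothetical helper, and the default interval and number of tries are arbitrary values, not anything prescribed by pnfs.

```shell
#!/bin/sh
# wait_for_pnfs_id PARENT NAME [INTERVAL_SECONDS] [MAX_TRIES]
# Hypothetical helper: poll the ".(id)(NAME)" lookup in PARENT. On the first
# success print "TIMESTAMP ID" and return 0; return 1 if the lookup still
# fails after MAX_TRIES attempts.
wait_for_pnfs_id() {
    parent=$1
    name=$2
    interval=${3:-60}       # arbitrary default: once per minute
    max_tries=${4:-1440}    # arbitrary default: about one day
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        if id=$(cat "$parent/.(id)($name)" 2>/dev/null); then
            printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$id"
            return 0
        fi
        try=$((try + 1))
        sleep "$interval"
    done
    return 1
}

# Example call (directory from the test above):
# wait_for_pnfs_id /pnfs/fs/usr/cms/local_tests dircreateC-5
```

Comparing the printed timestamp with the pnfs log around that time might help to tell a delayed creation apart from a periodic repair run.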
Topic revision: r4 - 2009-09-09 - DerekFeichtinger