<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->
%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%

---+ 31.05.2013 VOMS Server Issue

---++ Problem

---+++ Initial Trigger

The investigation was triggered by a user request:

<pre>
Dear admins,

I'm having a problem trying to run some jobs on the T2. They are aborted
after a few minutes, and I don't see any reason why... Can you help me, please?

Thanks in advance.

Best regards,
Mario
</pre>

---+++ More Details

Looking at the problem together with the user, we found grid jobs aborted at CSCS without any useful exit code or error message. Running

<pre>crab -c <folder> -postMortem <job_id></pre>

showed error messages like

<pre>
error changing sandbox ownership to the user: condor_glexec_setup exited with status 256 and the following output:
[gLExec]: LCMAPS failed.; The reason can be found in the logfile.
</pre>

Checking the logfile =/var/log/glexec/lcas_lcmaps.log= shows the following:

<pre>
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: checkResponseSanity: Error: the decision for result[0] is Not Applicable. This means your request is not allowed to continue based on this decision.
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: oh_process_uidgid: Error: checkResponseSanity() returned a failure condition in the response message. Stopped looking into the obligations
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: Error: pep_authorize(request,response) failed. The Argus-PEP return code is: 9 with error message: "OH process error"
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: LCMAPS failed to do mapping and return account information
</pre>

---++ Solution

---+++ Initial Checks

It turns out the problem comes from an outdated VOMS configuration: for CMS, only the two VOMS servers *lcg-voms.cern.ch* and *voms.cern.ch* should be considered authoritative (see [[http://operations-portal.egi.eu/vo/view/voname/cms]]). On a worker node at CSCS:

<pre>
May 31 21:42 [root@wn70:~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Apr 9 12:38 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Apr 9 12:38 /etc/vomses/cms-voms.cern.ch
</pre>

so the worker node is configured correctly, whereas on our UI

<pre>
[root@t3ui02 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Feb 8 13:43 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Feb 8 13:43 /etc/vomses/cms-voms.cern.ch
-rw-r--r-- 1 root root 97 Feb 8 13:43 /etc/vomses/cms-voms.fnal.gov
</pre>

the old (wrong) configuration, still including the FNAL server, is in place. However, I could not find out why proxies for some users are generated against the FNAL VOMS, whereas for others (e.g. me) the (correct) CERN VOMS is used; maybe some kind of round robin? This apparently caused problems for our users before ([[https://savannah.cern.ch/support/?136697]]).

---+++ Root Cause

These "vomses" files are generated by YAIM, and checking =/root/YAIM-config/site-info.def= showed that it is not up to date.
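For reference, the CMS VO block in =site-info.def= should list only the two CERN servers. A minimal sketch of what that block could look like follows; the variable names are the ones changed in the mitigation below, but the DNs and port numbers shown here are typical published values rather than copies from our actual file, so verify them against the EGI operations portal entry linked above before use.

<pre>
# CMS VO section of site-info.def -- CERN VOMS servers only
# (DNs and ports are assumed typical values, double-check before use)
VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch cms' 'cms voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch cms'"
VO_CMS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
</pre>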
Looking at the Puppet configuration, I found that the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def</pre>

is up to date, but the file that is actually used,

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

is not.

---+++ Mitigation

I adapted the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

changing the variables =VO_CMS_VOMSES= and =VO_CMS_VOMS_CA_DN= (removing the FNAL entry from both). Then I ran on =t3ui07=:

<pre>
puppetd -t -v
rm -f /etc/vomses/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomses
rm -f /etc/grid-security/vomsdir/cms/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomsdir
</pre>

After this the configuration looks correct:

<pre>
[root@t3ui07 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 May 31 18:51 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 May 31 18:51 /etc/vomses/cms-voms.cern.ch
</pre>

and the user could then generate a correct proxy certificate on =t3ui07= (a quick check of which server issued a proxy is sketched at the end of this page).

-- Main.DanielMeister - 2013-05-31

----------------
%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%
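As an appendix (not part of the original troubleshooting session): a minimal sketch of how one might confirm on a UI which VOMS server signed a freshly created proxy. The commands are the standard voms-clients tools; the exact output fields may differ between versions.

<pre>
# create a fresh CMS proxy; voms-proxy-init picks a server from /etc/vomses
voms-proxy-init --voms cms

# the AC issuer/uri should point to lcg-voms.cern.ch or voms.cern.ch,
# not to voms.fnal.gov
voms-proxy-info --all | grep -E 'issuer|uri'
</pre>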