<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->
%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%

---+ 31.05.2013 VOMS Server Issue

---++ Problem

---+++ Initial Trigger

The investigation was triggered by a user request:

<pre>
Dear admins,

I'm having a problem trying to run some jobs on the T2. They are aborted
after a few minutes, and I don't see any reason why... Can you help me, please?

Thanks in advance.

Best regards,
Mario
</pre>

---+++ More Details

Looking at the problem together with the user, we found grid jobs aborted at CSCS without any useful exit code or error message. Running

<pre>crab -c <folder> -postMortem <job_id></pre>

showed error messages like

<pre>
error changing sandbox ownership to the user: condor_glexec_setup exited with status 256 and the following output:
[gLExec]: LCMAPS failed.; The reason can be found in the logfile.
</pre>

Checking the logfile =/var/log/glexec/lcas_lcmaps.log= shows the following:

<pre>
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: checkResponseSanity: Error: the decision for result[0] is Not Applicable. This means your request is not allowed to continue based on this decision.
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: oh_process_uidgid: Error: checkResponseSanity() returned a failure condition in the response message. Stopped looking into the obligations
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: Error: pep_authorize(request,response) failed. The Argus-PEP return code is: 9 with error message: "OH process error"
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: LCMAPS failed to do mapping and return account information
</pre>

---++ Solution

---+++ Initial Checks

It turns out the problem comes from an outdated VOMS configuration: for CMS, only the two VOMS servers *lcg-voms.cern.ch* and *voms.cern.ch* should be considered authoritative (see [[http://operations-portal.egi.eu/vo/view/voname/cms]]). On a worker node at CSCS:

<pre>
May 31 21:42 [root@wn70:~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Apr 9 12:38 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Apr 9 12:38 /etc/vomses/cms-voms.cern.ch
</pre>

so the worker node is configured correctly, whereas on our UI

<pre>
[root@t3ui02 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Feb 8 13:43 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Feb 8 13:43 /etc/vomses/cms-voms.cern.ch
-rw-r--r-- 1 root root 97 Feb 8 13:43 /etc/vomses/cms-voms.fnal.gov
</pre>

the old (wrong) configuration, still including the FNAL server, is in place. However, I could not find out why proxies for some users are generated against the FNAL VOMS, whereas for others (e.g. me) the (correct) CERN VOMS is used; maybe some kind of round robin? This apparently caused problems for our users before ([[https://savannah.cern.ch/support/?136697]]).

---+++ Root Cause

These "vomses" files are generated by YAIM, and checking =/root/YAIM-config/site-info.def= showed that it is not up to date.
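For reference, the CMS VO block in =site-info.def= should list only the two CERN servers. A minimal sketch of what that block could look like follows; the variable names are the ones changed in the mitigation below, but the DNs and port numbers shown here are typical published values rather than copies from our actual file, so verify them against the EGI operations portal entry linked above before use.

<pre>
# CMS VO section of site-info.def -- CERN VOMS servers only
# (DNs and ports are assumed typical values, double-check before use)
VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch cms' 'cms voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch cms'"
VO_CMS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
</pre>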
Looking at the Puppet configuration, I found that the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def</pre>

is up to date, but the file that is actually used,

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

is not.

---+++ Mitigation

I adapted the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

changing the variables =VO_CMS_VOMSES= and =VO_CMS_VOMS_CA_DN= (removing the FNAL entry from both). Then I ran on =t3ui07=:

<pre>
puppetd -t -v
rm -f /etc/vomses/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomses
rm -f /etc/grid-security/vomsdir/cms/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomsdir
</pre>

After this the configuration looks correct:

<pre>
[root@t3ui07 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 May 31 18:51 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 May 31 18:51 /etc/vomses/cms-voms.cern.ch
</pre>

and the user could then generate a correct proxy certificate on =t3ui07= (a quick check of which server issued a proxy is sketched at the end of this page).

-- Main.DanielMeister - 2013-05-31

----------------
%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%
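As an appendix (not part of the original troubleshooting session): a minimal sketch of how one might confirm on a UI which VOMS server signed a freshly created proxy. The commands are the standard voms-clients tools; the exact output fields may differ between versions.

<pre>
# create a fresh CMS proxy; voms-proxy-init picks a server from /etc/vomses
voms-proxy-init --voms cms

# the AC issuer/uri should point to lcg-voms.cern.ch or voms.cern.ch,
# not to voms.fnal.gov
voms-proxy-info --all | grep -E 'issuer|uri'
</pre>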