31.05.2013 VOMS Server Issue
Problem
Initial Trigger
The investigation of this issue was triggered by a user request:
Dear admins,
I'm having a problem trying to run some jobs on the T2.
They are aborted after a few minutes, and I don't see any reason why...
Can you help me, please?
Thanks in advance.
Best regards,
Mario
More Details
Looking at the problem together with the user, we found aborted grid jobs at CSCS without any useful exit code or error message. Running
crab -c <folder> -postMortem <job_id>
showed error messages like
error changing sandbox ownership to the user: condor_glexec_setup exited with status 256 and the following output: [gLExec]: LCMAPS failed.; The reason can be found in the logfile.
Checking the logfile
/var/log/glexec/lcas_lcmaps.log
shows the following
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: checkResponseSanity: Error: the decision for result[0] is Not Applicable. This means your request is not allowed to continue based on this decision.
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: oh_process_uidgid: Error: checkResponseSanity() returned a failure condition in the response message. Stopped looking into the obligations
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: Error: pep_authorize(request,response) failed. The Argus-PEP return code is: 9 with error message: "OH process error"
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: LCMAPS failed to do mapping and return account information
Solution
Initial checks
It turned out the problem comes from an outdated VOMS configuration: a proxy obtained from a VOMS server that is no longer authoritative carries attributes that the authorization stack behind gLExec (Argus/LCMAPS) does not accept. For CMS, only the two VOMS servers
lcg-voms.cern.ch and
voms.cern.ch should be considered authoritative (see
http://operations-portal.egi.eu/vo/view/voname/cms). On a worker node:
May 31 21:42 [root@wn70:~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Apr 9 12:38 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Apr 9 12:38 /etc/vomses/cms-voms.cern.ch
so the worker node is configured correctly, whereas on our UI:
[root@t3ui02 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Feb 8 13:43 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Feb 8 13:43 /etc/vomses/cms-voms.cern.ch
-rw-r--r-- 1 root root 97 Feb 8 13:43 /etc/vomses/cms-voms.fnal.gov
so our UIs still have the old (wrong) configuration.
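For reference, each file in /etc/vomses holds a single line of the form

"cms" "voms.cern.ch" "15002" "/DC=ch/DC=cern/OU=computers/CN=voms.cern.ch" "cms"

i.e. VO alias, server host, vomses port, DN of the server's host certificate, and VO name (the values above are illustrative, not copied from our hosts). The stale cms-voms.fnal.gov file therefore still points clients at the FNAL server.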
However, I could not find out why the FNAL VOMS server is used to generate proxies for some users, whereas for others (e.g. me) the (correct) CERN VOMS servers are used; maybe some kind of round robin?
This apparently caused problems for our users before (
https://savannah.cern.ch/support/?136697).
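Which server issued the attributes in an existing proxy can at least be read from the proxy itself: voms-proxy-info -all prints, among other things, a uri and an issuer line for each attribute certificate (exact field names may vary between voms-clients versions), so

voms-proxy-info -all | grep -E 'uri|issuer'

should show voms.fnal.gov for an affected user instead of one of the CERN servers.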
Root Cause
These "vomses" files are generated by yaim so checking
/root/YAIM-config/site-info.def
showed that this is not up to date.
Looking at the Puppet configuration, I found that the file
/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def
is up to date, but the file that is actually used,
/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui
is not.
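One way to confirm the mismatch is a plain diff of the two files:

diff /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui

which should show, among other differences, the stale FNAL entries in the VO_CMS_* variables of the __ui file.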
Mitigation
I adapted the file
/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui
by removing the FNAL entry from both of the variables VO_CMS_VOMSES and VO_CMS_VOMS_CA_DN.
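After the edit the two variables should look roughly like this (a sketch of the usual YAIM syntax, not a verbatim copy of our file):

VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch cms' 'cms voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch cms'"
VO_CMS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"

i.e. one quoted 'vo host port host-DN vo' entry per server in VO_CMS_VOMSES, and the matching CA DNs, in the same order, in VO_CMS_VOMS_CA_DN.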
Then, on t3ui07, I ran:
puppetd -t -v
rm -f /etc/vomses/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomses
rm -f /etc/grid-security/vomsdir/cms/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomsdir
After this, the configuration looks correct:
[root@t3ui07 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 May 31 18:51 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 May 31 18:51 /etc/vomses/cms-voms.cern.ch
and the user was then able to generate a correct VOMS proxy on t3ui07.
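As a quick user-side check (field names again depend on the voms-clients version), a freshly generated proxy should now carry attributes from one of the CERN servers:

voms-proxy-init -voms cms
voms-proxy-info -all | grep uri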
--
DanielMeister - 2013-05-31