

31.05.2013 VOMS Server Issue

Problem

Initial Trigger

The investigation was triggered by the following user request:

Dear admins,

I'm having a problem trying to run some jobs on the T2.
They are aborted after a few minutes, and I don't see any reason why...

Can you help me, please?

Thanks in advance.

Best regards,
Mario

More Details

Looking at the problem together with the user, we found aborted grid jobs at CSCS without any useful exit code or error message. Running

crab -c <folder> -postMortem <job_id>

showed error messages like

error changing sandbox ownership to the user: condor_glexec_setup exited with status 256 and the following output: [gLExec]: LCMAPS failed.; The reason can be found in the logfile.

Checking the logfile /var/log/glexec/lcas_lcmaps.log shows the following

glexec:lcmaps[30581]     LOG_ERR: 2013-05-31.14:07:14Z: checkResponseSanity: Error: the decision for result[0] is Not Applicable. This means your request is not allowed to continue based on this decision.
glexec:lcmaps[30581]     LOG_ERR: 2013-05-31.14:07:14Z: oh_process_uidgid: Error: checkResponseSanity() returned a failure condition in the response message. Stopped looking into the obligations
glexec:lcmaps[30581]     LOG_ERR: 2013-05-31.14:07:14Z: Error: pep_authorize(request,response) failed. The Argus-PEP return code is: 9 with error message: "OH process error"
glexec:lcmaps[30581]     LOG_ERR: 2013-05-31.14:07:14Z: LCMAPS failed to do mapping and return account information
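
To pull only the error lines out of that log, something like the following could be used. This is a minimal sketch, not what was run during the investigation: the function name lcmaps_errors is hypothetical, and it only assumes that error lines carry the LOG_ERR tag as in the excerpt above.

```shell
# Print the most recent LCMAPS error lines from a gLExec log.
# $1: log file (default: the path mentioned above)
# $2: how many trailing error lines to show (default: 20)
lcmaps_errors() {
    log="${1:-/var/log/glexec/lcas_lcmaps.log}"
    count="${2:-20}"
    grep 'LOG_ERR' "$log" | tail -n "$count"
}
```

Calling lcmaps_errors without arguments on the affected node shows the last 20 error lines; a second argument changes that number.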

Solution

Initial checks

The problem turned out to come from an outdated VOMS configuration. For CMS, only the two VOMS servers lcg-voms.cern.ch and voms.cern.ch should be considered authoritative (see http://operations-portal.egi.eu/vo/view/voname/cms). A worker node lists exactly these two:

May 31 21:42 [root@wn70:~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Apr  9 12:38 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Apr  9 12:38 /etc/vomses/cms-voms.cern.ch

So the worker node is configured correctly, whereas on our UI we find

[root@t3ui02 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Feb  8 13:43 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Feb  8 13:43 /etc/vomses/cms-voms.cern.ch
-rw-r--r-- 1 root root 97 Feb  8 13:43 /etc/vomses/cms-voms.fnal.gov

so our UIs still have the old (wrong) configuration.
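
This comparison could be scripted so that each UI is checked against the authoritative list. The following is a minimal sketch, not the procedure used here: the function name unexpected_vomses is hypothetical, and it assumes the <vo>-<server> file naming seen in the listings above.

```shell
# List vomses files for a VO whose server is not in the allowed list.
# Usage: unexpected_vomses DIR VO ALLOWED_SERVER...
# Assumes files are named <VO>-<server>, as in the listings above.
unexpected_vomses() {
    dir="$1"; vo="$2"; shift 2
    for f in "$dir/$vo"-*; do
        [ -e "$f" ] || continue          # glob matched nothing
        server="${f##*/}"                # strip the directory part
        server="${server#"$vo"-}"        # strip the "<vo>-" prefix
        ok=no
        for allowed in "$@"; do
            [ "$server" = "$allowed" ] && ok=yes
        done
        [ "$ok" = yes ] || echo "$f"     # report anything not allowed
    done
}
```

With the t3ui02 listing above, unexpected_vomses /etc/vomses cms lcg-voms.cern.ch voms.cern.ch would report /etc/vomses/cms-voms.fnal.gov, and nothing on the worker node.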

However, I could not find out why the FNAL VOMS is then used to generate proxies for some users, whereas for others (e.g. me) the (correct) CERN VOMS is used; maybe the client does some kind of round robin over the configured servers?
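
To see which VOMS server actually issued a given proxy, the output of voms-proxy-info -all can be inspected. The helper below is a minimal sketch under the assumption that each VO extension block in that output contains a line of the form "uri : host:port"; the function name proxy_voms_hosts is hypothetical.

```shell
# Read `voms-proxy-info -all` output on stdin and print the host of
# every VOMS server that issued an attribute certificate in the proxy.
# Assumes lines of the form "uri       : <host>:<port>".
proxy_voms_hosts() {
    sed -n 's/^uri[[:space:]]*:[[:space:]]*\([^:]*\):.*$/\1/p'
}
```

Intended usage is voms-proxy-info -all | proxy_voms_hosts, run as the affected user right after creating the proxy.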

This apparently caused problems for our users before (https://savannah.cern.ch/support/?136697).

Root Cause

These "vomses" files are generated by YAIM, so I checked /root/YAIM-config/site-info.def, which turned out not to be up to date.

Looking at the puppet configuration I found that the file

/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def

is up to date, but the file that is actually used is not:

/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui

Mitigation

I adapted the file

/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui

changing the variables VO_CMS_VOMSES and VO_CMS_VOMS_CA_DN (removing the FNAL entry from both).
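
A quick way to confirm that no FNAL server entry is left in an edited file is to search it for the host name. This is a minimal sketch with a hypothetical helper name, no_voms_refs. It only finds lines that mention the host name, so the corresponding VO_CMS_VOMS_CA_DN entry still has to be checked by hand.

```shell
# Succeed (exit 0) only if FILE no longer mentions HOST.
# Matching lines, if any, are printed with their line numbers.
# Usage: no_voms_refs HOST FILE
no_voms_refs() {
    [ -r "$2" ] || { echo "cannot read $2" >&2; return 2; }
    if grep -n -F -e "$1" "$2"; then
        return 1     # stale references were printed above
    fi
    return 0
}
```

For example, no_voms_refs voms.fnal.gov /root/YAIM-config/site-info.def can be run on a UI after the puppet run to check that the updated file has arrived.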

Then I ran the following on t3ui07:

puppetd -t -v
rm -f /etc/vomses/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomses
rm -f /etc/grid-security/vomsdir/cms/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomsdir

After this, the configuration appears to be correct:

[root@t3ui07 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 May 31 18:51 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 May 31 18:51 /etc/vomses/cms-voms.cern.ch

and the user could then generate a correct proxy certificate on t3ui07.

-- DanielMeister - 2013-05-31



Topic revision: r1 - 2013-05-31 - DanielMeister
 