<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->
%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%

---+ 31.05.2013 VOMS Server Issue

---++ Problem

---+++ Initial Trigger

The investigation of this issue was triggered by a user request:

<pre>
Dear admins,

I'm having a problem trying to run some jobs on the T2. They are aborted
after a few minutes, and I don't see any reason why... Can you help me,
please?

Thanks in advance.
Best regards,
Mario
</pre>

---+++ More Details

Looking into the problem with the user, we found grid jobs aborted at CSCS without any useful exit code or error message. Running

<pre>crab -c <folder> -postMortem <job_id></pre>

showed error messages like

<pre>error changing sandbox ownership to the user: condor_glexec_setup exited with status 256 and the following output: [gLExec]: LCMAPS failed.; The reason can be found in the logfile.</pre>

The logfile =/var/log/glexec/lcas_lcmaps.log= shows the following:

<pre>
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: checkResponseSanity: Error: the decision for result[0] is Not Applicable. This means your request is not allowed to continue based on this decision.
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: oh_process_uidgid: Error: checkResponseSanity() returned a failure condition in the response message. Stopped looking into the obligations
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: Error: pep_authorize(request,response) failed.
The Argus-PEP return code is: 9 with error message: "OH process error"
glexec:lcmaps[30581] LOG_ERR: 2013-05-31.14:07:14Z: LCMAPS failed to do mapping and return account information
</pre>

---++ Solution

---+++ Initial Checks

The problem turned out to come from an outdated VOMS configuration: for CMS, only the two VOMS servers *lcg-voms.cern.ch* and *voms.cern.ch* should be considered authoritative (see [[http://operations-portal.egi.eu/vo/view/voname/cms]]). On a worker node the configuration is correct:

<pre>
May 31 21:42 [root@wn70:~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Apr  9 12:38 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Apr  9 12:38 /etc/vomses/cms-voms.cern.ch
</pre>

whereas on our UI

<pre>
[root@t3ui02 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 Feb  8 13:43 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 Feb  8 13:43 /etc/vomses/cms-voms.cern.ch
-rw-r--r-- 1 root root 97 Feb  8 13:43 /etc/vomses/cms-voms.fnal.gov
</pre>

the old (wrong) configuration, which still lists the FNAL server, is in place. I could not determine why the FNAL VOMS is used to generate proxies for some users, while for others (e.g. me) the (correct) CERN VOMS is used; possibly some kind of round robin. This apparently caused problems for our users before ([[https://savannah.cern.ch/support/?136697]]).

---+++ Root Cause

The "vomses" files are generated by yaim, and checking =/root/YAIM-config/site-info.def= showed that it is not up to date. Looking at the puppet configuration, I found that the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/desk/modules/Tier3/files/RedHat/root/YAIM-config/site-info.def</pre>

is up to date, but the file that is actually used is not:
<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

---+++ Mitigation

I adapted the file

<pre>/afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/Tier3/files/RedHat/5/root/YAIM-config/site-info.def__ui</pre>

changing the variables =VO_CMS_VOMSES= and =VO_CMS_VOMS_CA_DN= (removing the FNAL entry from both). Then, on =t3ui07=, I ran

<pre>
puppetd -t -v
rm -f /etc/vomses/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomses
rm -f /etc/grid-security/vomsdir/cms/*
/opt/glite/yaim/bin/yaim -e -s /root/YAIM-config/site-info.def -n UI -r -f config_vomsdir
</pre>

After this the configuration is correct:

<pre>
[root@t3ui07 ~]# ls -lah /etc/vomses/cms-*
-rw-r--r-- 1 root root 94 May 31 18:51 /etc/vomses/cms-lcg-voms.cern.ch
-rw-r--r-- 1 root root 86 May 31 18:51 /etc/vomses/cms-voms.cern.ch
</pre>

and the user could then generate a correct proxy on =t3ui07=.

-- Main.DanielMeister - 2013-05-31

----------------
%ICON{arrowleft}% Go to [[CMSTier3Log44][previous page]] / [[CMSTier3Log46][next page]] of Tier3 site log %M%
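As a follow-up note: since the same stale configuration could reappear on other UIs, the check performed above by hand (only the two CERN servers may appear under =/etc/vomses= for CMS) can be scripted. This is only a sketch, not part of the original fix; the function name and the directory argument are hypothetical, while the list of authoritative servers is the one given above.

```shell
# check_cms_vomses: hypothetical helper, not part of the original fix.
# Only lcg-voms.cern.ch and voms.cern.ch are authoritative for CMS, so any
# other cms-* file in the given vomses directory indicates a stale
# configuration. Prints each stale entry and returns non-zero if any found.
check_cms_vomses() {
    dir="${1:-/etc/vomses}"   # directory to check, defaults to /etc/vomses
    rc=0
    for f in "$dir"/cms-*; do
        [ -e "$f" ] || continue                  # no cms-* files at all
        case "${f##*/}" in
            cms-lcg-voms.cern.ch|cms-voms.cern.ch)
                ;;                               # expected, authoritative
            *)
                echo "stale VOMS entry: $f"      # e.g. cms-voms.fnal.gov
                rc=1
                ;;
        esac
    done
    return $rc
}
```

Run against the =t3ui02= state shown above, this would print one line for =cms-voms.fnal.gov= and exit non-zero, while on a correctly configured node (like =wn70= or the fixed =t3ui07=) it would print nothing and succeed.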