<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log0][previous page]] / [[CMSTier3Log2][next page]] of Tier3 site log %M%

---+ 06. 10. 2008 Test user feedback

In the week of Oct 6th we are working with test users to get the user environment production ready.

---++ Feedback from Main.FredericRonga:

---+++ %Y% ROOT/libX error

   * ROOT error <br>(Maybe because the system libraries are 64-bit, but the CMS software (including ROOT) is 32-bit?)
<verbatim>
> root -l PATLayer1_Output.fromScratch_fast.root
root: error while loading shared libraries: libXpm.so.4: cannot open shared object file: No such file or directory
</verbatim>

%GREEN% Main.DerekFeichtinger: I installed =xorg-x11-devel.i386= and other i386 architecture dependencies on the UI.%ENDCOLOR%

---+++ %Y% Enable getting Kerberos tickets from CERN

   * can't get a Kerberos ticket (for CVS checkout)
<verbatim>
> cvs co -rCMSSW_1_6_12 PhysicsTools/PatAlgos
cvs [checkout aborted]: kerberos authentication failed: You have no tickets cached
> kinit
kinit(v5): Client not found in Kerberos database while getting initial credentials
</verbatim>

<br>%GREEN% Main.DerekFeichtinger: I added the CERN.CH realm to the =/etc/krb5.conf= and =/etc/krb.conf= files on the UI. Users can now get Kerberos5 and Kerberos4 tickets from the CERN realm. Both can be used to access the CMS CVS, but you need different CVSROOT settings.
<pre>
# using Kerberos5
kinit cern-username@CERN.CH
cvs -d :gserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW co -d test2 COMP/PHEDEX

# using Kerberos4
kinit -4 dfeichti@CERN.CH
cvs -d :kserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW co -d test COMP/PHEDEX
</pre>
Please test.%ENDCOLOR%

---+++ %X% need to source cmsset_default.(c)sh

   * %X% need to source =$VO_CMS_SW_DIR/cmsset_default.csh= (see [[HowToSubmitJobs#an_example_CMSSW_job][here]]) <br> %GREEN% Main.DerekFeichtinger: This was a deliberate decision. I wanted users to be able to run without a CMS environment if they want. Automatic sourcing was active for some time, but I decided against it. Should we activate it again? %ENDCOLOR%

---+++ %Y% tcsh support for SGE

   * no =sge.csh= in =/etc/profile.d/= (FIXED) <br>%GREEN% Main.DerekFeichtinger: added CSH environment file.%ENDCOLOR%

---+++ %ICON{"choice-no"}% /tmp is too small

%GREEN% Main.DerekFeichtinger: The default installation we used for the UI created only a 2GB partition for /tmp. This is too small, because many users use this area for tests (e.g. test copies of SE files). I will put /tmp onto the root partition for now and repartition cleanly at a later point. A temporary fix is in place.%ENDCOLOR%

---+++ %ICON{"choice-no"}% CRAB resubmission does not work

Frederic reported that =crab -resubmit= does not work correctly.

%GREEN% Main.DerekFeichtinger: Currently, work must be done using only =-create= and =-submit=. The =-getoutput= option has no meaning on the Tier-3, since the output is copied back to the shared directory anyway. This behavior on the local batch system is partly responsible for the problem (there is a flaw in the general implementation of how CRAB treats this case).
Main.ZhilingChen is working on this issue.%ENDCOLOR%

---++ Feedback from Main.ChristinaEggel:

---+++ %Y% Failure of a particular CMSSW run (due to a SQLite access issue)

Christina sees one of her CMSSW jobs fail when it is submitted to the queue.

Debugging (Main.DerekFeichtinger):

Environment setup for interactive tests on the WN:
<pre>
CMSSW_DIR=/shome/eggel/examples
CMSSW_CONFIG_FILE=$CMSSW_DIR/python/HFExampleBJK_cfg.py
source $VO_CMS_SW_DIR/cmsset_default.sh
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
</pre>

An strace of cmsRun shows that the process gets stuck very badly after this system call (not even SIGKILL stops it; the parent has to be killed):
<verbatim>
....
[pid 9502] fcntl64(11, 0xd /* F_??? */ <unfinished ...>
</verbatim>

Looking interactively, we see that fd 11 belongs to this file:

=/swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db=

It is unclear which fcntl command was being attempted (F_???). It might be an issue with locking or similar (but locking a public file on the shared area would be very stupid).

This is a SQLite file. To open it interactively with sqlite3 (also used by CMSSW), I needed to install additional packages on the WN:
   * ncurses.i386
   * readline.i386

<pre>
which sqlite3
/swshare/cms/slc4_ia32_gcc345/external/sqlite/3.4.0-cms2/bin/sqlite3

sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db
</pre>

I can attach to the file, but when I give a command like ".schema" the process gets stuck.

Ok. Doing an strace of the interactive sqlite3 session:
<pre>
strace -fo sqlite-strace.log sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db
</pre>

And this nicely reproduces the problem. The process gets stuck at:
<pre>
...
9747 fcntl64(3, 0xd /* F_??? */
</pre>

Ok.... On the UI, I can do all of the above.
I can open the file and list the schema without problems.

Difference in the NFS mount:
   * On the UI: t3nfs01:/swshare on /swshare type nfs (rw,nolock,addr=...)
   * On the WN: t3nfs01:/swshare on /swshare type nfs (rw,addr=...)

Changing the mount options on the WN:
<pre>
mount -o remount,rw,nolock,addr=... t3nfs01:/swshare /swshare
</pre>

This does not solve the problem (and the rationale would have been difficult to understand). The WN process still gets stuck.

Updated the kernel on that machine to 2.6.9-67.0.15.ELsmp. This at least had the effect that the call returned after 20 seconds or so, with "Error: database is locked". An strace of this showed:
<pre>
4154 fcntl64(3, 0xd /* F_??? */, 0xffff33e0) = -1 ENOLCK (No locks available)
</pre>

Looking at the =fcntl64= call: in =/usr/lib/x86_64-redhat-linux3E/include/bits/fcntl.h= (hopefully the right file...) we can find the definitions for the call. 0xd translates to 13:
<pre>
# define F_SETLK64 13 /* Set record locking info (non-blocking). */
</pre>

Ok... this suggests a very simple reason (for bashing myself....). The NFS lock daemons were not running. Turning them on (and making sure that they stay on!!!!) enabled sqlite to access the file correctly....

Please test whether your jobs work now.

They do :-)

-- Main.DerekFeichtinger - 07 Oct 2008

----------------

%ICON{arrowleft}% Go to [[CMSTier3Log0][previous page]] / [[CMSTier3Log2][next page]] of Tier3 site log %M%
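
The two manual checks from the SQLite debugging above (translating the numeric fcntl command that strace prints, and comparing the NFS mount options of UI and WN) could be scripted roughly as follows. This is only a sketch: the function names =fcntl_cmd_name=, =has_nolock= and =statd_running= are made up for this illustration, the command numbers are the ones used by 32-bit Linux processes, and the assumption that the relevant lock daemon shows up as a process named =rpc.statd= is not taken from the log above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any installed tool.

# Translate the numeric fcntl command shown by strace (e.g. 0xd) into its name.
# Values are those of 32-bit Linux: F_GETLK64=12, F_SETLK64=13, F_SETLKW64=14.
fcntl_cmd_name() {
    num=$(printf '%d' "$1") || return 1
    case "$num" in
        12) echo F_GETLK64 ;;
        13) echo F_SETLK64 ;;
        14) echo F_SETLKW64 ;;
        *)  echo "unknown($num)" ;;
    esac
}

# Report whether one line of mount output lists the nolock option.
has_nolock() {
    opts=${1##*\(}      # drop everything up to the last "("
    opts=${opts%%\)*}   # drop the closing ")" and anything after it
    case ",$opts," in
        *,nolock,*) echo yes ;;
        *)          echo no ;;
    esac
}

# ASSUMPTION: the NFS status daemon runs under the process name rpc.statd.
statd_running() {
    if pgrep -x rpc.statd >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}
```

For example, =fcntl_cmd_name 0xd= prints =F_SETLK64=, and =has_nolock= prints =yes= for the UI mount line and =no= for the WN mount line quoted above.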
Topic revision: r11 - 2008-10-30 - DerekFeichtinger
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.