Go to previous page / next page of Tier3 site log
06. 10. 2008 Test user feedback
In the week of Oct 6th we are working with test users to get the user environment production-ready.
ROOT/libX error
Enable getting Kerberos tickets from CERN
need to source cmsset_default.(c)sh
- need to source $VO_CMS_SW_DIR/cmsset_default.csh (see here)
DerekFeichtinger: This was a deliberate decision. I wanted users to be able to run without a CMS environment if they want. The automatic sourcing was active for some time, but I decided against it. Should we activate it again?
tcsh support for SGE
- no sge.csh in /etc/profile.d/ (FIXED)
DerekFeichtinger: added CSH environment file.
Failure of a particular CMSSW run
Christina sees a certain CMSSW job of hers fail if submitted to the queue.
Debugging (DerekFeichtinger):
Environment setup for interactive tests on the WN:
CMSSW_DIR=/shome/eggel/examples
CMSSW_CONFIG_FILE=$CMSSW_DIR/python/HFExampleBJK_cfg.py
source $VO_CMS_SW_DIR/cmsset_default.sh
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
An strace of cmsRun shows that the process hangs hard in the following system call (not even SIGKILL stops it; the parent has to be killed):
....
[pid 9502] fcntl64(11, 0xd /* F_??? */ <unfinished ...>
Looking interactively, we see that fd 11 belongs to this file.
/swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db
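The fd-to-file lookup can be done through /proc. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper name fd_target is made up for illustration, and the PID and fd number in the usage comment are the ones from the strace output above.

```shell
# fd_target PID FD: print the file that descriptor FD of process PID points to.
# fd_target is a hypothetical helper name; it only wraps readlink on /proc.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Usage with the values from the strace output (PID 9502, fd 11):
#   fd_target 9502 11
```

This needs to be run as the owner of the process (or as root), since the fd directory of other users' processes is not readable.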
Which fcntl command was being attempted is unclear, because strace does not decode it (F_???). It might be some issue with locking or similar (but locking a public file on the shared area would be very stupid).
This is a SQLite file. To open it interactively with sqlite3 (which CMSSW also uses), I needed to install additional packages on the WN:
- ncurses.i386
- readline.i386
which sqlite3
/swshare/cms/slc4_ia32_gcc345/external/sqlite/3.4.0-cms2/bin/sqlite3
sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db
I can attach to the file. But when I give a command like ".schema" the process gets stuck.
Ok. Doing an strace of the interactive sqlite3 session:
strace -fo sqlite-strace.log sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db
And this nicely reproduces the problem. The process gets stuck at:
...
9747 fcntl64(3, 0xd /* F_??? */
Ok.... On the UI, I can do all of the above. I can open the file and list the schema without problem.
Difference in the NFS mount:
- On the UI: t3nfs01:/swshare on /swshare type nfs (rw,nolock,addr=...)
- On the WN: t3nfs01:/swshare on /swshare type nfs (rw,addr=...)
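A comparison like the one above can be scripted. This is a sketch under the assumption that /proc/mounts is available; mount_opts is a hypothetical helper name, and the option names the kernel reports in /proc/mounts can differ slightly from what mount prints.

```shell
# mount_opts MOUNTPOINT [TABLE]: print the option string for MOUNTPOINT.
# TABLE defaults to /proc/mounts (fields: device, mount point, type, options, ...).
# mount_opts is a hypothetical helper name, not an existing tool.
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Usage: run on both the UI and the WN and compare the two lines.
#   mount_opts /swshare
```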
Changing the mount options on the WN:
mount -o remount,rw,nolock,addr=... t3nfs01:/swshare /swshare
This does not solve the problem (and it would have been hard to explain why it should). The WN process still gets stuck.
Updated the kernel on that machine to 2.6.9-67.0.15.ELsmp. This at least had the effect that the call returned after 20 seconds or so, with "Error: database is locked". An strace of this showed:
4154 fcntl64(3, 0xd /* F_??? */, 0xffff33e0) = -1 ENOLCK (No locks available)
Looking at the fcntl64 call: in /usr/lib/x86_64-redhat-linux3E/include/bits/fcntl.h (hopefully the right file...) we can find the definitions for the call. 0xd translates to 13:
# define F_SETLK64 13 /* Set record locking info (non-blocking). */
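Since strace printed only the raw number, the translation can be written down as a small lookup. This is a sketch only: fcntl_cmd_name is a hypothetical helper, and the table covers just the locking commands with their 32-bit Linux (i386) values, consistent with the define quoted above. Other platforms use different numbers.

```shell
# fcntl_cmd_name NUM: translate a raw fcntl command number (decimal or 0x hex)
# into its name. Only the locking commands are listed, with the values used
# on 32-bit Linux (i386); this is an illustrative table, not a complete one.
fcntl_cmd_name() {
    case $(( $1 )) in
        5)  echo F_GETLK ;;
        6)  echo F_SETLK ;;
        7)  echo F_SETLKW ;;
        12) echo F_GETLK64 ;;
        13) echo F_SETLK64 ;;
        14) echo F_SETLKW64 ;;
        *)  echo "unknown ($1)" ;;
    esac
}

fcntl_cmd_name 0xd    # prints F_SETLK64
```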
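Ok... this suggests a very simple reason (for bashing myself....). F_SETLK64 is a record-locking request, and on an NFS mount such requests depend on the NFS lock daemons. The NFS lock daemons were not running.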
Turning them on (and making sure that they stay!!!!) enabled sqlite to access the file correctly....
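A check for this condition could look like the sketch below. It assumes that rpcinfo -p lists the lock manager as nlockmgr and the status monitor as status, and that the init script is called nfslock, as on Red Hat style systems; has_nfs_lock_services is a hypothetical helper name, and the exact commands used on the WN are not recorded here.

```shell
# has_nfs_lock_services: read `rpcinfo -p` output on stdin and succeed only if
# both the lock manager (nlockmgr) and the status monitor (status) are registered.
# Hypothetical helper; it looks only at the last column of each line.
has_nfs_lock_services() {
    awk '$NF == "nlockmgr" { lock = 1 }
         $NF == "status"   { mon = 1 }
         END { exit !(lock && mon) }'
}

# Usage on a node (commented out: needs root, and the service name nfslock
# is an assumption):
#   rpcinfo -p | has_nfs_lock_services || service nfslock start
#   chkconfig nfslock on    # keep the daemons enabled across reboots
```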
Please test whether your jobs work now. They do.
-- DerekFeichtinger - 07 Oct 2008