06. 10. 2008 Test user feedback

In the week of Oct 6th we are working with test users to get the user environment production ready.

Feedback from FredericRonga:

DONE ROOT/libX error

  • ROOT error
    (Maybe because the system libraries are 64-bit, but the CMS software (including ROOT) is 32-bit?)
    > root -l PATLayer1_Output.fromScratch_fast.root
    root: error while loading shared libraries: libXpm.so.4: cannot open shared object file: No such file or directory
    
    DerekFeichtinger: I installed xorg-x11-devel.i386 and other i386 architecture dependencies on the UI.
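A quick way to see which shared libraries a binary cannot resolve is to filter the output of ldd. This is only a minimal sketch, not necessarily how the missing packages were identified on the UI; the helper name list_missing_libs is hypothetical.

```shell
# Hypothetical helper: reads `ldd` output on stdin and prints the names of
# the libraries that the dynamic loader could not resolve ("not found").
list_missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example (assumes `root` is in PATH and is an ELF binary, not a wrapper script):
#   ldd "$(command -v root)" | list_missing_libs
```

On a 64-bit host running 32-bit software, each name printed this way points at an i386 package that still has to be installed.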

DONE Enable getting Kerberos tickets from CERN

  • can't get a Kerberos ticket (for CVS checkout)
    > cvs co -rCMSSW_1_6_12 PhysicsTools/PatAlgos
    cvs [checkout aborted]: kerberos authentication failed: You have no tickets cached
    > kinit
    kinit(v5): Client not found in Kerberos database while getting initial credentials
    

    DerekFeichtinger: I added the CERN.CH realm to the /etc/krb5.conf and /etc/krb.conf files on the UI. Users can now get Kerberos5 and Kerberos4 tickets from the CERN realm. Both can be used to access the CMS CVS, but each needs a different CVSROOT setting:
    # using Kerberos5
    kinit cern-username@CERN.CH
    cvs -d :gserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW co -d test2 COMP/PHEDEX
    
    # using Kerberos4
    kinit -4 dfeichti@CERN.CH
    cvs -d :kserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW co -d test COMP/PHEDEX
    Please test.
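Instead of passing -d on every call, the matching root can be exported as CVSROOT once per session. The sketch below simply wraps the two roots quoted above; the function name cms_cvsroot is hypothetical and the example is for sh/bash only.

```shell
# Hypothetical helper: print the CVSROOT that matches the Kerberos version of
# the ticket. The two roots are the ones quoted in the log entry above.
cms_cvsroot() {
    case "$1" in
        5) echo ":gserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW" ;;
        4) echo ":kserver:cmscvs.cern.ch:/cvs_server/repositories/CMSSW" ;;
        *) echo "usage: cms_cvsroot 4|5" >&2; return 1 ;;
    esac
}

# Example (requires a valid ticket obtained with kinit):
#   CVSROOT=$(cms_cvsroot 5); export CVSROOT
#   cvs co -d test2 COMP/PHEDEX
```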

ALERT! need to source cmsset_default.(c)sh

  • ALERT! need to source $VO_CMS_SW_DIR/cmsset_default.csh (see here)
    DerekFeichtinger: This was actually decided on purpose. I wanted to provide the possibility for users to run without a CMS environment if they want. The automatic sourcing was activated for some time, but I decided against it. Should we activate it again?
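One possible middle ground, sketched here as an assumption and not as a decision taken for the Tier-3, is to leave the automatic sourcing off and let users opt in from their own shell startup file. The function name cms_env is hypothetical, and the sketch covers sh/bash only (tcsh users would source cmsset_default.csh instead).

```shell
# Hypothetical opt-in helper for ~/.bashrc or ~/.profile: load the CMS
# environment only when asked, and only if the setup script is readable.
cms_env() {
    script="${VO_CMS_SW_DIR:-}/cmsset_default.sh"
    if [ -n "${VO_CMS_SW_DIR:-}" ] && [ -r "$script" ]; then
        . "$script"
    else
        echo "cms_env: cannot read $script" >&2
        return 1
    fi
}
```

Users who always want the CMS environment can call cms_env at the end of their startup file; everyone else keeps a clean shell.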

DONE tcsh support for SGE

  • no sge.csh in /etc/profile.d/ (FIXED)
    DerekFeichtinger: added CSH environment file.

NOT DONE /tmp is too small

DerekFeichtinger: The default installation we used for the UI only created a 2GB partition for /tmp. This is too small, because many users use this area for tests (e.g. test copies of SE files). As a temporary fix, /tmp has been put onto the root partition; a clean repartitioning will follow at a later point.
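Until the repartitioning is done, users copying large test files can check the remaining space first. A minimal sketch using POSIX df; the helper name free_kb and the 1 GB threshold in the example are arbitrary assumptions.

```shell
# Hypothetical helper: print the free space, in kilobytes, of the filesystem
# that holds the given directory (default: /tmp).
free_kb() {
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# Example: warn before copying a large test file from the SE into /tmp.
#   [ "$(free_kb /tmp)" -gt 1048576 ] || echo "less than 1 GB free in /tmp" >&2
```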

NOT DONE CRAB resubmission does not work

Frederic reported that crab -resubmit does not work correctly.

DerekFeichtinger: For now, please work with -create and -submit only. The -getoutput option has no meaning on the Tier-3, since the output is copied back to the shared directory anyway. This behavior on the local batch system is in part responsible for the problem (there is a flaw in the general implementation of how CRAB treats this case). ZhilingChen is working on this issue.

Feedback from ChristinaEggel:

DONE Failure of a particular CMSSW run (due to a SQLite access issue)

Christina sees a certain CMSSW job of hers fail if submitted to the queue.

Debugging (DerekFeichtinger):

Environment setup for interactive tests on the WN:

CMSSW_DIR=/shome/eggel/examples
CMSSW_CONFIG_FILE=$CMSSW_DIR/python/HFExampleBJK_cfg.py
source $VO_CMS_SW_DIR/cmsset_default.sh
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`

An strace of cmsRun shows that the process gets stuck in a very bad way after this system call (not even SIGKILL will stop it; the parent needs to be killed):

....
[pid  9502] fcntl64(11, 0xd /* F_??? */ <unfinished ...>

Looking interactively, we see that fd 11 belongs to this file: /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db

But which fcntl command was being attempted is unclear (strace prints only F_???). It might be some issue with locking or similar (but locking a public file on the shared area would be very stupid).
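On Linux, the file behind a descriptor of a running process can be read from /proc. The log does not say which tool was used to look up fd 11, so the following is just one way to do it; the helper name fd_target is hypothetical.

```shell
# Hypothetical helper: print the file that descriptor <fd> of process <pid>
# refers to, using the Linux /proc filesystem.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Example with the PID and descriptor from the strace output:
#   fd_target 9502 11
```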

This is a SQLite file. To open it interactively with sqlite3 (also used by CMSSW), I needed to install additional packages on the WN:

  • ncurses.i386
  • readline.i386

which sqlite3
/swshare/cms/slc4_ia32_gcc345/external/sqlite/3.4.0-cms2/bin/sqlite3

sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db

I can attach to the file. But when I give a command like ".schema" the process gets stuck.

Ok. Doing an strace of the interactive sqlite3 session:

strace -fo sqlite-strace.log sqlite3 /swshare/cms/slc4_ia32_gcc345/cms/data-CondCore-SQLiteData/24/CondCore/SQLiteData/data/MVAJetTagsFakeConditions.db

And this nicely reproduces the problem. The process gets stuck at:

...
9747  fcntl64(3, 0xd /* F_??? */

Ok.... On the UI, I can do all of the above. I can open the file and list the schema without problem.

Difference in the NFS mount:

  • On the UI: t3nfs01:/swshare on /swshare type nfs (rw,nolock,addr=...)
  • On the WN: t3nfs01:/swshare on /swshare type nfs (rw,addr=...)
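The comparison can be scripted by pulling the option list for a mount point out of the mount output. With nolock the client does not contact the server's lock manager and handles locks locally, which is consistent with the UI not showing the problem. The helper name mount_opts is hypothetical.

```shell
# Hypothetical helper: reads `mount` output on stdin and prints the option
# list of the given mount point, without the surrounding parentheses.
mount_opts() {
    awk -v mp="$1" '$2 == "on" && $3 == mp { gsub(/[()]/, "", $6); print $6 }'
}

# Example: check whether /swshare is mounted with nolock on this host.
#   mount | mount_opts /swshare | tr ',' '\n' | grep -x nolock
```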

Changing the mount options on the WN:

mount -o remount,rw,nolock,addr=... t3nfs01:/swshare /swshare

This does not solve the problem (and the rationale would have been difficult to understand). WN process still gets stuck.

I updated the kernel on that machine to 2.6.9-67.0.15.ELsmp. This at least had the effect that the call returned after 20 seconds or so, with "Error: database is locked". An strace of this showed:

4154  fcntl64(3, 0xd /* F_??? */, 0xffff33e0) = -1 ENOLCK (No locks available)

Looking at the fcntl64 call: in /usr/lib/x86_64-redhat-linux3E/include/bits/fcntl.h (hopefully the right file...) we can find the definitions of the command numbers. 0xd translates to 13:

# define F_SETLK64      13      /* Set record locking info (non-blocking).  */
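The lookup from the hexadecimal value printed by strace to the macro name can be scripted. A minimal sketch; the function name fcntl_cmd_names is hypothetical, and the header path in the example is an assumption that differs between systems.

```shell
# Hypothetical helper: reads a C header on stdin and prints the F_* macros
# whose value equals the command number given as $1 (decimal or 0x hex).
fcntl_cmd_names() {
    num=$(printf '%d' "$1") || return 1
    awk -v n="$num" '
        { sub(/^#[ \t]*/, "#") }   # normalise "# define" to "#define"
        $1 == "#define" && $2 ~ /^F_/ && $3 == n { print $2 }
    '
}

# Example (header location is an assumption):
#   fcntl_cmd_names 0xd < /usr/include/bits/fcntl.h
```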

Ok... this suggests a very simple cause (and a reason to kick myself): the NFS lock daemons were not running. Turning them on (and making sure that they stay on!) enabled sqlite to access the file correctly.
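Whether the locking services are registered can be checked from the rpcinfo output. The log does not name the commands that were used, so this is a sketch under assumptions: the helper name nfs_lock_services_up is hypothetical, and the nfslock init script mentioned in the comments is the usual Red Hat mechanism of that era, not something confirmed here.

```shell
# Hypothetical helper: reads `rpcinfo -p` output on stdin and succeeds only if
# both the NFS lock manager (nlockmgr) and the status monitor (status) are
# registered with the portmapper.
nfs_lock_services_up() {
    awk '$NF == "nlockmgr" { lock = 1 }
         $NF == "status"   { mon = 1 }
         END { exit !(lock && mon) }'
}

# Example on a worker node:
#   rpcinfo -p | nfs_lock_services_up || echo "NFS locking services missing" >&2
# Assumed way to start the daemons and keep them enabled across reboots:
#   service nfslock start && chkconfig nfslock on
```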

Please test whether your jobs work now. They do :-)

-- DerekFeichtinger - 07 Oct 2008


Topic revision: r11 - 2008-10-30 - DerekFeichtinger
 