https://github.com/jpata/cms-ch-ops

Lists to follow

Monitoring

Some central monitoring links from the CERN-kibana (modern, experimental)

  • CSCS job type fractions 7d 90d
  • CSCS job failure rate 7d 90d
  • CSCS CPU efficiency 7d 90d
  • Job exit code (production) 7d
  • CRAB exit code (analysis) 7d
  • Compare 3 sites, job failures 7d
  • Compare 3 sites, job fractions 7d

Site Readiness

  • Site Readiness Logic
  • CSCS Site Readiness vs ALL T2s Site Readiness
  • SSB: 1 2
  • Site Availability Metrics, Computing Element availability
  • glExec failures (Kibana)
  • Usable sites: this file is consumed by CRAB3 to decide whether CSCS is OK or not
  • Last 24h of CMS Site Readiness @ CSCS :
    # https://dashb-ssb.cern.ch/dashboard/request.py/sitehistory?site=T2_CH_CSCS#currentView=Site+Readiness&time=custom&start_date=2016-09-28&end_date=2016-09-29&values=false&spline=false&white=false
    TODAY=`date +%F -d "-0 days"`
    YESTERDAY=`date +%F -d "-1 days"`
    lynx --dump "http://dashb-ssb.cern.ch/dashboard/request.py/getsiteplotdata?site=T2_CH_CSCS&view=Site%20Readiness&time=custom&dateFrom=${YESTERDAY}&dateTo=${TODAY}&prettyprint" | egrep --color '"HC glidein"|"Prod Status"|"Site Readiness"|"Site SAM availability"|"TopologyMaintenances"|"Maintenance saddlebrown"|"Maintenance brown"|"Error"|"Warning"|"OK"|$'
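To query a period other than the last 24 hours, the URL used in the one-liner above can be built from explicit dates. The following is a minimal sketch; the function name ssb_readiness_url is hypothetical, and the URL layout is simply copied from the command above.

```shell
#!/bin/sh
# Build the SSB "getsiteplotdata" URL for a site and a date range (YYYY-MM-DD).
# ssb_readiness_url is a hypothetical helper name, not an existing tool.
ssb_readiness_url() {
    site="$1"
    date_from="$2"
    date_to="$3"
    # %%20 prints a literal %20 (URL-encoded space in "Site Readiness")
    printf 'http://dashb-ssb.cern.ch/dashboard/request.py/getsiteplotdata?site=%s&view=Site%%20Readiness&time=custom&dateFrom=%s&dateTo=%s&prettyprint\n' \
        "$site" "$date_from" "$date_to"
}
```

The result can then be passed to lynx as before, for example lynx --dump "$(ssb_readiness_url T2_CH_CSCS 2016-09-22 2016-09-29)".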

Tickets vs CSCS

GlideInWMS Jobs

Doc

Global Pool

Since 09/11/2016 the cms-gwmsmon website requires your X509 certificate in the browser

  • https://hypernews.cern.ch/HyperNews/CMS/get/comp-ops/3272.html *
    lxplus109 ~]$ cern-get-sso-cookie --krb -r -u https://gwmsmon-development.cern.ch -o ~/private/ssocookie.txt
    lxplus109 ~]$ wget -q --load-cookies ~/private/ssocookie.txt https://gwmsmon-development.cern.ch/totalview/json/maxused
    lxplus109 ~]$ curl -L --cookie ~/private/ssocookie.txt --cookie-jar ~/private/ssocookie.txt https://gwmsmon-development.cern.ch/totalview/json/maxused

Jobs in the global pool

This is the total number of jobs available to run on our sites; it can be further split into Analysis and Production.

Debugging The CRAB3 Jobs Logs

By cms-gwmsmon

Through the 'User Web Directories' links published on https://cms-gwmsmon.cern.ch/analysisview/T2_CH_CSCS it is possible to debug the CRAB3 job logs in full detail. Regrettably, not all of these 'User Web Directories' are directly accessible from the Internet because of the CERN firewall rules. Also regrettably, these links present ALL the job logs, with no means of filtering only the T2_CH_CSCS jobs. On 23rd May 2016 the accessible / blocked table was :

From Internet :
  • http://submit-5.t2.ucsd.edu/CSstoragePath/?C=M;O=D
  • http://submit-4.t2.ucsd.edu/CSstoragePath/?C=M;O=D
  • http://vocms0114.cern.ch/?C=M;O=D ( CERN FW misconfigured! )

Only from lxplus by lynx / elinks / firefox / ... :
  • http://vocms0109.cern.ch/?C=M;O=D
  • http://vocms066.cern.ch/?C=M;O=D
  • http://vocms059.cern.ch/?C=M;O=D
  • http://vocms095.cern.ch/?C=M;O=D
  • http://vocms021.cern.ch/?C=M;O=D

By SSH / curl

A trick to browse the hidden 'User Web Directories' links listed above is to open 2 different terminals on a UI: in the 1st terminal log in at CERN with ssh -D 12345 YOURACCOUNT@lxplus.cern.ch, then in the 2nd terminal run :

$ curl --socks5 localhost:12345  --silent --stderr - http://vocms0109.cern.ch/cmsprd/160424_091722:sciaba_crab_HC-98-T2_CH_CSCS-27569-20160423050904/job_out.1.0.txt | head 
======== gWMS-CMSRunAnalysis.sh STARTING at Sun Apr 24 10:40:03 GMT 2016 on wn84.lcg.cscs.ch ========
Local time : Sun Apr 24 12:40:03 CEST 2016
Current system : Linux wn84.lcg.cscs.ch 2.6.32-573.12.1.el6.x86_64 #1 SMP Tue Dec 15 08:24:23 CST 2015 x86_64 x86_64 x86_64 GNU/Linux
Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache ...

By the CMS Dashboard

Given the CRAB3 jobs that ran at CSCS in a certain period ( http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=500&activemenu=1&usr=&site=T2_CH_CSCS&submissiontool=crab3 ), we can retrieve the https://cmsweb.cern.ch/scheddmon links and read them again with curl. Internally these are ordinary symbolic links to the 'User Web Directories' links cited in the previous section, with the difference that they are placed behind a common portal https://cmsweb.cern.ch/scheddmon/ AND they require your X509 :

$ curl --socks5 localhost:12345   --stderr - --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY https://cmsweb.cern.ch/scheddmon/0114/cms1702/160524_010723:zhangj_crab_l1-integration-v58p0_MC2015__SingleNeutrino_25nsPU10/job_out.1.2.txt | less

HammerCloud ( CRAB3 tests )

arc0[1-3] + arcbrisi status/stats split by Factory

Reported here for reference; consult it only if really needed :

CERN Plot | GOC Plot | UCSD Plot
arc01 plot | arc01 plot | arc01 plot
arc02 plot | arc02 plot | arc02 plot
arc03 plot | arc03 plot | arc03 plot
arcbrisi plot | arcbrisi plot | arcbrisi plot


CMS Nagios

Monitoring the CMS Nagios is useful to check the Failures History ; CMS Nagios Checks Logic / CMS Checks Source :

Nagios old style ( minimalist ) | Nagios Check_mk style ( cluttered ) | Failures History | JSON | Python
arc01 | arc01 | arc01 | arc01 | arc01
arc02 | arc02 | arc02 | arc02 | arc02
arc03 | arc03 | arc03 | arc03 | arc03
arcbrisi | arcbrisi | arcbrisi | arcbrisi | arcbrisi
storage01 | storage01 | storage01 | storage01 | storage01

Storage

The free space should be at least 300TB.

Here you can check the dCache CMS allocation directly:

ssh ui.lcg.cscs.ch
./cms_space.sh
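The 300TB rule can be turned into a small check. This is only a sketch: the output format of cms_space.sh is not documented here, so the hypothetical helper check_free_tb takes the free space in whole TB as an argument rather than parsing the script output.

```shell
#!/bin/sh
# Return 0 if the free space (integer, in TB) meets the minimum, 1 otherwise.
# check_free_tb is a hypothetical helper; the default minimum of 300 TB
# comes from the rule stated above.
check_free_tb() {
    free_tb="$1"
    min_tb="${2:-300}"
    if [ "$free_tb" -ge "$min_tb" ]; then
        echo "OK: ${free_tb} TB free (minimum ${min_tb} TB)"
    else
        echo "WARNING: only ${free_tb} TB free (minimum ${min_tb} TB)"
        return 1
    fi
}
```

For example, check_free_tb 250 prints a warning and returns a non-zero exit code, which makes it usable from cron or a Nagios-style wrapper.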

PhEDEx

Doc

Stats

Debugging the FTS3 logs

PhEDEx copies data to CSCS through FTS3 jobs; each job moves one or more files. If there are errors at CSCS, the detailed file transfer logs are available on the portal : https://fts3.cern.ch:8449/fts3/ftsmon/#/

An example of a detailed file transfer log is : https://fts412.cern.ch:8449/var/log/fts3/2016-05-24/cmsrm-se01.roma1.infn.it__storage01.lcg.cscs.ch/2016-05-24-0632__cmsrm-se01.roma1.infn.it__storage01.lcg.cscs.ch__803810223__3b06ed4e-2179-11e6-a787-02163e010724

To list the recently completed FTS3 job IDs ordered by time :

$ LONGOUTPUT=" -l " #  <-- if you don't want to see the long outputs then make it empty by LONGOUTPUT=""
$ cd /lhome/phedex/state/Prod/incoming/download-cms02/archive
$ export X509_USER_PROXY=/lhome/phedex/gridcert/proxy.cert
$ find . -printf "%T@ %Tc %p\n"  | sort -n | grep xferinfo  | cut -d'/' -f2,3 | xargs -iI grep status ./I | sed "s#glite-transfer-status -l  #glite-transfer-status $LONGOUTPUT#" | uniq | bash -x

To list your current FTS3 jobs :

$ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY https://fts3.cern.ch:8446/whoami
{"dn": ["/DC=EU/DC=EGI/C=CH/O=People/O=Paul-Scherrer-Institut (PSI)/CN=Fabio Martinelli", "/DC=EU/DC=EGI/C=CH/O=People/O=Paul-Scherrer-Institut (PSI)/CN=Fabio Martinelli/CN=proxy"], "vos_id": ["d5bdc1ae-600f-58dd-a94f-5c16b07974fd", "fb4bc86a-6738-5c53-bb11-206717a994e7"], "roles": [], "delegation_id": "5075946ec4d75f8c", "user_dn": "/DC=EU/DC=EGI/C=CH/O=People/O=Paul-Scherrer-Institut (PSI)/CN=Fabio Martinelli", "level": {"transfer": "vo"}, "is_root": false, "base_id": "01874efb-4735-4595-bc9c-591aef8240c9", "vos": ["cms", "cms/chcms"], "voms_cred": ["/cms/Role=NULL/Capability=NULL", "/cms/chcms/Role=NULL/Capability=NULL"], "method": "certificate"}

$ curl  --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY https://fts3.cern.ch:8446/jobs?dlg_id=5075946ec4d75f8c
[{"cred_id": "5075946ec4d75f8c", "user_dn": "/DC=EU/DC=EGI/C=CH/O=People/O=Paul-Scherrer-Institut (PSI)/CN=Fabio Martinelli", "retry": 0, "job_id": "1cc653f4-bf27-412b-82b0-138505f5c98e", "cancel_job": false, "job_state": "ACTIVE", "submit_host": "fts410.cern.ch", "priority": 1, "source_space_token": "", "reuse_job": "N", "job_metadata": "", "source_se": "srm://cms-se0.kipt.kharkov.ua", "user_cred": "", "max_time_in_queue": null, "source_token_description": null, "job_params": "", "bring_online": -1, "reason": null, "space_token": "", "submit_time": "2016-05-26T14:01:54", "retry_delay": 0, "dest_se": "srm://storage01.lcg.cscs.ch", "internal_job_params": "", "finish_time": null, "verify_checksum": false, "vo_name": "cms", "copy_pin_lifetime": -1, "agent_dn": null, "job_finished": null, "overwrite_flag": false},{"cred_id": "5075946ec4d75f8c", ...

$ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY https://fts3.cern.ch:8446/jobs  | sed -e 's/[{}]/''/g' |      awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'| grep --color mangano -A 28 -B 1
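The JSON returned by the /jobs endpoint is hard to read as a single line. As a rough sketch, the job ID and state of each job can be extracted with standard tools. The helper name fts_job_states is hypothetical, and it assumes that every job object carries a quoted "job_id" followed by a quoted "job_state", as in the sample output above.

```shell
#!/bin/sh
# Read FTS3 /jobs JSON on stdin and print one "job_id<TAB>job_state" line per job.
# Assumption: each job object contains "job_id" before "job_state", both as
# quoted strings (as in the sample above); a null state would break the pairing.
fts_job_states() {
    grep -oE '"(job_id|job_state)": *"[^"]*"' \
        | cut -d'"' -f4 \
        | paste - -
}
```

It can be appended to the curl call above, for example curl ... "https://fts3.cern.ch:8446/jobs?dlg_id=5075946ec4d75f8c" | fts_job_states.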

From / To Links Status

CLI

  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/json/prod/links?from=T2_CH_CSCS" 2>/dev/null | python -m json.tool | egrep --color 'status|$'
  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/json/prod/links?to=T2_CH_CSCS" 2>/dev/null | python -m json.tool | egrep --color 'status|$'
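When there are many links, a per-status count is quicker to read than the coloured listing. The sketch below assumes the pretty-printed JSON contains one quoted "status" field per link; count_link_status is a hypothetical helper name.

```shell
#!/bin/sh
# Read pretty-printed PhEDEx "links" JSON on stdin and print "<status> <count>"
# for each distinct status value, sorted by status name.
# Assumption: every link carries a quoted "status" field.
count_link_status() {
    grep -o '"status": *"[^"]*"' \
        | cut -d'"' -f4 \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}
```

It replaces the final egrep in either command above, for example ... | python -m json.tool | count_link_status.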

Datasets Transfer Requests

WEB

CLI
  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/xml/prod/transferrequests?node=T2_CH_CSCS" 2>/dev/null | xmllint --format -
  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/json/prod/transferrequests?node=T2_CH_CSCS" 2>/dev/null | python -m json.tool

Datasets Deployed

WEB

CLI
  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/xml/prod/blockreplicasummary?node=T2_CH_CSCS" 2>/dev/null | xmllint --format -
  • $ curl --capath /etc/grid-security/certificates -E $X509_USER_PROXY --cacert $X509_USER_PROXY "https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicasummary?node=T2_CH_CSCS" 2>/dev/null | python -m json.tool 

Datasets Removal

To be checked seldom
The CMS datasets requested by the Swiss users have to be regularly deleted both at CSCS and at PSI, especially when the /pnfs space runs low : CmsTier3/DataSetCleaningQuery
Another way to identify what PhEDEx left behind / couldn't download is http://t3serv001.mit.edu/~cmsprod/ConsistencyChecks/home.html

Dark Data

To be checked seldom
Often there are files in /store/ not known by PhEDEx ; they have to be identified by the tool StorageConsistencyCheck and probably deleted :

[root@storage02:~]# psql -U postgres -d chimera  -c " select path from v_pnfs where path like '/pnfs/lcg.cscs.ch/cms%' ; " -t -q  -o ./CSCS.txt
[root@storage02:~]# scp -p CSCS.txt phedex@cms02:
[phedex@cms02 ]$ source  /lhome/phedex/PHEDEX/etc/profile.d/init.sh
[phedex@cms02 ]$ /lhome/phedex/sw/slc6_amd64_gcc481/cms/PHEDEX/4.1.7/Utilities/StorageConsistencyCheck -db /lhome/phedex/config/DBParam.CSCS:Prod/CSCS -lfnlist /lhome/phedex/CSCS.txt  -node  T2_CH_CSCS > CSCS.txt.StorageConsistencyCheck.out 2>&1


The output CSCS.txt.StorageConsistencyCheck.out is a list of files known to the SE but not to PhEDEx

[phedex@cms02 ]$ egrep "\.root$" CSCS.txt.StorageConsistencyCheck.out | egrep -v "/store/(user|group)"
/store/CSA07/skim/2007/11/15/CSA07-CSA07JetMET-Gumbo-B1-PDJetMET_Skims1/0007/06660BB3-159B-DC11-8323-001A92971AAA.root
/store/CSA07/skim/2007/11/15/CSA07-CSA07JetMET-Gumbo-B1-PDJetMET_Skims1/0007/0A82A9B3-159B-DC11-B3D3-001A92810ADE.root
...
[phedex@cms02 ]$ egrep "\.root$" CSCS.txt.StorageConsistencyCheck.out | egrep -v "/store/(user|group)" -c
15645
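With thousands of candidate files, it helps to see which top-level /store/ directories they come from before deleting anything. The following is a minimal sketch; summarize_store_dirs is a hypothetical helper that only looks at lines starting with /store/.

```shell
#!/bin/sh
# Read a list of LFNs on stdin and print "<count> /store/<dir>" for each
# top-level directory under /store/, sorted by directory name.
# Lines that do not start with /store/<dir>/ are ignored.
summarize_store_dirs() {
    awk -F/ '$2 == "store" && NF > 3 { count[$3]++ }
             END { for (d in count) print count[d], "/store/" d }' \
        | LC_ALL=C sort -k2
}
```

For example: egrep "\.root$" CSCS.txt.StorageConsistencyCheck.out | summarize_store_dirs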

Dynamic Data Management ( DDM ) stats

PSI Proxy renewal once every year


# On a T3 UI, upload the proxy on the myproxy.cern.ch server and check if it's really there
t3ui12> myproxy-init -s myproxy.cern.ch -l psi_t3cmsvobox_phedex_joosep_2016 -x  -k renewable -R "*CN=t3cmsvobox.psi.ch" -v -c 8700
t3ui12> myproxy-info -v -s myproxy.cern.ch --username psi_t3cmsvobox_phedex_joosep_2016 -k renewable

# On PSI vobox
t3cmsvobox> /home/phedex/gridcert/proxy.cert # <-- copy Joosep's proxy here by scp or simply by copy/paste

CSCS Proxy renewal once every year


lxplus> voms-proxy-init -voms cms -valid 192:00
lxplus> voms-proxy-info
lxplus> myproxy-init -s myproxy.cern.ch -l cscs_cms02_phedex_jpata_2017 -x  -k renewable -R "*CN=cms02.lcg.cscs.ch" -v -c 8700
lxplus> myproxy-info -v -s myproxy.cern.ch --username cscs_cms02_phedex_jpata_2017 -k renewable
lxplus> cp `voms-proxy-info | grep path | awk '{print $3}'` ~/x509_cms02
cms02> rsync jpata@lxplus.cern.ch:~/x509_cms02 /home/phedex/gridcert/x509_new
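One possible sanity check after a renewal is to look at how much lifetime is left on a proxy file. The sketch below works on a number of seconds, such as the value printed by voms-proxy-info -timeleft; the helper names and the 30-day default threshold are assumptions, not part of the documented procedure.

```shell
#!/bin/sh
# Convert a remaining lifetime in seconds to whole days.
proxy_days_left() {
    echo $(( $1 / 86400 ))
}

# Return 0 (true) if fewer than min_days whole days are left, 1 otherwise.
# The default threshold of 30 days is an arbitrary, hypothetical choice.
proxy_needs_renewal() {
    seconds_left="$1"
    min_days="${2:-30}"
    [ "$(( seconds_left / 86400 ))" -lt "$min_days" ]
}
```

A typical call would be proxy_needs_renewal "$(voms-proxy-info -file /home/phedex/gridcert/proxy.cert -timeleft)" && echo "renew the proxy".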


XROOTD

Availability monitoring

Transfers monitoring
Low level debugging
  • Is CSCS in the Prod Fed ?
    [cms02] xrdmapc --list all xrdcmsglobal01.cern.ch:1094 2>&1 | grep cscs
    Srv cms01.lcg.cscs.ch:1094
    Srv cms02.lcg.cscs.ch:1094
    Srv cms01.lcg.cscs.ch:1094
    Srv cms02.lcg.cscs.ch:1094
    Srv cms01.lcg.cscs.ch:1094
    Srv cms02.lcg.cscs.ch:1094
    [cms02] xrdmapc --list all cms-xrd-transit.cern.ch:1094 2>&1 | grep cscs
    [cms02] echo $?
    1 <-- OK!!!
  • xrootd tests :
    • Browsing
    • $ xrdfs cms01.lcg.cscs.ch ls -l -u /store/mc/RunIIFall15MiniAODv2/
    • $ xrdfs cms02.lcg.cscs.ch ls -l -u /store/mc/RunIIFall15MiniAODv2/
    • Downloading
    • $ xrdcp --debug 1 -f root://cms01.lcg.cscs.ch//store/data/Run2015D/Charmonium/AOD/16Dec2015-v1/50000/8672E121-8CAE-E511-8B85-0025905C42FE.root /dev/null
    • $ xrdcp --debug 1 -f root://cms02.lcg.cscs.ch//store/data/Run2015D/Charmonium/AOD/16Dec2015-v1/50000/8672E121-8CAE-E511-8B85-0025905C42FE.root /dev/null
  • Other simpler netcat ( nc ) checks that have to succeed from any network ( try them only if the previous tests failed ) :
    • $ nc -w 5 -z cms01.lcg.cscs.ch 1094
      Expected Output : Connection to cms01.lcg.cscs.ch 1094 port [tcp/rootd] succeeded!
    • $ nc -w 5 -z cms02.lcg.cscs.ch 1094
      Expected Output : Connection to cms02.lcg.cscs.ch 1094 port [tcp/rootd] succeeded!
    • $ nc -w 5 -z storage01.lcg.cscs.ch 1095
      Expected Output : Connection to storage01.lcg.cscs.ch 1095 port [tcp/nicelink] succeeded!
    • These checks prove that the server firewalls are not blocking the xrootd connections AND that a service is actually listening on those server ports
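The three nc checks can be run in one go with a small loop. This is a sketch under assumptions: check_endpoints and the PROBE variable are hypothetical names, and PROBE defaults to the same nc -w 5 -z call used above so that another probe command can be substituted.

```shell
#!/bin/sh
# Command used to probe one host/port pair; defaults to the nc call shown above.
PROBE="${PROBE:-nc -w 5 -z}"

# Probe each "host:port" argument, print OK/FAIL per endpoint and
# return 1 if at least one endpoint failed.
check_endpoints() {
    rc=0
    for endpoint in "$@"; do
        host="${endpoint%:*}"    # text before the last colon
        port="${endpoint##*:}"   # text after the last colon
        if $PROBE "$host" "$port" >/dev/null 2>&1; then
            echo "OK   $endpoint"
        else
            echo "FAIL $endpoint"
            rc=1
        fi
    done
    return "$rc"
}
```

For example: check_endpoints cms01.lcg.cscs.ch:1094 cms02.lcg.cscs.ch:1094 storage01.lcg.cscs.ch:1095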


SQUID


Low level debugging

VOfeed NEW

CSCS ARC CEs + SE have to be present on http://dashb-cms-vo-feed.cern.ch/dashboard/request.py/cmssitemapbdii
Reference : https://twiki.cern.ch/twiki/bin/view/EGEE/VOTagsVal

Grid services have to be available in the Top BDII

CSCS ARC CEs + SE have to be present on bdii-fzk.gridka.de ; to check :

ldapsearch -x -H ldap://bdii-fzk.gridka.de:2170 -b Mds-Vo-name=CSCS-LCG2,Mds-Vo-name=local,o=grid
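The ldapsearch output is long, so a quick way to confirm that every expected service is published is to search it for each hostname. The sketch below is only illustrative: missing_hosts is a hypothetical helper, and the fully qualified ARC CE names in the usage line (arc01.lcg.cscs.ch and so on) are assumed from the short names used elsewhere on this page.

```shell
#!/bin/sh
# Read text (e.g. ldapsearch output) on stdin and print "MISSING <host>"
# for every hostname argument that does not appear in it.
# Returns 1 if at least one host is missing, 0 otherwise.
missing_hosts() {
    listing="$(cat)"
    rc=0
    for host in "$@"; do
        if ! printf '%s\n' "$listing" | grep -qF "$host"; then
            echo "MISSING $host"
            rc=1
        fi
    done
    return "$rc"
}
```

For example: ldapsearch -x -H ldap://bdii-fzk.gridka.de:2170 -b Mds-Vo-name=CSCS-LCG2,Mds-Vo-name=local,o=grid | missing_hosts arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch storage01.lcg.cscs.ch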


CMS Central Services Status

To be checked if something is wrong at our site :
https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::CMS

SiteDB info

To be checked seldom :

Explanation of the logics

To be checked seldom :

GGUS CMS ticket creation

T2 cms02 VOBox installation doc

Nowadays the CMS VO-box is managed through Puppet by the CSCS admin team.

The old recipe is here:


This topic: LCGTier2 > WebHome > CMSInfoPages > CMSMonitoring
Topic revision: r95 - 2017-02-22 - JoosepPata
 