Swiss Grid Operations Meeting on 2016-08-04 at 14:00
Site status
CSCS
* Worked mainly on the GPFS slowness issue and the lcg-cp problem
- GPFS slowness is caused by I/O-intensive jobs running simultaneously
- The deprecated lcg-cp command was replaced by gfal-copy; CMS and ATLAS changed this in their site configuration
- Is LHCb facing the same issue?
- Perfsonar01/02 were down due to disk failures; both machines were reinstalled with Puppet
- cream[01-03] were removed from BDII and GOCDB yesterday, so they are officially decommissioned. Cream01 and cream03 were powered off today
- Reinstalling BDII with Puppet
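For sites that still have scripts calling the deprecated copy command, the switch to gfal-copy noted above could be wrapped roughly as follows. This is only a minimal sketch: it assumes gfal-copy is installed and is called in its basic `gfal-copy SOURCE DESTINATION` form, and the function names and the example URLs are hypothetical, not taken from any site configuration.

```python
import shlex
import subprocess


def build_copy_command(source, destination, tool="gfal-copy"):
    """Return the argument list for copying `source` to `destination`."""
    return [tool, source, destination]


def copy_file(source, destination, dry_run=True):
    """Run the copy, or only print the command when dry_run is True."""
    command = build_copy_command(source, destination)
    if dry_run:
        # Show what would be executed without touching any storage element.
        print(shlex.join(command))
        return 0
    # Return the exit code so the caller can decide how to handle failures.
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Hypothetical source and destination, for illustration only.
    copy_file("file:///tmp/test.dat", "srm://se.example.org/some/path/test.dat")
```

Keeping the tool name in one place means a later change of copy command only touches `build_copy_command`.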
Accounting numbers July:
| VO | CPU hours |
| cms | 1'793'900.165 |
| atlas | 1'118'498.575 |
| lhcb | 811'097.677 |
| ops | 19.319 |
| TOTAL | 3'723'519.013 |
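The per-VO shares of the July CPU hours above can be worked out with a short script. This is a sketch only: the variable and function names are illustrative, and it assumes the reported TOTAL is the right denominator. The four listed rows add up to about 3.3 CPU hours less than the reported TOTAL, so the script prints both figures.

```python
def parse_swiss_number(text):
    """Convert a number with apostrophe thousands separators to a float."""
    return float(text.replace("'", ""))


# CPU hours per VO for July, copied from the accounting numbers above.
cpu_hours = {
    "cms": "1'793'900.165",
    "atlas": "1'118'498.575",
    "lhcb": "811'097.677",
    "ops": "19.319",
}
reported_total = parse_swiss_number("3'723'519.013")

parsed = {vo: parse_swiss_number(value) for vo, value in cpu_hours.items()}
listed_sum = sum(parsed.values())

# Share of each VO relative to the reported TOTAL.
for vo, hours in parsed.items():
    share = 100 * hours / reported_total
    print(f"{vo:6s} {hours:14.3f} h  {share:6.2f} %")

print(f"listed rows: {listed_sum:.3f} h, reported total: {reported_total:.3f} h")
```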
PSI
UNIBE-LHEP
- Operations
- Nothing specific to report
- ATLAS specific operations
- Nothing specific to report
- HammerCloud report [1]
- UNIBE-LHEP online 74% (was 79% last month).
- UNIBE-ID 97% (this doesn't run the high I/O workloads, but it runs analysis)
- UNIBE-LHEP_CLOUD* 95%
[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
- ATLAS resource delivery UNIBE-LHEP vs CSCS-LCG2 [2]
- All jobs: 47% of ATLAS/CH (WallTime), 78% of ATLAS/CH (CPUtime)
- Good jobs: 68% of ATLAS/CH (WallTime), 84% of ATLAS/CH (CPUtime)
[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All
Accounting numbers (from scheduler) for last month (Jul 2016, includes ce03/CLOUD), wall-clock hours: 780748 (ATLAS), 35044 (t2k.org), 3289 (uboone), 12 (ops)
UNIBE-ID
UNIGE
- Operations
- Back into ATLAS production mode since July 25th:
- Memory settings hacked in the PBS batch scheduler so that ATLAS production jobs can run
- Debugging multi-core jobs: not running successfully yet
- Running smoothly; lower user activity due to the holiday period
- Network
- Upgrade of the network switch (10 Gb/s) for the file systems coming soon
- Holidays
- Accounting numbers (from scheduler) from last month
NGI_CH
- EGI central monitoring instance (ARGO)
Since July 1st, the EGI infrastructure has been monitored by two monitoring instances, available at these addresses:
https://argo-mon.egi.eu/nagios
https://argo-mon2.egi.eu/nagios
Both instances run the same set of tests, and the results they provide are equivalent.
As of the same date, the central ARGO web UI (http://argo.egi.eu/lavoisier) shows information from both instances, and the Operations Portal has been reconfigured to raise alarms based on information from the central ARGO instances.
- NGI-CH Open Tickets review
- CSCS
- 122679 (CMS) timeout in file copy to SE (switch to gfal-copy broke some Nagios tests?)
- 122486 (ATLAS) expose the full PFN through their xrootd doors => just closed it
- 122155 (ATLAS) file transfers failing (inconsistent file size & checksum): 14 new files to check (updated today)
- UNIBE-LHEP
- 117899 (ATLAS) Storage dumps (on-hold)
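For the files to be checked under ticket 122155 (inconsistent file size and checksum), a local check of one copy could look like the sketch below. It assumes the checksum in question is adler32, which is commonly used for ATLAS transfers, and that the expected size and checksum are taken from the catalogue. The function names and arguments are hypothetical.

```python
import os
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the adler32 checksum of a file as 8 lowercase hex digits."""
    checksum = 1  # adler32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def check_replica(path, expected_size, expected_adler32):
    """Compare a local copy against the expected size and checksum.

    Returns a list of problem descriptions; an empty list means both match.
    """
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"size {actual_size} != {expected_size}")
    actual_checksum = adler32_of_file(path)
    if actual_checksum != expected_adler32.lower():
        problems.append(f"adler32 {actual_checksum} != {expected_adler32}")
    return problems
```

Reporting size and checksum mismatches separately helps to tell truncated transfers from corrupted content.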
Other topics
Next meeting date:
A.O.B.
Attendees
- CSCS:
- CMS:
- ATLAS:
- LHCb:
- EGI:
Action items