Swiss Grid Operations Meeting on 2016-07-07 at 14:00
Site status
CSCS
- Xxx
- Accounting numbers (from scheduler) from last month
PSI
UNIBE-LHEP
- Xxx
- Accounting numbers (from scheduler) from last month
UNIBE-ID
- Mostly smooth operation
- Procurement:
- 80 new servers (76*20 + 4*16 => 1584 new cores); discontinued 144 cores (oldest nodes)
- installed and provisioned
- Migration from OGS/GE => Slurm planned for Q4
- Problems with NAMD jobs (using ibverbs directly) => low-level IB errors from mlx4 regarding QPs (queue pairs)
- no errors with MPI jobs using ompi or the like
- no errors with storage (GPFS over RDMA)
- ATLAS specific: large number of random a-rex crashes within the last 2 weeks
- reason unknown; 24 crashes between 2016-06-15 and last Monday; no crashes in the last 3 days
UNIGE
- Operations
- 10 machines added into the batch system (80 cores) + 3 machines replaced:
- DELL - Intel Xeon @ 2.4 GHz - with 8 cores and 48 GB of memory
- RAID controller: recurring problem on our DPM and NFS file servers (it happened 3 or 4 times over the last few months)
- Increased batch system activity from DPNC users (other groups, in addition to ATLAS)
- Still not in ATLAS production; problems related to memory (hints provided by Gianfranco)
- Data Management:
- Accounting numbers (from scheduler) from last month
NGI_CH
- Xxx
- NGI-CH Open Tickets review
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS:
- CMS:
- ATLAS: Michael Rolli (UNIBE-ID) => absent due to illness, but provided the notes above
- LHCb:
- EGI:
Action items
Topic revision: r6 - 2016-07-07 - FabioMartinelli