CHIPP-CSCS Face to Face Meeting on 2019-09-13
- Date and time: Friday 13th of September at 10:15
- Place: Zurich ETHZ (LEE E 126)
- External link / EVO: probably not possible
Agenda
- 10:15 - Welcome and agenda
- 10:30 - VO status report (~last 6 months)
  - LHCb (20' - Roland)
  - ATLAS (20' - Gianfranco)
  - CMS (20' - Vinzenz)
- 11:30 - Tier-2 status, plans & pledges
  - CSCS (45' - Various people)
  - Long-term resource provisioning overview (15' - Pablo)
  - Discussion (30')
- 13:15 - Lunch
- 14:30 - Tier-2 status, plans & pledges
  - UNIBE-LHEP (30' - Gianfranco)
- 15:00 - Tier-3 status and plans
  - PSI (15' - Nina)
  - UNIBE-ID (15' - Gianfranco)
  - UNIGE (15' - Gianfranco)
- 15:45 - Coffee break
- 16:00 - NGI_CH (20' - Gianfranco)
- 16:30 - End of meeting
Attendants
Minutes
Please also check the action items and the attachments with the individual reports.
# LHCb
- Check the apparent 2500-job limit
- Higher failure rate compared to other sites (still low, not important)
- SAM tests are VERY bad, but nobody seems to care about that either
- Re-calculate HS06 with Singularity. Don't publish results in the middle of the month.
# CMS
- CSCS is low in the performance rank (job success rate)
- The CMS VO box should be updated.
- CMS should check whether PhEDEx will still be needed after they adopt Rucio
- Some charts might be wrong: Vinzenz will recheck them soon
# ATLAS
- Generally CSCS is doing well (below the pledge mark, but very close)
- If a service is down for more than 12 hours, a downtime should be declared (a single service can be declared down) and a message sent on the chat
- Discussion regarding the 40:40:20 shares - will be addressed in the context of next year's grant application and pledges
- Discussion on the involvement of the VOs in the monthly meetings
- Migration to ARC 6 is needed; it is still not clear how to proceed
- Dashboard: add "cumulative CPU utilization per VO" (the pie chart Nick gave Gianfranco on the chat) and "hammercloud state (green/red box) per VO"
# Long-term provisioning:
- Two goals for the VO reps to raise with their VOs:
  a) increase usage of accelerators (GPUs at CSCS), and
  b) ask about reducing storage (remove dCache?)
- CSCS to inquire internally how to deploy FPGAs
- Meet at the end of October/November to discuss what we found out from the others
- LHCb does not use CSCS as a cache, but as real storage (with 1 replica somewhere else)
Action items
- Roland - LHCb
  - Check the y-axis (jobs submitted). It looks inconsistent.
- Vinzenz - CMS
  - The monitoring moved to Grafana. Re-check the plots shown; compare Grafana with the CMS dashboard.
- General:
  - Modify/optimize Slurm for proper job distribution (to avoid the skewing which happened; remember CMS was low at 33%, ATLAS high at 41%, LHCb up at 26%)
  - Reminder: all VO contacts and CSCS MUST check the dashboards daily for ALL three experiments (and report monthly); in case of serious problems (one VO missing, not submitting jobs, etc.), report immediately to the corresponding VO and open tickets
  - Monitor running vs. installed cores
  - Monitor VO shares
  - Batch vs. dashboard metrics unification
  - Understand the differences in quoted numbers for "delivered resources", which are available on 1) Grafana, 2) accounting-next.egi.eu/wlcg/report/tier2/ and 3) the VO-specific dashboards
  - VO-status box: get one per VO?
  - All VOs: discuss internally the possible consequences if storage in general were reduced or even abandoned, and strategies with respect to accelerators.
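For the share-monitoring items above, a minimal sketch of how the observed usage could be compared with the 40:40:20 target. The mapping of the target shares to VOs (ATLAS 40, CMS 40, LHCb 20) is an assumption, as the minutes do not state it, and the 5-point tolerance and the function names are purely illustrative.

```python
# Assumed mapping of the 40:40:20 target shares to VOs (not stated in the minutes).
TARGET_SHARES = {"ATLAS": 40.0, "CMS": 40.0, "LHCb": 20.0}

# Observed shares quoted in the action items (percent of delivered resources).
OBSERVED_SHARES = {"CMS": 33.0, "ATLAS": 41.0, "LHCb": 26.0}


def share_deviation(observed, target):
    """Return observed minus target share, in percentage points, per VO."""
    return {vo: observed[vo] - target[vo] for vo in target}


def flag_skewed(deviation, tolerance=5.0):
    """List the VOs whose share deviates by more than `tolerance` points.

    The default tolerance is a hypothetical threshold, not an agreed value.
    """
    return sorted(vo for vo, delta in deviation.items() if abs(delta) > tolerance)


deviation = share_deviation(OBSERVED_SHARES, TARGET_SHARES)
skewed = flag_skewed(deviation)

for vo, delta in sorted(deviation.items()):
    print(f"{vo}: {delta:+.1f} points relative to target")
print("Outside tolerance:", ", ".join(skewed) or "none")
```

Under these assumptions CMS is 7 points below and LHCb 6 points above target, while ATLAS is within 1 point.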
Attachments