CSCS Operations Meeting on 2016-09-27
- Date and time:
- Place:
- External link / EVO:
Agenda
- First meeting's overview
- Issue list review
- Ticket review
- Maintenances
- Other/AOB
Attendees
- Fabio, Roland, Dino, Dario, Gianni, Miguel, Stefano, Pablo, Luis
- Gianfranco sent his apologies but provided feedback by email
Minutes
On the task list, inside the first (priority) block (before the end of October):
- Requests for offers to SNF are sent and waiting for input from vendors. SNF info input in progress.
- The efficiency problems with scratch are currently treated as an incident that needs to be solved with high priority. CMS confirms there is no problem, but confirmation from LHCb and ATLAS is needed.
- Regarding the ARC config, ATLAS asks (via email) for the configuration of arc01-03 and the history of changes, if available.
- We need to see both the ATLAS and CMS efficiency plots and compare them in order to better understand what happened over the last year(s). Fabio should send CSCS the graph, and CSCS will try to compare both and find commonalities and differences.
- Authentication in Kibana with grid certificates is ongoing. Fabio is interested in having temporary access through an ssh tunnel, and Dino will tell him how on the chat.
- Joining the VOs: CMS is ready, ATLAS is waiting for two names. LHCb reported (before the meeting) that it is not needed. The rest of the task (familiarizing ourselves with the VO dashboards) is rescheduled for the Hackathon period (end of October).
- Gianfranco sent two links for identifying A/R metrics: http://wlcg-sam-atlas.cern.ch/templates/ember/#/historicalsmry/heatMap?group=ATLAS_Cloud_DE&profile=ATLAS_AnalysisAvailability&time=1m&type=Availability%20Ranking%20Plot and http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=DE&site=CSCS-LCG2,CSCS-LCG2_MCORE,ANALY_CSCS,CSCS-LCG2-HPC,CSCS-LCG2-HPC_MCORE,ANALY_CSCS-HPC
Dashboard Hackathon:
- We should not hold the hackathon without a clear plan of what we want to achieve. In the absence of a better plan, Pablo asks everyone to provide, within the next two weeks, a list of TWO lights/metrics that they would like to see in the dashboard (the ones they consider most important). Stefano will coordinate the Hackathon.
Regarding the rest of the task list (to be addressed starting in November):
- We need to keep an eye (statistics) on memory utilization and on problems derived from memory abuse (e.g. swapping) before imposing limits on jobs. Fabio suggests imposing a maximum of 2xRequiredMem, but Pablo insists that this could cause other problems and that we should not try to solve problems that don't exist (Dino reports nodes are not swapping). This might be a problem for LHConCRAY, though, so the issue will be referred to that project instead.
- The VO-Box discussion (and related tickets) is still waiting for Derek's input. This is getting increasingly important since Fabio is leaving, because the continuity of the CMS vobox needs to be guaranteed. It was agreed to increase the priority of this task by setting the deadline to November instead of December (which would be too tight for Fabio).
- The Slurm reports might be easy to fix: CSCS will re-assess the work involved and see whether it can be done easily/quickly.
- The "Finalize BDII" task is clarified: it involves changing the HA setting from lbcd to keepalived.
- All the other tasks received no input and keep their current priority.
Regarding open tickets:
- #22368 Check CSCS status on CMS dashboard. This is top priority for Fabio, but it depends on Derek's decision (another reason to get a final word from him).
- #24193 Stalled jobs at CSCS. This is an old ticket from Vladimir and can be closed
- Time ran out and we could not go through all tickets in detail, but VO Reps report there is no burning issue at the moment
Next meeting in two weeks, same day and same time (11th of October at 14:00)
Action items
- ATLAS and LHCb to confirm if the efficiency problems are still there
- CSCS to send Gianfranco ARC config and history of changes if available
- Fabio to send CSCS a graph with efficiency plots covering up to 4 years of history
- Dino to help Fabio use Kibana via ssh tunnel
- CSCS to send Gianfranco the two names to include in the VO
- Derek to send CSCS a reply to the VO-Box proposal. Final implementation to be finished before the end of November
- CSCS to reassess the work involved in fixing the slurm reports
- CSCS to update the task list with proposed changes
- EVERYONE to provide a list of two lights/metrics that they would like to see in the dashboard before next meeting