Prioritized TODO list for Phoenix until February 2017
To schedule the work arising from our last meeting (MeetingCHIPPCSCSFaceToFace20160901) alongside all other activities, we need a complete list of planned activities. Unplanned tasks, such as incidents, will always take precedence.
The list will be reviewed every two weeks, taking into account any new incidents that have arisen.
Tasks are listed in descending priority, from top to bottom, each with the expected effort for one person and a proposed deadline.
Monitoring Dashboard details
This section contains details about the common monitoring dashboard.
For all: please add two metrics that you would like to see in the Dashboard.
ATLAS
- HC status for each queue: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/<PANDA_QUEUE>.pilot.json" | grep status
- e.g.: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/CSCS-LCG2_MCORE.pilot.json" | grep status
- A logical AND over all queues is very likely to give misleading information. Instead, an alarm from any individual queue should be treated as an incident. We have 3 queues for Phoenix and 2 for Brisi (the Brisi queues should not trigger critical alarms during integration).
- In addition to the alarm, we need a timeline showing the periods during which at least one queue was blacklisted (http://bigpanda.cern.ch/incidents/).
- Number of cores in each ARC state (available from e.g. gangliarc) versus pledged cores. The number of running cores alone gives an incomplete picture when operations are suboptimal or compromised. All values are needed as a function of time. A single alarm might be triggered on the number of running cores versus pledged cores alone (tbd).
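The per-queue alarm logic requested above could be sketched roughly as follows. This is only an illustration under several assumptions: the queue names other than CSCS-LCG2_MCORE are placeholders, "online" is assumed to be the healthy status value, and the 0.9 tolerance for running versus pledged cores is invented here because the actual threshold is still tbd. The statuses are passed in as a plain dictionary; how they are extracted from the pilot.json output is left out.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: "online" is the status of a healthy queue.
HEALTHY_STATUS = "online"


def queue_alarms(
    statuses: Dict[str, str],
    integration_queues: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str, str]]:
    """Return one (queue, severity, status) alarm per unhealthy queue.

    Each queue is evaluated on its own, so a single failing queue is
    reported by name instead of being folded into one AND-ed flag.
    Queues still under integration only raise a warning.
    """
    alarms = []
    for queue, status in sorted(statuses.items()):
        if status == HEALTHY_STATUS:
            continue
        severity = "warning" if queue in integration_queues else "critical"
        alarms.append((queue, severity, status))
    return alarms


def running_below_pledge(
    running_cores: int, pledged_cores: int, tolerance: float = 0.9
) -> bool:
    """True if running cores fall below a fraction of the pledge.

    The tolerance value is hypothetical; the real threshold is tbd.
    """
    return running_cores < tolerance * pledged_cores
```

Keeping the result as a list of named alarms, rather than one boolean, also gives the dashboard the raw material for the blacklisting timeline: each alarm can be stored with a timestamp per queue.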
PABLO
- Real Availability/Reliability metrics from the VO perspective, since the official ones (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/) were declared irrelevant by the VO Reps in our previous F2F meeting.
- Efficiency metrics (CPU/Walltime) within the cluster over the last few days, for each VO. Ideally we would have values from both the VO and the cluster itself, but values from the cluster alone (from Slurm?) would already be very useful.
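A minimal sketch of how the CPU/Walltime efficiency per VO might be computed on the cluster side. It assumes the job records come from Slurm accounting with the TotalCPU, Elapsed and AllocCPUS fields, and that each job can be mapped to a VO (for example through its account or group); that mapping and the sample records are assumptions, not the agreed method.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def slurm_seconds(text: str) -> float:
    """Convert a Slurm duration such as 1-02:03:04 or 05:30.500 to seconds."""
    days, _, clock = text.rpartition("-")
    parts = [float(p) for p in clock.split(":")]
    while len(parts) < 3:  # pad missing hours (and minutes) with zero
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def efficiency_by_vo(
    jobs: Iterable[Tuple[str, str, str, int]]
) -> Dict[str, float]:
    """Aggregate CPU time / (walltime * allocated cores) per VO.

    Each job is (vo, total_cpu, elapsed, alloc_cpus). Summing before
    dividing weights long, wide jobs more than short ones.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for vo, total_cpu, elapsed, alloc_cpus in jobs:
        cpu[vo] += slurm_seconds(total_cpu)
        wall[vo] += slurm_seconds(elapsed) * alloc_cpus
    return {vo: cpu[vo] / wall[vo] for vo in cpu if wall[vo] > 0}
```

Restricting the input to jobs that ended in the last few days would give the rolling view asked for above; the VO-side values would have to come from a separate source.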