
Prioritized TODO list for Phoenix until February 2017

To properly schedule the work derived from our last face-to-face meeting (MeetingCHIPPCSCSFaceToFace20160901) alongside all other activities, we need a complete list of planned tasks. Unplanned work (such as incidents) will always take precedence.

The list will be reviewed every two weeks, incorporating any new incidents that arise.

Priority runs from top to bottom; each entry lists the expected effort for one person and the proposed deadline.

| *Ticket* | *Task* | *Duration* | *Deadline* | *Done* |
| | Ask for HW offers for SNF funding request + derived analysis | weeks | 25 Oct | Yes |
| | Resolve efficiency problems (with Scratch?) | ? | 10 Oct | Yes |
| | Look at ARC + ATLAS queue configuration to understand what happened in April '16 with Scratch high utilization | hours | 10 Oct | needed? |
| | Cross-check ATLAS and CMS efficiency plots with known issues and understand the differences | days | 10 Oct | Yes |
| | Join all VOs and start familiarizing with their dashboards/logs | days | 31 Oct | Yes |
| | Update the RoadMap wiki page | hours | 25 Oct | |
| | Create full CSCS+VOs monitoring dashboard (incl. Hackathon) | weeks | 31 Oct | |
| | Authentication on Kibana with Grid Certificates | hours | 31 Oct | |
| | Identify relevant A/R metrics (and others) for each VO and track them | days | 31 Oct | identified |
| | _________________ TASK REVIEW _________________ | | | |
| 23008 | Discussion and implementation for VOBoxes | weeks | 30 Nov | Yes |
| | Understand the impact of not imposing memory limits (e.g. swapping nodes?) | hours | Dec | LHConCRAY TODO |
| | dCache update | weeks | Dec | Yes |
| 23515 | SLURM reports are broken | days | Dec | |
| | Clean up Monitoring on the Wiki | days | Feb | |
| 22368 | Complete Nagios check with info from the VO and publish in the Wiki | days | Feb | |
| | Puppetize CVMFS, Argus | days | Feb | Partial |
| | Foreman decommissioning | hours | Feb | |
| | Update documentation on the wiki | days | Feb | |
| | Implement HA on Argus | days | Feb | |
| | Puppetize storage infrastructure | weeks | Feb | Partial |
| | Add per-VO walltime usage to accounting plots | hours | Feb | Yes |
| 24518 | Check EGI accounting | days | Feb | Yes |
| | Implement nodehealthcheck | days | Feb | |
| 23114 | Sudo rights on arc0[1,2,3] + arcbrisi | days | Feb | Yes |
| | Finalize BDII lbcd->keepalive | days | Feb | Yes |

Monitoring Dashboard details

This section contains details about the common monitoring dashboard.

For all: please add two metrics that you would like to see in the Dashboard.

ATLAS

  • HC status for each queue: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/<PANDA_QUEUE>.pilot.json"|grep status
    • e.g.: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/CSCS-LCG2_MCORE.pilot.json"|grep status
    • AND-ing the status of all queues is highly likely to produce misleading information; instead, an alarm from each individual queue should be treated as an incident. We have 3 queues for Phoenix and 2 for Brisi (the latter should not trigger critical alarms during integration)
    • A time-evolution plot showing the periods during which at least one queue is blacklisted is needed as complementary information on top of the alarm (http://bigpanda.cern.ch/incidents/)
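
The per-queue alarm logic above can be sketched as follows. This is a minimal sketch, assuming the pilot.json for each queue carries a top-level "status" field (as in the curl examples); the "online" value used as the OK state is an assumption:

```shell
#!/bin/sh
# Extract the "status" value from a pilot.json document read on stdin.
# In production the document would come from:
#   curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/${queue}.pilot.json"
pilot_status() {
    sed -n 's/.*"status" *: *"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 (OK) when the status is "online", 1 (alarm) otherwise, so that
# each queue raises its own incident instead of AND-ing all queues together.
queue_alarm() {
    [ "$1" = "online" ]
}
```

Running one `queue_alarm` check per queue (rather than a single combined check) matches the "alarm from each individual queue" requirement above.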

  • Nr. of cores in each ARC status (available e.g. from gangliarc) vs. pledged cores. The number of running cores alone gives an incomplete picture when operations are degraded or compromised, so all values are needed as a function of time. A single alarm might be triggered only on nr. of running vs. pledged (tbd)
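
The running-vs-pledged alarm could be a simple threshold check. A sketch, where the 80% threshold is a placeholder (the text marks the trigger condition as tbd) and the running/pledged counts are assumed to come from e.g. gangliarc:

```shell
#!/bin/sh
# Alarm helper: return 0 (OK) when at least 80% of the pledged cores are
# running, 1 (alarm) otherwise. The 80% threshold is an assumption (tbd).
running_vs_pledged() {
    running=$1
    pledged=$2
    awk -v r="$running" -v p="$pledged" 'BEGIN { exit !(r >= 0.8 * p) }'
}
```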
PABLO

  • Real Availability/Reliability metrics from the VO perspective, since the official ones (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/) were declared irrelevant by the VO Reps at our previous F2F meeting.
  • Efficiency metrics (CPU time / walltime) within the cluster over the last few days, for each VO. It would be great to have values from both the VO and the cluster itself, but having the cluster-side numbers (from Slurm?) would already be quite something.
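
The cluster-side numbers could be derived from Slurm accounting. A minimal sketch, where the sacct invocation in the comment is an assumption about the local setup and only the aggregation step is shown:

```shell
#!/bin/sh
# Aggregate per-job "used_cpu_seconds allocated_core_seconds" pairs from stdin
# into an overall CPU/walltime efficiency percentage. In production the pairs
# could come from Slurm accounting, one line per finished job of a VO account:
#   sacct -a -A atlas -S 2016-12-01 -X -n -P -o TotalCPU,CPUTimeRAW
# (TotalCPU would first need converting from [DD-]HH:MM:SS to seconds.)
efficiency() {
    awk '{ used += $1; alloc += $2 }
         END { if (alloc > 0) printf "%.1f\n", 100 * used / alloc }'
}
```

For example, two jobs that used 50 and 25 CPU-seconds out of 100 allocated core-seconds each would yield 37.5% efficiency.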

Topic revision: r15 - 2016-12-16 - PabloFernandez