
Prioritized TODO list for Phoenix until February 2017

To properly schedule the work derived from our last face-to-face meeting (MeetingCHIPPCSCSFaceToFace20160901) alongside all other activities, we need a complete list of planned tasks. Unplanned work (such as incidents) will always take precedence.

The list will be reviewed every two weeks, incorporating any new incidents that arise.

Priority runs from top to bottom; each entry lists the expected effort for one person and the proposed deadline.

| *Ticket* | *Task* | *Duration* | *Deadline* | *Done* |
| | Ask for HW offers for SNF funding request + derived analysis | weeks | 25 Oct | Yes |
| | Resolve efficiency problems (with Scratch?) | ? | 10 Oct | Yes |
| | Look at ARC + ATLAS queue configuration to understand what happened in April '16 with Scratch high utilization | hours | 10 Oct | needed? |
| | Cross-check ATLAS and CMS efficiency plots with known issues and understand the differences | days | 10 Oct | Yes |
| | Join all VOs and start familiarizing with their dashboards/logs | days | 31 Oct | Yes |
| | Update the RoadMap wiki page | hours | 25 Oct | |
| | Create full CSCS+VOs monitoring dashboard (incl. Hackathon) | weeks | 31 Oct | |
| | Authentication on Kibana with Grid Certificates | hours | 31 Oct | |
| | Identify relevant A/R metrics (and others) for each VO and track them | days | 31 Oct | identified |
| | _________________ TASK REVIEW _________________ | | | |
| 23008 | Discussion and implementation for VOBoxes | weeks | 30 Nov | Yes |
| | Understand the impact of not imposing memory limits (e.g. swapping nodes?) | hours | Dec | LHConCRAY TODO |
| | dCache update | weeks | Dec | Yes |
| 23515 | SLURM reports are broken | days | Dec | |
| | Clean up Monitoring on the Wiki | days | Feb | |
| 22368 | Complete Nagios check with info from the VO and publish in the Wiki | days | Feb | |
| | Puppetize CVMFS, Argus | days | Feb | Partial |
| | Foreman decommissioning | hours | Feb | |
| | Update documentation on the wiki | days | Feb | |
| | Implement HA on Argus | days | Feb | |
| | Puppetize storage infrastructure | weeks | Feb | Partial |
| | Add per-VO walltime usage to accounting plots | hours | Feb | Yes |
| 24518 | Check EGI accounting | days | Feb | Yes |
| | Implement nodehealthcheck | days | Feb | |
| 23114 | Sudo rights on arc0[1,2,3] + arcbrisi | days | Feb | Yes |
| | Finalize BDII lbcd->keepalive | days | Feb | Yes |

Monitoring Dashboard details

This section contains details about the common monitoring dashboard.

For all: please add two metrics that you would like to see in the Dashboard.

ATLAS

  • HC status for each queue: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/<PANDA_QUEUE>.pilot.json"|grep status
    • e.g.: # curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/CSCS-LCG2_MCORE.pilot.json"|grep status
    • AND-ing the status of all queues is highly likely to produce misleading information; instead, an alarm from each individual queue should be treated as an incident. We have 3 queues for Phoenix and 2 for Brisi (the latter should not trigger critical alarms during integration)
    • A time-evolution plot showing the periods during which at least one queue is blacklisted is needed as complementary information on top of the alarm (http://bigpanda.cern.ch/incidents/)
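
The per-queue alarm logic above can be sketched as follows. This is a minimal sketch, assuming the pilot.json for each queue carries a top-level "status" field (as in the curl examples); the "online" value used as the OK state is an assumption:

```shell
#!/bin/sh
# Extract the "status" value from a pilot.json document read on stdin.
# In production the document would come from:
#   curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/${queue}.pilot.json"
pilot_status() {
    sed -n 's/.*"status" *: *"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 (OK) when the status is "online", 1 (alarm) otherwise, so that
# each queue raises its own incident instead of AND-ing all queues together.
queue_alarm() {
    [ "$1" = "online" ]
}
```

Running one `queue_alarm` check per queue (rather than a single combined check) matches the "alarm from each individual queue" requirement above.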

  • Nr. of cores in each ARC status (available e.g. from gangliarc) vs. pledged cores. The number of running cores alone gives an incomplete picture when operations are degraded or compromised, so all values are needed as a function of time. A single alarm might be triggered only on nr. of running vs. pledged (tbd)
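
The running-vs-pledged alarm could be a simple threshold check. A sketch, where the 80% threshold is a placeholder (the text marks the trigger condition as tbd) and the running/pledged counts are assumed to come from e.g. gangliarc:

```shell
#!/bin/sh
# Alarm helper: return 0 (OK) when at least 80% of the pledged cores are
# running, 1 (alarm) otherwise. The 80% threshold is an assumption (tbd).
running_vs_pledged() {
    running=$1
    pledged=$2
    awk -v r="$running" -v p="$pledged" 'BEGIN { exit !(r >= 0.8 * p) }'
}
```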
PABLO

  • Real Availability/Reliability metrics from the VO perspective, since the official ones (http://wlcg-sam.cern.ch/reports/2016/201611/wlcg/) were declared irrelevant by the VO Reps at our previous F2F meeting.
  • Efficiency metrics (CPU time / walltime) within the cluster over the last few days, for each VO. It would be great to have values from both the VO and the cluster itself, but having the cluster-side numbers (from Slurm?) would already be quite something.
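
The cluster-side numbers could be derived from Slurm accounting. A minimal sketch, where the sacct invocation in the comment is an assumption about the local setup and only the aggregation step is shown:

```shell
#!/bin/sh
# Aggregate per-job "used_cpu_seconds allocated_core_seconds" pairs from stdin
# into an overall CPU/walltime efficiency percentage. In production the pairs
# could come from Slurm accounting, one line per finished job of a VO account:
#   sacct -a -A atlas -S 2016-12-01 -X -n -P -o TotalCPU,CPUTimeRAW
# (TotalCPU would first need converting from [DD-]HH:MM:SS to seconds.)
efficiency() {
    awk '{ used += $1; alloc += $2 }
         END { if (alloc > 0) printf "%.1f\n", 100 * used / alloc }'
}
```

For example, two jobs that used 50 and 25 CPU-seconds out of 100 allocated core-seconds each would yield 37.5% efficiency.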

Topic revision: r15 - 2016-12-16 - PabloFernandez