<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

Action Items

Legend: <number>. title (added: date, done: date) NEW / Processing / UPDATED / DONE

Last update: 19.05.2020

1. Reduce ATLAS dips within the box: (added: 16.01.2020, done:)Processing

(Box = CHIPP allocated nodes at CSCS)

  • IMPLEMENT THE “INTERNAL DYNAMIC ALLOCATION” (easy to implement - but the idle cost goes back to the VOs)

    • Fair share + optimized priority with reservations

    • When a VO comes back, it will take a higher priority until it gets back to its target share, then go back to normal

    • Align the boundaries at the node level (see [1] below in item 3.)

    • IMPLEMENT ON MON20 TEST UNTIL 29 JAN (MAINTENANCE).

    • PRELIMINARY NUMBERS to size the shared resources:

      • CMS 50%

      • ATLAS 50%

      • LHCb 50%

    • The final results will be clear once CMS manages to send a steady flow of production jobs
  • [19.05] ATLAS/CMS down at node
  • Need a final discussion to decide whether the Reservations are useful or not
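The "internal dynamic allocation" above (fair share plus reservations, with VO boundaries aligned at the node level) could be sketched in Slurm terms roughly as follows. This is a sketch only: it assumes the multifactor priority plugin is enabled, and the account names, share weights, and node list are hypothetical, not the actual CSCS configuration.

```shell
# Per-VO fair-share targets in the accounting database
# ('atlas', 'cms', 'lhcb' account names and the weights are assumptions)
sacctmgr modify account atlas set fairshare=40
sacctmgr modify account cms   set fairshare=40
sacctmgr modify account lhcb  set fairshare=20

# Draw a VO boundary at the node level with a reservation
# (node list is hypothetical)
scontrol create reservation ReservationName=atlas_nodes \
  accounts=atlas nodes=nid0[0001-0040] starttime=now duration=infinite
```

With fair share active, a VO returning after a dip accrues priority until its usage catches up with its target share, which is the behaviour described above.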
2. Reduce ATLAS dips outside the box: (added: 16.01.2020, done:20.02.2020)DONE

- Discussion to be started with M. DeLorenzi and CSCS CTO*

  • START THE DISCUSSION TO GO FOR THE “DYNAMIC ALLOCATION”

- forced draining of nodes already in use “capped”

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- with short jobs “backfilling”

- (jobs have to be in the queue already; otherwise the backfill opportunity cannot be detected)

- "Opportunistic was never really on the agenda from the CHIPP/SNF side".

3. Help reducing Cache occupancy (added: 16.01.2020, done:)Processing

At the moment we run all VOs in one node, i.e. 3 stacks of software in one node

- go for user segregation: the portion not in the shared-resource band can be tested on the reservation test by drawing the boundaries at the node level [1] (see item 1)

- check after the discussion on Reservations (see item 2)

4. Site Log (added: 16.01.2020, done:20.02.2020)DONE

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to compile it.

Done: collect in a TWiki Hot Topics page the list of the most relevant changes, from both the CSCS and the experiment side, that can affect operations

4.b ARC CE Log Access (added: 19.05.2020) Processing

Investigate propagating ARC CE logs for external access

4.c ARC Configuration on Wiki (added: 19.05.2020) Processing

Requires Wiki login --> fix access permission

  • Mauro to ask Derek

5. ATLAS --nice (added: 16.01.2020, updated:12.03.2020)UPDATED

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco access to log in on Daint and use sprio: login given (20.02.2020)

Waiting for Gianfranco's go-ahead
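For reference, the test above could look like this on the command line (a sketch only: the `--nice` value and the script name are hypothetical; in Slurm a positive nice value *reduces* job priority, so a low value means a small penalty):

```shell
# Submit an ATLAS job with a low nice adjustment (values illustrative)
sbatch --nice=10 atlas_pilot.sh

# Inspect the resulting per-job priority factors
sprio -l
```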

6. ARC metrics (added: 16.01.2020, updated:20.02.2020)Processing

→ check if possible to plug the monitoring package in elastic

On Nick's todo list: Capture ARC state counts and display in dashboard

7. Queue status hammerclouds (added: 16.01.2020, updated:12.03.2020)UPDATED

→ input from VO reps (provide the API call) to CSCS, then put on the dashboard. The idea is to see on one page, with a couple of boxes, the status of the systems.

LHCb provided the query

CMS trying to get to the information (Vinzenz/Derek): get the script from LHCb as an example

ATLAS: there is no single query that can provide the status. Gianfranco provided some logic to Miguel in the past. Investigating how to get to a single script that produces one (or a few) binary or semaphore (R/Y/G) outputs
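A minimal sketch of what such a semaphore script could reduce to, once the VO-specific queries are in place. The function name, inputs, and thresholds are illustrative assumptions, not the actual logic provided by the VO reps:

```python
def queue_semaphore(running, queued, capacity, min_queue_depth=10):
    """Map raw queue numbers to a Red/Yellow/Green status string.

    R: allocated capacity is mostly idle (a "dip").
    Y: nodes are busy but the queue is about to drain empty.
    G: allocation is busy and enough work is waiting.
    Thresholds (50% utilisation, queue depth 10) are placeholders.
    """
    utilisation = running / capacity if capacity else 0.0
    if utilisation < 0.5:
        return "R"
    if queued < min_queue_depth:
        return "Y"
    return "G"
```

Each VO's API call would fill in `running`/`queued`, and the dashboard would only need to render the one-letter result per system.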

8. dCache updates (added: 16.01.2020, updated: 20.02.2020, updated: 19.05.2020) DONE

→ Ask Dario to present plans for dCache at the next ops meeting.

In the process of configuring a dCache Test Lab:

- Goal is to match the production system and test the upgrade steps

- Will be able to spot problems before applying changes to the production system

- Also allows for optimizing upgrades to complete more efficiently

- Evaluating the change logs of previous versions to spot changes and adapt the configuration

- Upgrade from 3.2 to 5.2 on March 24 or 26; follow on the Hot Topics page.

9. ATLAS transition to federated resources (added: 16.01.2020, done:) Processing

ATLAS full transition timescale is 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.

--> Update from Gianfranco

10. ARC upgrade (added: 12.03.2020, done:19.05.2020) DONE

Upgrade from 5.4.4 to 6.5 in April; follow on the Hot Topics page.

11. ATLAS GPU Tests (added: 19.05.2020)Processing

1769 jobs started in April

Project consumed 102 hours in April

--> Update from Gianfranco

Topic revision: r8 - 2020-06-11 - MauroDonega