<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

Action Items

Legend: <number>. title (added: date, done: date) NEW / Processing / UPDATED / DONE

Last update: 19.05.2020

1. Reduce ATLAS dips within the box: (added: 16.01.2020, done:)Processing

(Box = CHIPP allocated nodes at CSCS)

  • IMPLEMENT THE “INTERNAL DYNAMIC ALLOCATION” (easy to implement - but the idle cost goes back to the VOs)

    • Fair share + optimized priority with reservations

    • When a VO comes back, it will take a higher priority until it gets back to its target share, then go back to normal

    • Align the boundaries at the node level (see [1] below in item 3.)

    • IMPLEMENT ON MON20 TEST UNTIL 29 JAN (MAINTENANCE).

    • PRELIMINARY NUMBERS to size the shared resources:

      • CMS 50%

      • ATLAS 50%

      • LHCb 50%

    • The final results will be clear once CMS manages to send a steady flow of production jobs
  • [19.05] ATLAS/CMS down at node
  • Need a final discussion to decide whether the Reservations are useful or not
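The "internal dynamic allocation" above (fair share plus reservations, with VO boundaries aligned at the node level) could be sketched in Slurm terms roughly as follows. This is a sketch only: it assumes the multifactor priority plugin is enabled, and the account names, share weights, and node list are hypothetical, not the actual CSCS configuration.

```shell
# Per-VO fair-share targets in the accounting database
# ('atlas', 'cms', 'lhcb' account names and the weights are assumptions)
sacctmgr modify account atlas set fairshare=40
sacctmgr modify account cms   set fairshare=40
sacctmgr modify account lhcb  set fairshare=20

# Draw a VO boundary at the node level with a reservation
# (node list is hypothetical)
scontrol create reservation ReservationName=atlas_nodes \
  accounts=atlas nodes=nid0[0001-0040] starttime=now duration=infinite
```

With fair share active, a VO returning after a dip accrues priority until its usage catches up with its target share, which is the behaviour described above.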
2. Reduce ATLAS dips outside the box: (added: 16.01.2020, done:20.02.2020)DONE

- Discussion to be started with M. DeLorenzi and CSCS CTO*

  • START THE DISCUSSION TO GO FOR THE “DYNAMIC ALLOCATION”

- forced draining of nodes already in use “capped”

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- with short jobs “backfilling”

- (jobs have to be in the queue already; otherwise the backfill opportunity cannot be detected)

- "Opportunistic was never really on the agenda from the CHIPP/SNF side".

3. Help reducing Cache occupancy (added: 16.01.2020, done:)Processing

At the moment we run all VOs in one node, i.e. 3 stacks of software in one node

- go for user segregation: the portion not in the shared-resource band can be tested on the reservation test by drawing the boundaries at the node level [1] (see item 1)

- check after the discussion on Reservations (see item 2)

4. Site Log (added: 16.01.2020, done:20.02.2020)DONE

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to compile it.

Done: collect in a TWiki Hot Topics page the list of the most relevant changes, from both the CSCS and the experiment side, that can affect operations

4.b ARC CE Log Access (added: 19.05.2020) Processing

Investigate propagating ARC CE logs for external access

4.c ARC Configuration on Wiki (added: 19.05.2020) Processing

Requires Wiki login --> fix access permission

  • Mauro to ask Derek

5. ATLAS --nice (added: 16.01.2020, updated:12.03.2020)UPDATED

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco access to log in on Daint and use sprio: login given (20.02.2020)

Waiting for Gianfranco's go-ahead
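For reference, the test above could look like this on the command line (a sketch only: the `--nice` value and the script name are hypothetical; in Slurm a positive nice value *reduces* job priority, so a low value means a small penalty):

```shell
# Submit an ATLAS job with a low nice adjustment (values illustrative)
sbatch --nice=10 atlas_pilot.sh

# Inspect the resulting per-job priority factors
sprio -l
```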

6. ARC metrics (added: 16.01.2020, updated:20.02.2020)Processing

→ check if possible to plug the monitoring package in elastic

On Nick's todo list: Capture ARC state counts and display in dashboard

7. Queue status hammerclouds (added: 16.01.2020, updated:12.03.2020)UPDATED

→ input from VO reps (provide the API call) to CSCS, then put on the dashboard. The idea is to see on one page, with a couple of boxes, the status of the systems.

LHCb provided the query

CMS trying to get to the information (Vinzenz/Derek): get the script from LHCb as an example

ATLAS: there is no single query that can provide the status. Gianfranco provided some logic to Miguel in the past. Investigating how to get to a single script that produces one (or a few) binary or semaphore (R/Y/G) outputs
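A minimal sketch of what such a semaphore script could reduce to, once the VO-specific queries are in place. The function name, inputs, and thresholds are illustrative assumptions, not the actual logic provided by the VO reps:

```python
def queue_semaphore(running, queued, capacity, min_queue_depth=10):
    """Map raw queue numbers to a Red/Yellow/Green status string.

    R: allocated capacity is mostly idle (a "dip").
    Y: nodes are busy but the queue is about to drain empty.
    G: allocation is busy and enough work is waiting.
    Thresholds (50% utilisation, queue depth 10) are placeholders.
    """
    utilisation = running / capacity if capacity else 0.0
    if utilisation < 0.5:
        return "R"
    if queued < min_queue_depth:
        return "Y"
    return "G"
```

Each VO's API call would fill in `running`/`queued`, and the dashboard would only need to render the one-letter result per system.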

8. dCache updates (added: 16.01.2020, updated: 20.02.2020, updated: 19.05.2020) DONE

→ Ask Dario to present plans for dCache at the next ops meeting.

In the process of configuring a dCache Test Lab:

- Goal is to match the production system and test the upgrade steps

- Will be able to spot problems before applying changes to the production system

- Also allows for optimizing upgrades to complete more efficiently

- Evaluating the change logs of previous versions to spot changes and adapt the configuration

- Upgrade from 3.2 to 5.2 on March 24 or 26; follow on the Hot Topics page.

9. ATLAS transition to federated resources (added: 16.01.2020, done:) Processing

ATLAS full transition timescale is 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.

--> Update from Gianfranco

10. ARC upgrade (added: 12.03.2020, done:19.05.2020) DONE

Upgrade from 5.4.4 to 6.5 in April; follow on the Hot Topics page.

11. ATLAS GPU Tests (added: 19.05.2020)Processing

1769 jobs started in April

Project consumed 102 hours in April

--> Update from Gianfranco

Topic revision: r8 - 2020-06-11 - MauroDonega