Swiss WLCG Operations Meeting on 2010-09-09

Date and time: 2010/09/09 at 9:30
Place: EVO, password: chipp
External link / EVO: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=MDMaM82a28DuDs929lD99D

Agenda

Maintenance day report and site status
Status of Lustre problems
Status of new experiment software area
- Derek: Two local problems that were discovered during the installations by the central CMS ops team have been fixed, and the CMS sw area is now ready.
- Roland: LHCb sw area is ready.
Status of Phase D
- Derek: We had a visit from Bull at PSI. While discussing different storage options, they mentioned that they use LSI solutions like the one suggested by CSCS in their slides. They seem to be price competitive, and we recently had a rather good experience with Bull in resolving a problem with a DDN system that we had bought through them. Maybe one could get an offer for the HW from them, though CSCS may already have an ETH framework agreement (Rahmenvertrag) for this kind of equipment.
Availability / Reliability values for July (79%) and August (47%). The numbers seem too low to be real; we need to cross-check them against the numbers from the VOs.
(Derek): Could CSCS give a short overview of the scheduling policies? Are there still the 100 (or so) reserved job slots for each experiment, so that we can guarantee a certain availability?
Review Action Items
- CSCS: purchase hardware needed for implementing NFS setup
- CSCS: open 3 tickets against Sun support; see ticket #7851
- MG: check with VO to test CREAM CE and give status report; check availability of SAM tests for CREAM-CE
- DF: check availability of SAM tests for CREAM-CE
  - classical CMS SAM test
  - CMS dashboard view
  - No official directive to abandon lcg-CE in favor of CREAM
AOB

Attendees

ATLAS: Marc, Gianfranco
CMS: Derek, Leo
LHCb: Ronald
CSCS: Peter, Pablo

Minutes

During the maintenance we had a network problem with the SE head nodes that caused pools to go away for some time and left some transfers hanging, so we couldn't bring the site back until that was solved.
Lustre is giving trouble with the newest client version (1.8.4), so we decided to go back to 1.8.3 in a rolling downgrade.
When ATLAS finishes copying the software to the NFS server (it's ongoing), we are going to mount it on all WNs and change the environment variables so that new jobs use the new area, while old jobs will still finish using the old one. This may cause some newly arriving jobs to be unable to find SW that should be there, but there should not be many of them, so this should not be a problem.
Phase D. We are going to ask Bull for an offer to compare with the one from IBM before placing the order. Also, we are still working with the numbers from the last CHIPP Computing Board; they will be sent ASAP.
Availability / Reliability. It looks like the ARC-CE could be the reason why we have such bad numbers. See https://gridview.cern.ch/GRIDVIEW/sa/bin/same_graphs.php?XX=&Information=SiteDetail&DefVO=15&TestVO=-1&DurationOption=daily&LComponent=-2&NodeID=-1&TestID=-1&Hour1=0&StartDay=1&StartMonth=8&StartYear=2010&Hour2=23&EndDay=31&EndMonth=8&EndYear=2010&LTier1Site=12&RelOrAvail=Availability&OnlyCritical=ON&SiteFullName=1&Report=0&LTier2Site[]=12

And also http://lxarda16.cern.ch/dashboard/request.py/historicalsiteavailability?siteSelect3=T2&sites=T2_CH_CSCS&timeRange=individual&start=2010-08-01&end=2010-08-30
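
As a rough illustration of why a single failing service could drag the site numbers down, the sketch below uses the usual definitions (availability = time up / (total time - unknown time); reliability additionally excludes scheduled downtime) and assumes that instances of the same service type are OR-ed while different service types are AND-ed. Whether the ARC-CE is counted as a service type of its own is exactly the open question here. The function names and the example values are hypothetical and are not taken from GridView.

```python
# Minimal sketch of site availability/reliability bookkeeping.
# ASSUMPTION: same-type service instances are OR-ed, different service
# types are AND-ed. All names and numbers below are illustrative only.

def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time during which the site was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0,
                unknown_hours=0.0):
    """Like availability, but scheduled downtime is not held against the site."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the period")
    return up_hours / expected


def site_is_up(status_by_type):
    """status_by_type maps a service type to a list of per-instance booleans.

    The site counts as up only if every service type has at least one
    working instance.
    """
    return all(any(instances) for instances in status_by_type.values())


# A second CE can mask a failing one when both belong to the same type ...
same_type = {"CE": [True, False], "SE": [True]}
# ... but not when the failing CE is treated as a separate service type.
separate_type = {"CE": [True], "ARC-CE": [False], "SE": [True]}

print(site_is_up(same_type), site_is_up(separate_type))
```

If the failing CE were grouped with the working CEs, the site would still count as up; treated as a separate type, it turns the whole site red for that interval.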

Scheduling policies. We have the same reservations as before: 96 cores for ATLAS, 96 cores for CMS, and 20 for LHCb.

Action items

CSCS is to open a GGUS ticket to investigate whether the formula used to calculate availability/reliability has changed; problems with Arc01 may have made the whole site red.
CSCS is also asking Bull for an offer for the storage for Phase D.
CSCS is going to downgrade the Lustre clients to 1.8.3.