
Swiss Grid Operations Meeting on 2016-08-04 at 14:00

Site status

CSCS

  • Xxx
  • Accounting numbers (from scheduler) from last month
  • Worked mainly on the GPFS slowness issue and the lcg-cp problem
    • GPFS slowness is caused by I/O-intensive jobs running simultaneously
    • The deprecated lcg-cp command was replaced by gfal-copy; changed in the site configuration by CMS and ATLAS
      • Is LHCb facing the same issue?
  • Perfsonar01/02 dead due to disk failure; both machines reinstalled with Puppet
  • cream[01-03] removed from BDII and GOCDB yesterday, so officially decommissioned. cream01 and cream03 powered off today
  • Reinstalling BDII with Puppet
Accounting numbers July:

VO      CPU hours
cms     1'793'900.165
atlas   1'118'498.575
lhcb      811'097.677
ops            19.319
TOTAL   3'723'519.013
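
For quick comparison between months, the per-VO figures above can be turned into shares of the reported total. The following is a minimal sketch, not a site tool: the helper names `parse_swiss` and `shares` are made up for this example, and the numbers are copied by hand from the table rather than read from the scheduler.

```python
# Figures copied by hand from the July accounting table in these minutes.
CPU_HOURS = {
    "cms": "1'793'900.165",
    "atlas": "1'118'498.575",
    "lhcb": "811'097.677",
    "ops": "19.319",
}
REPORTED_TOTAL = "3'723'519.013"


def parse_swiss(number):
    """Convert a number with apostrophe thousands separators to a float."""
    return float(number.replace("'", ""))


def shares(cpu_hours, total):
    """Return each VO's percentage of the reported total (hypothetical helper)."""
    total_value = parse_swiss(total)
    return {vo: 100.0 * parse_swiss(hours) / total_value
            for vo, hours in cpu_hours.items()}


for vo, pct in shares(CPU_HOURS, REPORTED_TOTAL).items():
    print(f"{vo:6s} {pct:6.2f}%")
```

The shares are taken relative to the TOTAL line as reported, not relative to the sum of the listed rows.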

PSI

  • Accounting numbers (from scheduler) from last month
  • New HW
  • GGUS Tickets vs CSCS
    • Following Failures at T2_CH_CSCS
    • CMS jobs now call gfal-copy because of my recent change to command value="gfal2" in site-local-config.xml (marked with an arrow in the listing):
      $ find /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/PhEDEx
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/PhEDEx/storage.xml
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/cmsset_local.sh
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/cmsset_local.csh
      /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/site-local-config.xml <----
  • Holidays
    • I was on leave last week and will be on leave again next week
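
To check which stage-out command a site configuration selects (the command value="gfal2" change reported under the GGUS tickets item), something like the sketch below could be used. It is only an illustration: the embedded XML is a hypothetical, heavily trimmed stand-in, since the real site-local-config.xml on CVMFS was not inspected here, and the helper name `stage_out_commands` is invented for this example. Only the `command value="gfal2"` attribute comes from the minutes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for site-local-config.xml.
# Only command value="gfal2" is taken from the minutes; the
# surrounding element layout is an assumption.
SAMPLE = """
<site-local-config>
  <site name="T2_CH_CSCS">
    <local-stage-out>
      <command value="gfal2"/>
    </local-stage-out>
  </site>
</site-local-config>
"""


def stage_out_commands(xml_text):
    """Return the value attribute of every <command> element found."""
    root = ET.fromstring(xml_text)
    return [element.get("value") for element in root.iter("command")]


commands = stage_out_commands(SAMPLE)
print(commands)
```

To run it against the real file, the text could be read from the path marked in the listing and passed to the same function.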

UNIBE-LHEP

  • Operations

    • Nothing specific to report
  • ATLAS specific operations
    • Nothing specific to report
  • HammerCloud report [1]
    • UNIBE-LHEP online 74% (was 79% last month).
    • UNIBE-ID 97% (this doesn't run the high I/O workloads, but it runs analysis)
    • UNIBE-LHEP_CLOUD* 95%
[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

  • ATLAS resource delivery UNIBE-LHEP vs CSCS-LCG2 [2]
    • All jobs: 47% of ATLAS/CH (WallTime), 78% of ATLAS/CH (CPUtime)
    • Good jobs: 68% of ATLAS/CH (WallTime), 84% of ATLAS/CH (CPUtime)
[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All

  • Accounting numbers (from scheduler) for last month (Jul 2016) (includes ce03/CLOUD)
    • Wall-clock hours: 780748 (ATLAS), 35044 (t2k.org), 3289 (uboone), 12 (ops)
  • UNIBE-ID

    • Change of Resource Manager:
      • ATLAS (ARC-CE) now served by new Slurm server
      • Transition was easy enough; minor quirks in the first couple of hours due to a forgotten change to the singlenode environment
      • Stable operation since then
      • Rest of the cluster will be moved to Slurm in the next maintenance downtime (2nd Thursday of December) => more cores again for ATLAS after OG-SGE is dumped
    • Operations
      • Very stable operations lately

    UNIGE

    • Operations
      • Back into ATLAS production mode since July 25th:
        • Memory settings hacked in the PBS batch scheduler to run ATLAS production jobs
        • Debugging multi-core jobs: not running successfully yet
      • Running smoothly: lower user activity due to the holiday period
    • Network
      • Upgrade of the network switch (10 Gb/s) for the file systems soon
    • Holidays
      • Next 2 weeks
    • Accounting numbers (from scheduler) from last month

    NGI_CH

    • EGI central monitoring instance (ARGO)

      Since July 1st, the EGI infrastructure has been monitored by two monitoring instances, available at these addresses:

      https://argo-mon.egi.eu/nagios
      https://argo-mon2.egi.eu/nagios

      Both instances run the same set of tests, and the results they provide are equivalent.

      Starting from the same date, the central ARGO Web UI ( http://argo.egi.eu/lavoisier ) provides information from these two instances and the Operations Portal was reconfigured to raise alarms based on information from ARGO central instances.
    • NGI-CH Open Tickets review
      • CSCS
        • 122679 (CMS) timeout in file copy to SE (switch to gfal-copy broke some Nagios tests?)
        • 122486 (ATLAS) expose the full PFN through their xrootd doors => just closed it
        • 122155 (ATLAS) file transfers failing (inconsistent file size & checksum): 14 new files to check (updated today)

      • UNIBE-LHEP
        • 117899 (ATLAS) Storage dumps (on-hold)

    Other topics

    Next meeting date:

    A.O.B.

    Attendants

    • CSCS: Dino
    • CMS: Fabio
    • ATLAS: Luis, Gianfranco
    • LHCb:
    • EGI: Gianfranco

    Action items

    Topic attachments

    Attachment      History  Size   Date                Who        Comment
    g07.201607.log  r1       1.1 K  2016-08-04 - 11:25  LuisMarch  UniGe - July 2016 stats
    Topic revision: r9 - 2016-11-11 - MichaelRolli
     