Arrow left Go to previous page / next page of Tier3 site log MOVED TO...

09. 06. 2016 dCache 2.15 stuck on t3se01

BE AWARE OF Derek's latest tools for dCache 2.15: https://github.com/fabiomartinelli/dcache-shellutils

  • today t3se01 had a high load
  • by running lsof | grep java I saw more than 100 gsiftp connections coming from t3ui17
  • I connected to t3ui17
  • again with lsof I found that they came from user gperrin and his gfalFS mount of srm://t3se01.psi.ch
  • I killed his gfalFS mount point
  • but t3se01 was still stuck because dCache had already run out of memory (Out of Memory) and was not able to recover
  • the dCache logs on t3se01 are :
  • [root@t3se01 ~]# dcache services 
    DOMAIN                SERVICE          CELL                  LOG                                       
    t3se01-Domain-dcap    dcap             DCap-t3se01           /var/log/dcache/t3se01-Domain-dcap.log    
    t3se01-Domain-gsidcap dcap             DCap-gsi-t3se01       /var/log/dcache/t3se01-Domain-gsidcap.log 
    t3se01-Domain-gsiftp  ftp              GFTP-t3se01           /var/log/dcache/t3se01-Domain-gsiftp.log  
    t3se01-Domain-srm     srm              SRM-t3se01            /var/log/dcache/t3se01-Domain-srm.log     
    t3se01-Domain-srm     spacemanager     SpaceManager          /var/log/dcache/t3se01-Domain-srm.log     
    t3se01-Domain-srm     transfermanagers RemoteTransferManager /var/log/dcache/t3se01-Domain-srm.log     
    t3se01-Domain-utility pinmanager       PinManager            /var/log/dcache/t3se01-Domain-utility.log 
    t3se01-Domain-info    info             info                  /var/log/dcache/t3se01-Domain-info.log    
    t3se01-Domain-xrootd  xrootd           Xrootd-t3se01         /var/log/dcache/t3se01-Domain-xrootd.log  
    dCacheDomain          poolmanager      PoolManager           /var/log/dcache/dCacheDomain.log          
    dCacheDomain          topo             topo                  /var/log/dcache/dCacheDomain.log
  • you can check them all together with :
  • [root@t3se01 ~]# dcache services  | grep log | awk '{print $4}'  | xargs -I{} tail -v {}
    ==> /var/log/dcache/t3se01-Domain-dcap.log <==
    09 Jun 2016 10:51:30 (DCap-t3se01--AAU01IboGLA) [door:DCap-t3se01--AAU01IboGLA] Executing command: 3 0 client open "dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/cheidegg/sea/11/2016-06-01-17-17-00/QCD_Pt80to120_EMEnriched.root" r t3ui17.psi.ch 45550 -timeout=-1 -onerror=default -passive -uid=609
    ...
    ==> /var/log/dcache/t3se01-Domain-gsidcap.log <==
    09 Jun 2016 10:47:27 (System) [info] Message arrived : info@t3se01-Domain-info:*@t3se01-Domain-info:*@dCacheDomain];D=[>System@t3se01-Domain-gsidcap];C=java.lang.String;O=<1465462047010:54631>;LO=<1465462047010:54630>;TTL=1000>
    ...
    
  • finally I restarted dCache on t3se01 with
    dcache restart
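
The diagnosis above can be sketched as two small shell helpers. This is only a sketch, not the exact procedure used on t3se01: the function names count_peers and dcache_log_files are hypothetical, and the column positions are assumptions (the usual lsof -i layout with NAME in column 9, and the dcache services layout shown above with LOG in column 4).

```shell
# Hypothetical helpers; the names and the column positions are assumptions.

# Read `lsof -i` style output on stdin and print "count remote-host",
# busiest remote host first. Assumes the NAME column (field 9) has the
# form local:port->remote:port.
count_peers() {
    awk '$9 ~ /->/ {
             split($9, ends, "->")          # ends[2] is the remote endpoint
             sub(/:[^:]*$/, "", ends[2])    # drop the trailing :port or :service
             n[ends[2]]++
         }
         END { for (h in n) print n[h], h }' | sort -k1,1nr -k2,2
}

# Read `dcache services` output on stdin and print each log file once.
# Assumes the LOG path is the fourth column and starts with "/".
dcache_log_files() {
    awk '$4 ~ /^\// { print $4 }' | sort -u
}
```

Possible usage: lsof -a -c java -i TCP | count_peers shows which client holds most connections. Assuming the JVM wrote its OutOfMemoryError into these logs, dcache services | dcache_log_files | xargs grep -l OutOfMemoryError lists the affected log files.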

GENERAL ADVICE :

  • recall that you can check the live file transfers with :
  • lynx --dump -width=200 http://t3dcachedb.psi.ch:2288/context/transfers.html
  • Door                     Domain             Seq Prot   Owner Proc  PnfsId                               Pool         Host           Status                   Since    S       Trans. (KB) Speed (KB/s)
    DCap-t3se01--AAU00-Wza3A t3se01-Domain-dcap 3   dcap-3 521   7033  00005283B084FFAD4845B6719436AAF7DC2A t3fs14_cms_1 192.33.123.93  WaitingForDoorTransferOk 00:35:08 RUNNING 10850796    5145
    DCap-t3se01--AAU00-XuScA t3se01-Domain-dcap 829 dcap-3 621   18767 00002FB6919E4FBB4FBBB9C0FAD2BC8BDA33 t3fs03_cms   192.33.123.139 WaitingForDoorTransferOk 00:00:20 RUNNING 315556      15398
    ...
    
  • How to check the failed SRM operations :
  • $ alias dcache
    alias dcache='ssh -2 -l admin -p 22224 t3dcachedb.psi.ch'
  • $ dcache
    [t3dcachedb03] (local) admin > \c SRM-t3se01
    [t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > print srm counters
    SRMServerV2                           requests    failed
      SrmAbortFilesRequest                      18         0
      SrmAbortRequestRequest                   624         1
      SrmCopyRequest                            24         0
      SrmLsRequest                           46135      1433
      SrmMkdirRequest                          313        71
      SrmPingRequest                         21094         0
      SrmPrepareToGetRequest                   797         0
      SrmPrepareToPutRequest                 21219         0
      SrmPutDoneRequest                      21097         0
      SrmReleaseFilesRequest                   699         0
      SrmRmRequest                           21295     21033
      SrmStatusOfCopyRequestRequest           5286         5
      SrmStatusOfGetRequestRequest              27         0
      Total                                 138628     22543
    diskCacheV111.srm.dcache.Storage      requests    failed
      abortPut(SRMUser|String|URI|String       127         1
      getFileMetaData(SRMUser|URI|boolea     46135      1432
      getFromRemoteTURL(SRMUser|URI|Stri       549         0
      getGetTurl(SRMUser|URI|String[]|UR       797         0
      getPutTurl(SRMUser|String|String[]     21219         0
      killRemoteTransfer(String)               483         0
      listDirectory(SRMUser|URI|boolean|       174         1
      pinFile(SRMUser|URI|String|long|St       797         0
      prepareToPut(SRMUser|URI|Long|Stri     21768         0
      putDone(SRMUser|String|URI|boolean     21640         0
      unPinFile(SRMUser|String|String)         797         0
      Total                                 114486      1434
    SRMServerV2                               average±stderr(ms)      min(ms)      max(ms)      STD(ms)      Samples       Period
      SrmAbortRequestRequest                     4.30±      0.29            0           57         7.23          624     17 hours
      SrmCopyRequest                            61.79±      8.23           32          203        40.31           24     16 hours
      SrmLsRequest                               9.48±      0.04            3          651         9.37       46,135     17 hours
      SrmMkdirRequest                           13.57±      2.66            3          660        47.01          313     17 hours
      SrmPrepareToGetRequest                    46.04±      2.91           18        1,002        82.10          797     16 hours
      SrmPrepareToPutRequest                    14.71±      0.10            6          606        14.16       21,219     17 hours
      SrmPutDoneRequest                         18.92±      0.56           10       11,852        81.85       21,097     17 hours
      SrmReleaseFilesRequest                     0.64±      0.05            0           31         1.29          699     16 hours
      SrmRmRequest                             114.86±     18.71            3       71,399     2,730.74       21,295     17 hours
      SrmStatusOfCopyRequestRequest              1.65±      0.26            0          884        18.77        5,277     16 hours
      SrmStatusOfGetRequestRequest               0.37±      0.11            0            2         0.56           27     15 hours
    diskCacheV111.srm.dcache.Storage          average±stderr(ms)      min(ms)      max(ms)      STD(ms)      Samples       Period
      abortPut(SRMUser|String|URI|String        13.46±      0.67            6           56         7.56          127     17 hours
      getFileMetaData(SRMUser|URI|boolea         8.62±      0.04            3          631         8.73       46,135     17 hours
      getFromRemoteTURL(SRMUser|URI|Stri       368.74±     12.17           10        1,432       285.26          549     16 hours
      getGetTurl(SRMUser|URI|String[]|UR         0.15±      0.01            0            1         0.36          797     16 hours
      getPutTurl(SRMUser|String|String[]         0.14±      0.00            0           36         0.64       21,219     17 hours
      killRemoteTransfer(String)                 0.00±      0.00            0            1         0.06          483     17 hours
      listDirectory(SRMUser|URI|boolean|        19.05±      1.23            0          183        16.24          174     16 hours
      pinFile(SRMUser|URI|String|long|St       128.90±     53.38           17       29,885     1,507.05          797     16 hours
      prepareToPut(SRMUser|URI|Long|Stri        15.46±      0.12            6          601        18.02       21,768     17 hours
      putDone(SRMUser|String|URI|boolean        18.43±      0.55           10       11,852        80.77       21,640     17 hours
      unPinFile(SRMUser|String|String)           7.17±      0.12            4           48         3.28          797     16 hours
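
In the counters above the interesting number is the failure ratio, e.g. SrmRmRequest with 21033 failures out of 21295 requests (about 98.8%). A minimal sketch to extract it from a saved copy of the print srm counters output; the function name srm_failure_rates and the file name in the usage line are hypothetical, and it assumes that counter rows have exactly three fields (name, requests, failed), as in the listing above.

```shell
# Hypothetical helper: read `print srm counters` output on stdin and print
# "name requests failed percent" for every counter with at least one failure.
# Assumes counter rows have exactly three whitespace-separated fields;
# header rows and the timing rows (nine fields) are skipped.
srm_failure_rates() {
    awk 'NF == 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 > 0 && $3 > 0 {
             printf "%s %d %d %.1f%%\n", $1, $2, $3, 100 * $3 / $2
         }'
}
```

Possible usage, with the admin shell output pasted into a file first: srm_failure_rates < srm-counters.txt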
    
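
The live transfers listing from transfers.html can be summarised per pool in the same way. Again only a sketch: the function name transfers_per_pool is hypothetical, and it assumes complete rows with 14 fields as in the sample above (Pool in column 8, Speed (KB/s) in column 14); rows with missing fields are ignored.

```shell
# Hypothetical helper: read the lynx dump of transfers.html on stdin and
# print "pool transfers total-speed-KB/s", one line per pool.
# Assumes complete 14-field rows with a numeric Seq in column 3;
# the header row (16 fields) and any other page text are skipped.
transfers_per_pool() {
    awk 'NF == 14 && $3 ~ /^[0-9]+$/ {
             count[$8]++          # number of transfers on this pool
             speed[$8] += $14     # summed speed in KB/s
         }
         END { for (p in count) print p, count[p], speed[p] }' | sort
}
```

Possible usage: lynx --dump -width=200 http://t3dcachedb.psi.ch:2288/context/transfers.html | transfers_per_pool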


Topic revision: r2 - 2016-06-10 - FabioMartinelli
 