<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 09. 06. 2016 dCache 2.15 stuck on =t3se01=

Be aware of the latest Derek's tools for dCache 2.15: https://github.com/fabiomartinelli/dcache-shellutils

   * Today =t3se01= had a high load.
   * Running =lsof | grep java= showed >100 =gsiftp= connections coming from =t3ui17=.
   * I connected to =t3ui17=.
   * There, again with =lsof=, I found that the connections belonged to =gperrin= and his =gfalFS= mount of =srm://t3se01.psi.ch=.
   * I killed his =gfalFS= mount point.
   * =t3se01= was still stuck, though, because dCache had already run out of memory and was not able to recover :(
   * The dCache logs on =t3se01= are:
      * <pre>[root@t3se01 ~]# dcache services
DOMAIN                SERVICE          CELL                  LOG
t3se01-Domain-dcap    dcap             DCap-t3se01           /var/log/dcache/t3se01-Domain-dcap.log
t3se01-Domain-gsidcap dcap             DCap-gsi-t3se01       /var/log/dcache/t3se01-Domain-gsidcap.log
t3se01-Domain-gsiftp  ftp              GFTP-t3se01           /var/log/dcache/t3se01-Domain-gsiftp.log
t3se01-Domain-srm     srm              SRM-t3se01            /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-srm     spacemanager     SpaceManager          /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-srm     transfermanagers RemoteTransferManager /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-utility pinmanager       PinManager            /var/log/dcache/t3se01-Domain-utility.log
t3se01-Domain-info    info             info                  /var/log/dcache/t3se01-Domain-info.log
t3se01-Domain-xrootd  xrootd           Xrootd-t3se01         /var/log/dcache/t3se01-Domain-xrootd.log
dCacheDomain          poolmanager      PoolManager           /var/log/dcache/dCacheDomain.log
dCacheDomain          topo             topo                  /var/log/dcache/dCacheDomain.log</pre>
   * You can check them all at once with:
      * <pre>[root@t3se01 ~]# dcache services | grep log | awk '{print $4}' | xargs -iI tail %BLUE%-v%ENDCOLOR% I
==> %BLUE%/var/log/dcache/t3se01-Domain-dcap.log%ENDCOLOR% <==
09 Jun 2016 10:51:30 (DCap-t3se01-<unknown>-AAU01IboGLA) [door:DCap-t3se01-<unknown>-AAU01IboGLA] Executing command: 3 0 client open "dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/cheidegg/sea/11/2016-06-01-17-17-00/QCD_Pt80to120_EMEnriched.root" r t3ui17.psi.ch 45550 -timeout=-1 -onerror=default -passive -uid=609
...
==> %BLUE%/var/log/dcache/t3se01-Domain-gsidcap.log%ENDCOLOR% <==
09 Jun 2016 10:47:27 (System) [info] Message arrived : <CM: S=[>info@t3se01-Domain-info:*@t3se01-Domain-info:*@dCacheDomain];D=[>System@t3se01-Domain-gsidcap];C=java.lang.String;O=<1465462047010:54631>;LO=<1465462047010:54630>;TTL=1000>
...
</pre>
   * %RED%Eventually I restarted dCache on =t3se01= with <pre>dcache restart</pre>%ENDCOLOR%

GENERAL ADVICE:

   * Recall that you can check the live file transfers with:
      * <pre>lynx --dump -width=200 http://t3dcachedb.psi.ch:2288/context/transfers.html</pre>
      * <pre>Door                     Domain             Seq Prot   Owner Proc  PnfsId                               Pool         Host           Status                   Since    S       Trans. (KB) Speed (KB/s)
DCap-t3se01--AAU00-Wza3A t3se01-Domain-dcap 3   dcap-3 521   7033  00005283B084FFAD4845B6719436AAF7DC2A t3fs14_cms_1 192.33.123.93  WaitingForDoorTransferOk 00:35:08 RUNNING 10850796    5145
DCap-t3se01--AAU00-XuScA t3se01-Domain-dcap 829 dcap-3 621   18767 00002FB6919E4FBB4FBBB9C0FAD2BC8BDA33 t3fs03_cms   192.33.123.139 WaitingForDoorTransferOk 00:00:20 RUNNING 315556      15398
...
</pre>
   * How to check the failed SRM operations:
      * <pre>$ alias dcache
alias dcache='ssh -2 -l admin -p 22224 t3dcachedb.psi.ch'</pre>
      * <pre>$ dcache
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > print srm counters
SRMServerV2                         requests  failed
SrmAbortFilesRequest                      18       0
SrmAbortRequestRequest                   624       1
SrmCopyRequest                            24       0
SrmLsRequest                           46135    1433
SrmMkdirRequest                          313      71
SrmPingRequest                         21094       0
SrmPrepareToGetRequest                   797       0
SrmPrepareToPutRequest                 21219       0
SrmPutDoneRequest                      21097       0
SrmReleaseFilesRequest                   699       0
SrmRmRequest                           21295   %RED%21033%ENDCOLOR%
SrmStatusOfCopyRequestRequest           5286       5
SrmStatusOfGetRequestRequest              27       0
Total                                 138628   22543

diskCacheV111.srm.dcache.Storage    requests  failed
abortPut(SRMUser|String|URI|String       127       1
getFileMetaData(SRMUser|URI|boolea     46135    1432
getFromRemoteTURL(SRMUser|URI|Stri       549       0
getGetTurl(SRMUser|URI|String[]|UR       797       0
getPutTurl(SRMUser|String|String[]     21219       0
killRemoteTransfer(String)               483       0
listDirectory(SRMUser|URI|boolean|       174       1
pinFile(SRMUser|URI|String|long|St       797       0
prepareToPut(SRMUser|URI|Long|Stri     21768       0
putDone(SRMUser|String|URI|boolean     21640       0
unPinFile(SRMUser|String|String)         797       0
Total                                 114486    1434

SRMServerV2                        average±stderr(ms) min(ms) max(ms)  STD(ms) Samples Period
SrmAbortRequestRequest                  4.30±    0.29       0      57     7.23     624 17 hours
SrmCopyRequest                         61.79±    8.23      32     203    40.31      24 16 hours
SrmLsRequest                            9.48±    0.04       3     651     9.37  46,135 17 hours
SrmMkdirRequest                        13.57±    2.66       3     660    47.01     313 17 hours
SrmPrepareToGetRequest                 46.04±    2.91      18   1,002    82.10     797 16 hours
SrmPrepareToPutRequest                 14.71±    0.10       6     606    14.16  21,219 17 hours
SrmPutDoneRequest                      18.92±    0.56      10  11,852    81.85  21,097 17 hours
SrmReleaseFilesRequest                  0.64±    0.05       0      31     1.29     699 16 hours
SrmRmRequest                          114.86±   18.71       3  71,399 2,730.74  21,295 17 hours
SrmStatusOfCopyRequestRequest           1.65±    0.26       0     884    18.77   5,277 16 hours
SrmStatusOfGetRequestRequest            0.37±    0.11       0       2     0.56      27 15 hours

diskCacheV111.srm.dcache.Storage   average±stderr(ms) min(ms) max(ms)  STD(ms) Samples Period
abortPut(SRMUser|String|URI|String     13.46±    0.67       6      56     7.56     127 17 hours
getFileMetaData(SRMUser|URI|boolea      8.62±    0.04       3     631     8.73  46,135 17 hours
getFromRemoteTURL(SRMUser|URI|Stri    368.74±   12.17      10   1,432   285.26     549 16 hours
getGetTurl(SRMUser|URI|String[]|UR      0.15±    0.01       0       1     0.36     797 16 hours
getPutTurl(SRMUser|String|String[]      0.14±    0.00       0      36     0.64  21,219 17 hours
killRemoteTransfer(String)              0.00±    0.00       0       1     0.06     483 17 hours
listDirectory(SRMUser|URI|boolean|     19.05±    1.23       0     183    16.24     174 16 hours
pinFile(SRMUser|URI|String|long|St    128.90±   53.38      17  29,885 1,507.05     797 16 hours
prepareToPut(SRMUser|URI|Long|Stri     15.46±    0.12       6     601    18.02  21,768 17 hours
putDone(SRMUser|String|URI|boolean     18.43±    0.55      10  11,852    80.77  21,640 17 hours
unPinFile(SRMUser|String|String)        7.17±    0.12       4      48     3.28     797 16 hours
</pre>
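
The =print srm counters= tables above are easier to triage when the counters are ranked by failure rate. The following is only a minimal sketch, not part of dCache or of Derek's tools: =srm_failure_rates= is a hypothetical helper name, it expects the raw admin shell output on stdin (without the wiki colour markup), and it assumes every counter line consists of exactly three whitespace-separated fields (name, requests, failed).

```shell
#!/bin/sh
# Hypothetical helper (not a dCache command): rank SRM counters by failure rate.
# Input on stdin: raw "print srm counters" output.
# Assumption: counter lines look like "<name> <requests> <failed>".
srm_failure_rates() {
    LC_ALL=C awk '
        # Keep only three-field lines with numeric counts; this skips the
        # table headers and the wider timing tables. "Total" rows are dropped.
        NF == 3 && $1 != "Total" &&
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ &&
        $2 > 0 && $3 > 0 {
            printf "%.1f%% %s (%d of %d failed)\n", 100 * $3 / $2, $1, $3, $2
        }
    ' | LC_ALL=C sort -rn   # highest failure rate first
}
```

For example, after saving the admin shell output to a file (=srm-counters.txt= is a made-up name), =srm_failure_rates < srm-counters.txt= would list =SrmRmRequest= first for the counters shown above, since 21033 of its 21295 requests failed.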
Topic revision: r2 - 2016-06-10 - FabioMartinelli