<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 09. 06. 2016 dCache 2.15 stuck on =t3se01=

BE AWARE OF Derek's tools, updated for dCache 2.15: https://github.com/fabiomartinelli/dcache-shellutils
<br /><br />

   * Today =t3se01= had a high load.
   * Running =lsof | grep java= showed more than 100 =gsiftp= connections coming from =t3ui17=.
   * I connected to =t3ui17=.
   * Again with =lsof=, I found that the connections belonged to =gperrin= and his =gfalFS= mount of =srm://t3se01.psi.ch=.
   * I killed his =gfalFS= mount point.
   * =t3se01= was still stuck, however, because dCache had already run out of memory and could not recover :(
   * The =t3se01= dCache logs are:
      * <pre>[root@t3se01 ~]# dcache services
DOMAIN                 SERVICE          CELL                  LOG
t3se01-Domain-dcap     dcap             DCap-t3se01           /var/log/dcache/t3se01-Domain-dcap.log
t3se01-Domain-gsidcap  dcap             DCap-gsi-t3se01       /var/log/dcache/t3se01-Domain-gsidcap.log
t3se01-Domain-gsiftp   ftp              GFTP-t3se01           /var/log/dcache/t3se01-Domain-gsiftp.log
t3se01-Domain-srm      srm              SRM-t3se01            /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-srm      spacemanager     SpaceManager          /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-srm      transfermanagers RemoteTransferManager /var/log/dcache/t3se01-Domain-srm.log
t3se01-Domain-utility  pinmanager       PinManager            /var/log/dcache/t3se01-Domain-utility.log
t3se01-Domain-info     info             info                  /var/log/dcache/t3se01-Domain-info.log
t3se01-Domain-xrootd   xrootd           Xrootd-t3se01         /var/log/dcache/t3se01-Domain-xrootd.log
dCacheDomain           poolmanager      PoolManager           /var/log/dcache/dCacheDomain.log
dCacheDomain           topo             topo                  /var/log/dcache/dCacheDomain.log</pre>
   * You can check them in parallel with:
      * <pre>[root@t3se01 ~]# dcache services | grep log | awk '{print $4}' | xargs -iI tail %BLUE%-v%ENDCOLOR% I
==> %BLUE%/var/log/dcache/t3se01-Domain-dcap.log%ENDCOLOR% <==
09 Jun 2016 10:51:30 (DCap-t3se01-<unknown>-AAU01IboGLA) [door:DCap-t3se01-<unknown>-AAU01IboGLA] Executing command: 3 0 client open "dcap://t3se01.psi.ch:22125//pnfs/psi.ch/cms/trivcat/store/user/cheidegg/sea/11/2016-06-01-17-17-00/QCD_Pt80to120_EMEnriched.root" r t3ui17.psi.ch 45550 -timeout=-1 -onerror=default -passive -uid=609
...
==> %BLUE%/var/log/dcache/t3se01-Domain-gsidcap.log%ENDCOLOR% <==
09 Jun 2016 10:47:27 (System) [info] Message arrived : <CM: S=[>info@t3se01-Domain-info:*@t3se01-Domain-info:*@dCacheDomain];D=[>System@t3se01-Domain-gsidcap];C=java.lang.String;O=<1465462047010:54631>;LO=<1465462047010:54630>;TTL=1000>
...
</pre>
   * %RED%Eventually I restarted dCache on =t3se01= with <pre>dcache restart</pre>%ENDCOLOR%

GENERAL ADVICE:<br />

   * Recall that you can check the live file transfers with:
      * <pre>lynx --dump -width=200 http://t3dcachedb.psi.ch:2288/context/transfers.html</pre>
      * <pre>Door                      Domain              Seq  Prot    Owner  Proc   PnfsId                                Pool          Host            Status                    Since     S        Trans. (KB)  Speed (KB/s)
DCap-t3se01--AAU00-Wza3A  t3se01-Domain-dcap  3    dcap-3  521    7033   00005283B084FFAD4845B6719436AAF7DC2A  t3fs14_cms_1  192.33.123.93   WaitingForDoorTransferOk  00:35:08  RUNNING  10850796     5145
DCap-t3se01--AAU00-XuScA  t3se01-Domain-dcap  829  dcap-3  621    18767  00002FB6919E4FBB4FBBB9C0FAD2BC8BDA33  t3fs03_cms    192.33.123.139  WaitingForDoorTransferOk  00:00:20  RUNNING  315556       15398
...
</pre>
   * How to check the failed SRM operations:
      * <pre>$ alias dcache
alias dcache='ssh -2 -l admin -p 22224 t3dcachedb.psi.ch'</pre>
      * <pre>$ dcache
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > print srm counters
SRMServerV2                          requests  failed
SrmAbortFilesRequest                 18        0
SrmAbortRequestRequest               624       1
SrmCopyRequest                       24        0
SrmLsRequest                         46135     1433
SrmMkdirRequest                      313       71
SrmPingRequest                       21094     0
SrmPrepareToGetRequest               797       0
SrmPrepareToPutRequest               21219     0
SrmPutDoneRequest                    21097     0
SrmReleaseFilesRequest               699       0
SrmRmRequest                         21295     %RED%21033%ENDCOLOR%
SrmStatusOfCopyRequestRequest        5286      5
SrmStatusOfGetRequestRequest         27        0
Total                                138628    22543

diskCacheV111.srm.dcache.Storage     requests  failed
abortPut(SRMUser|String|URI|String   127       1
getFileMetaData(SRMUser|URI|boolea   46135     1432
getFromRemoteTURL(SRMUser|URI|Stri   549       0
getGetTurl(SRMUser|URI|String[]|UR   797       0
getPutTurl(SRMUser|String|String[]   21219     0
killRemoteTransfer(String)           483       0
listDirectory(SRMUser|URI|boolean|   174       1
pinFile(SRMUser|URI|String|long|St   797       0
prepareToPut(SRMUser|URI|Long|Stri   21768     0
putDone(SRMUser|String|URI|boolean   21640     0
unPinFile(SRMUser|String|String)     797       0
Total                                114486    1434

SRMServerV2                          average±stderr(ms)  min(ms)  max(ms)  STD(ms)   Samples  Period
SrmAbortRequestRequest               4.30± 0.29          0        57       7.23      624      17 hours
SrmCopyRequest                       61.79± 8.23         32       203      40.31     24       16 hours
SrmLsRequest                         9.48± 0.04          3        651      9.37      46,135   17 hours
SrmMkdirRequest                      13.57± 2.66         3        660      47.01     313      17 hours
SrmPrepareToGetRequest               46.04± 2.91         18       1,002    82.10     797      16 hours
SrmPrepareToPutRequest               14.71± 0.10         6        606      14.16     21,219   17 hours
SrmPutDoneRequest                    18.92± 0.56         10       11,852   81.85     21,097   17 hours
SrmReleaseFilesRequest               0.64± 0.05          0        31       1.29      699      16 hours
SrmRmRequest                         114.86± 18.71       3        71,399   2,730.74  21,295   17 hours
SrmStatusOfCopyRequestRequest        1.65± 0.26          0        884      18.77     5,277    16 hours
SrmStatusOfGetRequestRequest         0.37± 0.11          0        2        0.56      27       15 hours

diskCacheV111.srm.dcache.Storage     average±stderr(ms)  min(ms)  max(ms)  STD(ms)   Samples  Period
abortPut(SRMUser|String|URI|String   13.46± 0.67         6        56       7.56      127      17 hours
getFileMetaData(SRMUser|URI|boolea   8.62± 0.04          3        631      8.73      46,135   17 hours
getFromRemoteTURL(SRMUser|URI|Stri   368.74± 12.17       10       1,432    285.26    549      16 hours
getGetTurl(SRMUser|URI|String[]|UR   0.15± 0.01          0        1        0.36      797      16 hours
getPutTurl(SRMUser|String|String[]   0.14± 0.00          0        36       0.64      21,219   17 hours
killRemoteTransfer(String)           0.00± 0.00          0        1        0.06      483      17 hours
listDirectory(SRMUser|URI|boolean|   19.05± 1.23         0        183      16.24     174      16 hours
pinFile(SRMUser|URI|String|long|St   128.90± 53.38       17       29,885   1,507.05  797      16 hours
prepareToPut(SRMUser|URI|Long|Stri   15.46± 0.12         6        601      18.02     21,768   17 hours
putDone(SRMUser|String|URI|boolean   18.43± 0.55         10       11,852   80.77     21,640   17 hours
unPinFile(SRMUser|String|String)     7.17± 0.12          4        48       3.28      797      16 hours
</pre>
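
The =lsof= check from the incident above (spotting one client host holding more than 100 connections) could be condensed into a small filter. This is only a sketch: the function name =count_peers= is hypothetical, and it assumes the usual =lsof= network listing in which the NAME column reads =local:port->remote:port=.

```shell
# count_peers (hypothetical helper): reads lsof network output on stdin and
# prints "<connections> <remote host>" sorted by connection count, highest first.
count_peers() {
    awk '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "->")            # only connected sockets contain "->"
            if (n > 0) {
                peer = substr($i, n + 2)   # text after "->" is remote:port
                sub(/:[^:]*$/, "", peer)   # drop the trailing :port
                count[peer]++
            }
        }
    }
    END { for (p in count) print count[p], p }' | sort -rn
}
```

A possible use on the storage element would be =lsof -P -i -a -c java | count_peers | head=, which restricts =lsof= to the network sockets of processes whose command name starts with =java=. Listening sockets have no remote end and are ignored.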
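
To spot entries like =SrmRmRequest= in the =print srm counters= output without reading the whole table, a filter along these lines might help. It is a sketch under assumptions: the function name =srm_failure_report= and the default threshold of 5% are hypothetical, and the input is expected to be the plain admin-shell output (name, requests, failed) saved to a file, without the wiki colour markup.

```shell
# srm_failure_report (hypothetical helper): reads "print srm counters" output
# on stdin and prints the request types whose failed/requests ratio is at
# least the given threshold (first argument, default 0.05).
srm_failure_report() {
    local threshold="${1:-0.05}"
    awk -v thr="$threshold" '
        # Counter rows have exactly three fields: name, requests, failed.
        # Header rows, Total rows and the timing tables are skipped.
        NF == 3 && $1 != "Total" && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $2 > 0 {
            ratio = $3 / $2
            if (ratio >= thr)
                printf "%s %d/%d %.1f%%\n", $1, $3, $2, 100 * ratio
        }'
}
```

For example, with the counters saved in a file (the name =counters.txt= is hypothetical), =srm_failure_report 0.5 < counters.txt= would list only the request types where at least half of the calls failed.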
Topic revision: r2 - 2016-06-10 - FabioMartinelli