<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
#   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
#   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

---+!! %TOPIC%
%TOC%

---++ Symptoms
Summary: %FORMFIELD{"Symptom summary"}%

---++ Occurrences
At what times did this problem occur (used to estimate frequency):
| 2016-06-09 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00002.html][T3 user complaint]] |
| 2016-06-09 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00013.html][Fabio's complaint]] |
| 2016-06-26 | |
| 2016-06-29 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00035.html][Fabio restarted the SRM services again]] |
| 2016-07-19 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-07/msg00022.html][Urs' complaint]] |
| 2016-07-23 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3/2016-07/msg00087.html][Constantin's complaint]] |

---++ Observations
<!--
#collect here the information which may help to better understand the state of the system or services, e.g.
#log excerpts, strace output, etc.
#this also may help to identify the problem if similar conditions arise again
-->
In general, SRM calls begin to fail or become unresponsive. Fabio opened this [[https://lists.dcache.org/sympa/arc/user-forum/2016-06/msg00057.html][dCache thread]], which led to the discovery of a dCache bug (see below).

---++ Solution or Workaround
---+++ Solution
   * In the long term we have to update dCache from 2.15.5 to at least 2.15.11 because of this [[https://github.com/dcache/dcache/commit/7e9f7d67cdbe5e727db65127b3e4f794b5b3e391][bug]].
---+++ Workaround
   * %GREEN%DCACHE 2.15 UPDATED IN SEP '16%ENDCOLOR%
   * To quickly restore the service, run on =t3se01=:
      * =dcache restart t3se01-Domain-srm ; dcache restart t3se01-Domain-gsiftp=
   * T3 users have several other ways to upload, download, and list files while the SRM error is occurring; %BLUE%root://%ENDCOLOR% is to be preferred:
      * <pre>gfal-ls -Hl gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl gsidcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl root://t3se01.psi.ch/store/user
%BLUE%gfal-ls -Hl root://t3dcachedb03.psi.ch/pnfs/psi.ch/cms/trivcat/store/user%ENDCOLOR%</pre>

---++ Monitoring for this condition
<!-- #how can this condition be recognized automatically, if at all?
-->
---+++ Nagios
[[https://t3nagios.psi.ch/nagios/cgi-bin/status.cgi?host=t3cmsvobox01&limit=0][The T3 Nagios constantly exercises the SRM interface by downloading a file from each T3 pool]], which in turn also checks that all the T3 pools are alive.

---+++ dCache admin shell
To mitigate these SRM errors, Fabio set the following limits on the SRM operations:
<pre>
[root@t3se01 ~]# grep srm /etc/dcache/layouts/t3se01.conf | egrep "=[0-9]+"
srm.request.max-requests=%ORANGE%400%ENDCOLOR%
srm.request.put.max-requests=%BLUE%100%ENDCOLOR%
srm.request.get.max-inprogress=%BLUE%100%ENDCOLOR%
srm.request.copy.max-inprogress=%BLUE%100%ENDCOLOR%
srm.request.max-transfers=%BLUE%100%ENDCOLOR%
srm.limits.ls.entries=%BLUE%10000%ENDCOLOR%
</pre>
The limits in effect can be observed at run time with:
%TWISTY%
<pre>
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > info -l
--- config (SRM configuration) ---
"defaultSpaceLifetime" request lifetime: 86400000
"get" request lifetime: 14400000
"bringOnline" request lifetime: 14400000
"put" request lifetime: 14400000
"copy" request lifetime: 14400000
debug=true
gsissl=true
gridftp buffer_size=1048576
gridftp tcp_buffer_size=1048576
gridftp parallel_streams=10
gsiftpclinet=globus-url-copy
urlcopy=../scripts/urlcopy.sh
srm_root=/
timeout_script=../scripts/timeout.sh
urlcopy timeout in seconds=3600
proxies directory=../proxies
port=8443
srmHost=t3se01.psi.ch
localSrmHosts=t3se01.psi.ch,
useUrlcopyScript=false
useGsiftpForSrmCopy=true
useHttpForSrmCopy=true
useDcapForSrmCopy=false
useFtpForSrmCopy=true
*** GetRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** BringOnlineRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** LsRequests Parameters **
max poll period in milliseconds =60000
switch to async mode delay=1000
*** PutRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** ReserveSpaceRequests Parameters **
request Lifetime in milliseconds =86400000
max poll period in milliseconds =60000
*** CopyRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
*** Bring Online Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Get Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Ls Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Reserve Space Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Copy Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Put Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
storage_info_update_period=30000
qosPluginClass=null
qosConfigFile=null
clientDNSLookup=false
clientTransport=GSI
--- gridsite-credential-service (GridSite delegation service providing delegated credentials to other dCache services) ---
--- lb (Registers the door with a LoginBroker) ---
LoginBroker : LoginBrokerTopic@local
Protocol Family : srm
Protocol Version : 1.1.1
Port : 8443
Addresses : [fe80:0:0:0:250:56ff:fe95:14dd/fe80:0:0:0:250:56ff:fe95:14dd, t3se01.psi.ch/192.33.123.24]
Tags : [glue, srm]
Root : /
Read paths : [/]
Write paths : [/]
Update Time : 5 SECONDS
Update Threshold : 10 %
Last event : NOROUTE
--- login-strategy (Caching gPlazma client) ---
gPlazma login cache: CacheStats{hitCount=53808, missCount=1046, loadSuccessCount=976, loadExceptionCount=0, totalLoadTime=38321447732, evictionCount=956}
gPlazma map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
gPlazma reverse map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
--- scheduler-bringonline (Scheduler for BRING-ONLINE operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%10000%ENDCOLOR%) ........ 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-copy (Scheduler for COPY operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%100%ENDCOLOR%) .......... 33 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 33
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-get (Scheduler for GET operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%100%ENDCOLOR%) .......... 0 [InProgress]
Queued for transfer ............ 0 [RQueued]
Waiting for transfer (max %BLUE%100%ENDCOLOR%) . 4 [Ready]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 4
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-ls (Scheduler for LS operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%50%ENDCOLOR%) ........... 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-put (Scheduler for PUT operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%50%ENDCOLOR%) ........... 0 [InProgress]
Queued for transfer ............ 0 [RQueued]
Waiting for transfer (max %BLUE%100%ENDCOLOR%) . 15 [Ready]
-------------------------------------
Total requests (max %BLUE%100%ENDCOLOR%) ....... 15
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-reserve-space (Scheduler for RESERVE-SPACE operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%10%ENDCOLOR%) ........... 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- storage (dCache plugin for SRM) ---
Custom reverse DNS lookup cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space token by owner cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space by token cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
</pre>
%ENDTWISTY%
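
The scheduler counters in the =info -l= output above could also be checked by a script instead of by eye. The following is only a minimal sketch, not an existing dCache or Nagios tool: the function name =srm_sched_usage=, the default threshold of 80 percent, and the file name =info-l.txt= are hypothetical. It assumes the raw admin shell output has been saved to a file, i.e. without the TWiki colour macros used on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of dCache): summarise the "In progress"
# load of each SRM scheduler from "info -l" output read on stdin.
# $1 = warning threshold in percent of the in-progress maximum (default 80).
srm_sched_usage() {
    awk -v pct="${1:-80}" '
        # Section header, e.g.
        # "--- scheduler-copy (Scheduler for COPY operations) ---"
        $1 == "---" && $2 ~ /^scheduler-/ { name = $2 }
        # Counter line, e.g.
        # "In progress (max 100) .......... 33 [InProgress]"
        /In progress \(max [0-9]+\)/ {
            max = $4; gsub(/[^0-9]/, "", max)   # "100)" becomes "100"
            count = $(NF - 1)                   # value just before "[InProgress]"
            state = (count * 100 >= max * pct) ? "WARN" : "ok"
            printf "%s %d/%d %s\n", name, count, max, state
        }'
}
```

For example, =srm_sched_usage 30 < info-l.txt= applied to the =scheduler-copy= section shown above (33 of 100 in progress) would print =scheduler-copy 33/100 WARN=, while the default threshold of 80 would report it as =ok=.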
| *IssueForm* ||
| Affected Service | SRM |
| Symptom summary | SRM calls issued to t3se01.psi.ch fail or get unresponsive |
| Reason Understood | yes |
| Solution Exists | yes |
| Obsolete | yes |
Topic revision: r3 - 2016-10-09 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.