<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
#   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
#   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

---+!! %TOPIC%
%TOC%

---++ Symptoms
Summary: %FORMFIELD{"Symptom summary"}%

---++ Occurrences
Times at which this problem occurred (used to estimate frequency):

| 2016-06-09 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00002.html][T3 user complaint]] |
| 2016-06-09 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00013.html][Fabio's complaint]] |
| 2016-06-26 | |
| 2016-06-29 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-06/msg00035.html][Fabio restarted the SRM services again]] |
| 2016-07-19 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3-users/2016-07/msg00022.html][Urs' complaint]] |
| 2016-07-23 | [[https://psilists.ethz.ch/sympa/arc/cms-tier3/2016-07/msg00087.html][Constantin's complaint]] |

---++ Observations
<!--
#collect here the information which may help to better understand the state of the system or services, e.g.
#log excerpts, strace output, etc.
#this also may help to identify the problem if similar conditions arise again
-->
In general, SRM calls begin to fail or become unresponsive. Fabio opened this [[https://lists.dcache.org/sympa/arc/user-forum/2016-06/msg00057.html][dCache thread]], which led to a dCache bug (see below).

---++ Solution or Workaround
---+++ Solution
   * In the long term we have to update dCache from 2.15.5 to at least 2.15.11 because of this [[https://github.com/dcache/dcache/commit/7e9f7d67cdbe5e727db65127b3e4f794b5b3e391][bug]].

---+++ Workaround
   * %GREEN%DCACHE 2.15 UPDATED IN SEP '16%ENDCOLOR%
   * To quickly restore service, run on =t3se01=:
      * =dcache restart t3se01-Domain-srm ; dcache restart t3se01-Domain-gsiftp=
   * T3 users have several other ways to upload, download and list files while the SRM error is occurring; %BLUE%root://%ENDCOLOR% is to be preferred:
      * <pre>
gfal-ls -Hl gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl gsidcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gfal-ls -Hl root://t3se01.psi.ch/store/user
%BLUE%gfal-ls -Hl root://t3dcachedb03.psi.ch/pnfs/psi.ch/cms/trivcat/store/user%ENDCOLOR%
</pre>

---++ Monitoring for this condition
<!--
#how can this condition be recognized automatically, if at all?
-->
---+++ Nagios
[[https://t3nagios.psi.ch/nagios/cgi-bin/status.cgi?host=t3cmsvobox01&limit=0][The T3 Nagios constantly exercises the SRM interface by downloading a file from each T3 pool]] (which in turn also checks that all the T3 pools are alive).

---+++ dCache admin shell
To mitigate these SRM errors, Fabio set the following limits on several SRM operations:
<pre>
[root@t3se01 ~]# grep srm /etc/dcache/layouts/t3se01.conf | egrep "=[0-9]+"
srm.request.max-requests=%ORANGE%400%ENDCOLOR%
srm.request.put.max-requests=%BLUE%100%ENDCOLOR%
srm.request.get.max-inprogress=%BLUE%100%ENDCOLOR%
srm.request.copy.max-inprogress=%BLUE%100%ENDCOLOR%
srm.request.max-transfers=%BLUE%100%ENDCOLOR%
srm.limits.ls.entries=%BLUE%10000%ENDCOLOR%
</pre>
The limits can be observed at run time with:
%TWISTY%
<pre>
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > info -l
--- config (SRM configuration) ---
"defaultSpaceLifetime" request lifetime: 86400000
"get" request lifetime: 14400000
"bringOnline" request lifetime: 14400000
"put" request lifetime: 14400000
"copy" request lifetime: 14400000
debug=true
gsissl=true
gridftp buffer_size=1048576
gridftp tcp_buffer_size=1048576
gridftp parallel_streams=10
gsiftpclinet=globus-url-copy
urlcopy=../scripts/urlcopy.sh
srm_root=/
timeout_script=../scripts/timeout.sh
urlcopy timeout in seconds=3600
proxies directory=../proxies
port=8443
srmHost=t3se01.psi.ch
localSrmHosts=t3se01.psi.ch,
useUrlcopyScript=false
useGsiftpForSrmCopy=true
useHttpForSrmCopy=true
useDcapForSrmCopy=false
useFtpForSrmCopy=true
*** GetRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** BringOnlineRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** LsRequests Parameters **
max poll period in milliseconds =60000
switch to async mode delay=1000
*** PutRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
switch to async mode delay=1000
*** ReserveSpaceRequests Parameters **
request Lifetime in milliseconds =86400000
max poll period in milliseconds =60000
*** CopyRequests Parameters **
request Lifetime in milliseconds =14400000
max poll period in milliseconds =60000
*** Bring Online Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Get Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Ls Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Reserve Space Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Copy Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
*** Put Store Parameters ***
databaseEnabled=true
storeCompletedRequestsOnly=true
requestHistoryDatabaseEnabled=true
cleanPendingRequestsOnRestart=false
keepRequestHistoryPeriod=10 days
expiredRequestRemovalPeriod=600 seconds
storage_info_update_period=30000
qosPluginClass=null
qosConfigFile=null
clientDNSLookup=false
clientTransport=GSI
--- gridsite-credential-service (GridSite delegation service providing delegated credentials to other dCache services) ---
--- lb (Registers the door with a LoginBroker) ---
LoginBroker : LoginBrokerTopic@local
Protocol Family : srm
Protocol Version : 1.1.1
Port : 8443
Addresses : [fe80:0:0:0:250:56ff:fe95:14dd/fe80:0:0:0:250:56ff:fe95:14dd, t3se01.psi.ch/192.33.123.24]
Tags : [glue, srm]
Root : /
Read paths : [/]
Write paths : [/]
Update Time : 5 SECONDS
Update Threshold : 10 %
Last event : NOROUTE
--- login-strategy (Caching gPlazma client) ---
gPlazma login cache: CacheStats{hitCount=53808, missCount=1046, loadSuccessCount=976, loadExceptionCount=0, totalLoadTime=38321447732, evictionCount=956}
gPlazma map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
gPlazma reverse map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
--- scheduler-bringonline (Scheduler for BRING-ONLINE operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%10000%ENDCOLOR%) ........ 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-copy (Scheduler for COPY operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%100%ENDCOLOR%) .......... 33 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 33
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-get (Scheduler for GET operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%100%ENDCOLOR%) .......... 0 [InProgress]
Queued for transfer ............ 0 [RQueued]
Waiting for transfer (max %BLUE%100%ENDCOLOR%) . 4 [Ready]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 4
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-ls (Scheduler for LS operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%50%ENDCOLOR%) ........... 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-put (Scheduler for PUT operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%50%ENDCOLOR%) ........... 0 [InProgress]
Queued for transfer ............ 0 [RQueued]
Waiting for transfer (max %BLUE%100%ENDCOLOR%) . 15 [Ready]
-------------------------------------
Total requests (max %BLUE%100%ENDCOLOR%) ....... 15
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- scheduler-reserve-space (Scheduler for RESERVE-SPACE operations) ---
Queued ......................... 0 [Queued]
In progress (max %BLUE%10%ENDCOLOR%) ........... 0 [InProgress]
-------------------------------------
Total requests (max %ORANGE%400%ENDCOLOR%) ....... 0
Scheduling strategy : inprogress-fair-share
Transfer strategy : fair-share
--- storage (dCache plugin for SRM) ---
Custom reverse DNS lookup cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space token by owner cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space by token cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
</pre>
%ENDTWISTY%
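
The scheduler counters in the =info -l= output above could also be compared with their limits automatically. What follows is a minimal sketch in Python, assuming the output has already been captured as plain text (that is, without the wiki colour markup; how it is fetched from the admin shell is not shown). The function names, the sample excerpt and the default threshold are illustrative and not part of dCache. Whether counters close to their limits actually precede the SRM failures has not been verified here.

```python
import re

# Section header, e.g. "--- scheduler-copy (Scheduler for COPY operations) ---"
SECTION = re.compile(r"^--- scheduler-([\w-]+) \(")
# Counter line with a limit, e.g. "In progress (max 100) .......... 33 [InProgress]"
COUNTER = re.compile(r"^(.+?)\s+\(max (\d+)\)\s+\.+\s+(\d+)")

# Excerpt in the shape of the `info -l` output shown on this page (illustrative).
SAMPLE = """\
--- scheduler-copy (Scheduler for COPY operations) ---
Queued ......................... 0 [Queued]
In progress (max 100) .......... 33 [InProgress]
-------------------------------------
Total requests (max 400) ....... 33
--- scheduler-put (Scheduler for PUT operations) ---
In progress (max 50) ........... 0 [InProgress]
Waiting for transfer (max 100) . 15 [Ready]
-------------------------------------
Total requests (max 100) ....... 15
--- storage (dCache plugin for SRM) ---
"""


def scheduler_usage(info_output):
    """Return {scheduler name: {counter label: (current, maximum)}}."""
    usage = {}
    current = None
    for raw in info_output.splitlines():
        line = raw.strip()
        header = SECTION.match(line)
        if header:
            current = header.group(1)
            usage[current] = {}
            continue
        if line.startswith("--- "):
            # Any other "--- name (...) ---" section ends the scheduler block.
            current = None
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            label, maximum, value = counter.groups()
            usage[current][label] = (int(value), int(maximum))
    return usage


def saturated(usage, threshold=0.8):
    """List (scheduler, label, current, maximum) where current >= threshold * maximum."""
    hits = []
    for scheduler, counters in usage.items():
        for label, (value, maximum) in counters.items():
            if maximum > 0 and value >= threshold * maximum:
                hits.append((scheduler, label, value, maximum))
    return hits


if __name__ == "__main__":
    for hit in saturated(scheduler_usage(SAMPLE), threshold=0.3):
        print("near limit:", hit)
```

Such a check could be wrapped in a Nagios plugin next to the existing SRM download probe; the threshold would need tuning against real incidents.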
| *IssueForm* ||
| Affected Service | SRM |
| Symptom summary | SRM calls issued to t3se01.psi.ch fail or get unresponsive |
| Reason Understood | yes |
| Solution Exists | yes |
| Obsolete | yes |
Topic revision: r3 - 2016-10-09 - FabioMartinelli