Difference: IssueSRMcallsFailOrGetUnresponsive (1 vs. 2)

Revision 2, 2016-07-23 - FabioMartinelli

 
META TOPICPARENT name="AdminArea"

Revision 1, 2016-07-19 - FabioMartinelli

<-- keep this as a security measure:
   #uncomment if the subject should only be modifiable by the listed groups 
   # * Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.CMSAdminGroup
   # * Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.CMSAdminGroup
   #uncomment this if you want the page only be viewable by the listed groups
   # * Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.CMSAdminGroup
-->

IssueSRMcallsFailOrGetUnresponsive

Symptoms

Summary: SRM calls issued to t3se01.psi.ch fail or become unresponsive

Occurrences

At what times did this problem occur (used to estimate frequency):
2016-06-09 T3 user complaint
2016-06-09 Fabio's complaint
2016-06-26  
2016-06-29 Fabio restarted the SRM services again
2016-07-19 Urs' complaint

Observations

In general, SRM calls begin to fail or become unresponsive. Fabio opened a dCache thread that led to a dCache bug being identified (see below).

Solution or Workaround

  • In the long term we have to update dCache from 2.15.5 to at least 2.15.11 because of the bug
  • To restore service quickly, run on t3se01:
    • dcache restart t3se01-Domain-srm and, if that is not enough, also dcache restart t3se01-Domain-gsiftp
  • T3 users have several other ways to upload, download, and list files while the SRM error is occurring; root:// is preferred:
    • gfal-ls -Hl gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl gsidcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl root://t3se01.psi.ch/store/user
      gfal-ls -Hl root://t3dcachedb03.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
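As a rough illustration of the fallback idea above, the following sketch tries each of the listed endpoints in turn and stops at the first one that answers. The function names t3_urls and try_listing are made up for this example, and the ordering (root:// endpoints first) is only an assumption based on the stated preference.

```shell
#!/bin/sh
# Sketch only: t3_urls and try_listing are hypothetical helper names.

# Endpoints copied from the list above, root:// first because it is preferred.
t3_urls() {
    cat <<'EOF'
root://t3se01.psi.ch/store/user
root://t3dcachedb03.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
gsidcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
EOF
}

# try_listing CMD [ARGS...]
# Runs "CMD ARGS URL" for each endpoint until one succeeds.
# Progress goes to stderr so the listing itself stays clean on stdout.
try_listing() {
    for url in $(t3_urls); do
        if "$@" "$url"; then
            echo "OK: $url" >&2
            return 0
        fi
        echo "FAILED: $url" >&2
    done
    return 1
}

# Usage (needs the gfal2 utilities and a valid grid proxy):
#   try_listing gfal-ls -Hl
```

Passing the listing command as arguments keeps the loop independent of gfal-ls, so the same helper can be tried with another client if needed.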

Monitoring for this condition


Nagios

The T3 Nagios continuously exercises the SRM interface by downloading a file from each T3 pool (which in turn also checks that all T3 pools are alive).
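The actual T3 Nagios check is not reproduced on this page. As a minimal sketch of how such a probe could distinguish "fails" from "gets unresponsive", the wrapper below runs an arbitrary probe command under GNU timeout and maps the result to the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL). The function name check_srm and the PROBE_URL placeholder are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: check_srm is a hypothetical helper, not the real T3 check.

# check_srm SECONDS CMD [ARGS...]
# Runs CMD under GNU coreutils "timeout"; CMD must be an external program.
# Prints a one-line status and returns a Nagios plugin exit code.
check_srm() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)
            echo "SRM OK"
            return 0 ;;
        124)
            # GNU timeout exits with 124 when the time limit is hit.
            echo "SRM CRITICAL - probe unresponsive after ${limit}s"
            return 2 ;;
        *)
            echo "SRM CRITICAL - probe failed (exit $rc)"
            return 2 ;;
    esac
}

# Usage, assuming PROBE_URL holds the srm:// URL of a small test file
# (placeholder, not taken from this page):
#   check_srm 60 gfal-copy -f "$PROBE_URL" file:///tmp/srm-probe
```

Treating a timeout separately from an ordinary failure makes it visible in the Nagios status line which of the two symptoms is occurring.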

dCache admin shell

To mitigate these SRM errors, Fabio set the following limits on the various SRM operations:
[root@t3se01 ~]# grep srm  /etc/dcache/layouts/t3se01.conf | egrep "=[0-9]+"  
srm.request.max-requests=400
srm.request.put.max-requests=100
srm.request.get.max-inprogress=100
srm.request.copy.max-inprogress=100
srm.request.max-transfers=100
srm.limits.ls.entries=10000
These limits can be observed at run time in the dCache admin shell:
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > info -l 
--- config (SRM configuration) ---
	"defaultSpaceLifetime"  request lifetime: 86400000
	"get"  request lifetime: 14400000
	"bringOnline"  request lifetime: 14400000
	"put"  request lifetime: 14400000
	"copy" request lifetime: 14400000
	debug=true
	gsissl=true
	gridftp buffer_size=1048576
	gridftp tcp_buffer_size=1048576
	gridftp parallel_streams=10
	gsiftpclinet=globus-url-copy
	urlcopy=../scripts/urlcopy.sh
	srm_root=/
	timeout_script=../scripts/timeout.sh
	urlcopy timeout in seconds=3600
	proxies directory=../proxies
	port=8443
	srmHost=t3se01.psi.ch
	localSrmHosts=t3se01.psi.ch, 
	useUrlcopyScript=false
	useGsiftpForSrmCopy=true
	useHttpForSrmCopy=true
	useDcapForSrmCopy=false
	useFtpForSrmCopy=true
		 *** GetRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** BringOnlineRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** LsRequests Parameters **
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** PutRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** ReserveSpaceRequests Parameters **
		 request Lifetime in milliseconds =86400000
		 max poll period in milliseconds =60000
		 *** CopyRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		*** Bring Online Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Get Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Ls Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Reserve Space Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Copy Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Put Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
	storage_info_update_period=30000
	qosPluginClass=null
	qosConfigFile=null
	clientDNSLookup=false
	clientTransport=GSI

--- gridsite-credential-service (GridSite delegation service providing delegated credentials to other dCache services) ---

--- lb (Registers the door with a LoginBroker) ---
    LoginBroker      : LoginBrokerTopic@local
    Protocol Family  : srm
    Protocol Version : 1.1.1
    Port             : 8443
    Addresses        : [fe80:0:0:0:250:56ff:fe95:14dd/fe80:0:0:0:250:56ff:fe95:14dd, t3se01.psi.ch/192.33.123.24]
    Tags             : [glue, srm]
    Root             : /
    Read paths       : [/]
    Write paths      : [/]
    Update Time      : 5 SECONDS
    Update Threshold : 10 %
    Last event       : NOROUTE

--- login-strategy (Caching gPlazma client) ---
gPlazma login cache: CacheStats{hitCount=53808, missCount=1046, loadSuccessCount=976, loadExceptionCount=0, totalLoadTime=38321447732, evictionCount=956}
gPlazma map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
gPlazma reverse map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}

--- scheduler-bringonline (Scheduler for BRING-ONLINE operations) ---
    Queued .........................   0     [Queued]
    In progress (max 10000) ........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-copy (Scheduler for COPY operations) ---
    Queued .........................   0     [Queued]
    In progress (max 100) ..........  33     [InProgress]
    -------------------------------------
    Total requests (max 400) .......  33

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-get (Scheduler for GET operations) ---
    Queued .........................   0     [Queued]
    In progress (max 100) ..........   0     [InProgress]
    Queued for transfer ............   0     [RQueued]
    Waiting for transfer (max 100) .   4     [Ready]
    -------------------------------------
    Total requests (max 400) .......   4

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-ls (Scheduler for LS operations) ---
    Queued .........................   0     [Queued]
    In progress (max 50) ...........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-put (Scheduler for PUT operations) ---
    Queued .........................   0     [Queued]
    In progress (max 50) ...........   0     [InProgress]
    Queued for transfer ............   0     [RQueued]
    Waiting for transfer (max 100) .  15     [Ready]
    -------------------------------------
    Total requests (max 100) .......  15

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-reserve-space (Scheduler for RESERVE-SPACE operations) ---
    Queued .........................   0     [Queued]
    In progress (max 10) ...........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- storage (dCache plugin for SRM) ---
Custom reverse DNS lookup cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space token by owner cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space by token cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
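To see at a glance how close each scheduler is to its configured limit, the "Total requests (max N)" lines of the info -l output above can be summarised with a small filter. This is a sketch under the assumption that the output has been saved to a text file (srm-info.txt is a hypothetical name); the function names scheduler_usage and near_limit are likewise made up here.

```shell
#!/bin/sh
# Sketch only: parses saved "info -l" output from the SRM cell on stdin.

# Prints one line per scheduler: NAME TOTAL MAX
scheduler_usage() {
    awk '
        /^--- scheduler-/ { name = $2 }
        /Total requests \(max [0-9]+\)/ {
            max = $0
            sub(/.*\(max /, "", max)   # drop everything up to "(max "
            sub(/\).*/, "", max)       # drop the closing paren and the rest
            print name, $NF, max
        }
    '
}

# near_limit [PERCENT]
# Prints the schedulers whose total is at or above PERCENT of the maximum
# (default 80).
near_limit() {
    pct="${1:-80}"
    scheduler_usage |
        awk -v pct="$pct" '$2 * 100 >= $3 * pct { print $1 ": " $2 "/" $3 }'
}

# Usage:
#   near_limit 80 < srm-info.txt
```

With the numbers shown above, scheduler-put sits at 15 of 100 and scheduler-copy at 33 of 400, so neither would be reported at the default 80 percent threshold.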

META FORM name="IssueForm"
FORM FIELD Affected Service AffectedService SRM
FORM FIELD Symptom summary Symptomsummary SRM calls issued to t3se01.psi.ch fail or get unresponsive
FORM FIELD Reason Understood ReasonUnderstood yes
FORM FIELD Solution Exists SolutionExists workaround
FORM FIELD Obsolete Obsolete no
META TOPICMOVED by="fabiom" date="1468925236" from="CmsTier3.IssueSRMcallsFailOrGetsUnresponsive" to="CmsTier3.IssueSRMcallsFailOrGetUnresponsive"