IssueSRMcallsFailOrGetUnresponsive

Symptoms

Summary: SRM calls issued to t3se01.psi.ch fail or become unresponsive

Occurrences

At what times did this problem occur (used to estimate frequency):
2016-06-09 T3 user complaint
2016-06-09 Fabio's complaint
2016-06-26  
2016-06-29 Fabio restarted the SRM services again
2016-07-19 Urs' complaint
2016-07-23 Constantin's complaint
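
Since the dates above are kept to estimate frequency, the mean interval between occurrences can be computed directly from them. The following is only a sketch: it assumes GNU date (for `date -d`), and the function name mean_interval_days is made up for this example.

```shell
# Estimate how often the problem recurs from a list of occurrence dates.
# Assumes GNU date (for `date -d`); dates are read one per line, YYYY-MM-DD.
# Duplicate dates are counted once.
mean_interval_days() {
    sort -u | while read -r d; do
        date -u -d "$d" +%s          # convert each date to epoch seconds
    done | awk '
        NR == 1 { first = $1 }
        { last = $1; n = NR }
        END {
            if (n < 2) { print "n/a"; exit }
            # (last - first) in days, divided by the number of gaps
            printf "%.1f\n", (last - first) / 86400 / (n - 1)
        }'
}

# Occurrence dates copied from the list above.
printf '%s\n' 2016-06-09 2016-06-09 2016-06-26 2016-06-29 2016-07-19 2016-07-23 |
    mean_interval_days
```

For the entries listed so far (five distinct dates spanning 44 days) this prints 11.0, i.e. roughly one incident every 11 days.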

Observations

In general, SRM calls begin to fail or become unresponsive. Fabio opened a dCache thread on the issue, which led to the identification of a dCache bug (see below).

Solution or Workaround

  • In the long term, dCache has to be updated from 2.15.5 to at least 2.15.11 because of the bug
  • To restore service quickly, run on t3se01:
    • dcache restart t3se01-Domain-srm ; dcache restart t3se01-Domain-gsiftp
  • T3 users have several other ways to upload, download, and list files while the SRM error is occurring; root:// is preferred:
    • gfal-ls -Hl gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl dcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl gsidcap://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
      gfal-ls -Hl root://t3se01.psi.ch/store/user
      gfal-ls -Hl root://t3dcachedb03.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
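
While SRM is down, a user script can walk through the alternative URLs listed above and use the first one that answers. This is a minimal sketch, not an official T3 tool: the function name first_working_url, the LISTER override variable, and the 60 second limit are all assumptions made for the example; only the gfal-ls -Hl invocation comes from this page.

```shell
# Try each fallback URL in order and print the first one that answers.
# LISTER is a hypothetical override hook; by default it runs `gfal-ls -Hl`
# as in the examples above. The 60 s limit is an arbitrary choice.
LISTER=${LISTER:-"gfal-ls -Hl"}

first_working_url() {
    for url in "$@"; do
        # $LISTER is left unquoted on purpose so "gfal-ls -Hl" splits into words.
        if timeout 60 $LISTER "$url" >/dev/null 2>&1; then
            echo "$url"
            return 0
        fi
    done
    return 1    # none of the URLs answered
}

# Usage (root:// first, since it is the preferred protocol):
#   first_working_url \
#       root://t3se01.psi.ch/store/user \
#       gsiftp://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user
```

The function only reports which URL worked; the caller then runs the real transfer or listing against it.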

Monitoring for this condition

Nagios

The T3 Nagios continuously exercises the SRM interface by downloading a file from each T3 pool, which in turn also checks that all T3 pools are alive.
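
For illustration, a check of this kind can be written as a small wrapper that runs a probe under a time limit and distinguishes "failed" from "unresponsive", the two symptoms described on this page. The sketch below is not the actual T3 Nagios plugin: the function name check_probe and its messages are invented here. It relies only on the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL) and on GNU timeout returning 124 when the limit is hit.

```shell
# Minimal Nagios-style wrapper: run a probe command under a time limit and
# map the outcome to the standard plugin exit codes (0 = OK, 2 = CRITICAL).
# The probe command and the limit are supplied by the caller; nothing here
# reflects the real T3 Nagios configuration.
check_probe() {
    limit=$1; shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK - probe succeeded"
        return 0
    elif [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the command ran out of time.
        echo "CRITICAL - probe unresponsive after ${limit}s"
        return 2
    else
        echo "CRITICAL - probe failed (exit code $rc)"
        return 2
    fi
}
```

A call could look like check_probe 120 gfal-ls srm://t3se01.psi.ch/pnfs/psi.ch/cms/trivcat/store/user, where the srm:// URL form is assumed by analogy with the other protocols listed on this page.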

dCache admin shell

To mitigate these SRM errors, Fabio set the following limits on the SRM operations:
[root@t3se01 ~]# grep srm  /etc/dcache/layouts/t3se01.conf | egrep "=[0-9]+"  
srm.request.max-requests=400
srm.request.put.max-requests=100
srm.request.get.max-inprogress=100
srm.request.copy.max-inprogress=100
srm.request.max-transfers=100
srm.limits.ls.entries=10000
These limits can be observed at run time with:
[t3dcachedb03] (local) admin > \c SRM-t3se01
[t3dcachedb03] (SRM-t3se01@t3se01-Domain-srm) admin > info -l 
--- config (SRM configuration) ---
	"defaultSpaceLifetime"  request lifetime: 86400000
	"get"  request lifetime: 14400000
	"bringOnline"  request lifetime: 14400000
	"put"  request lifetime: 14400000
	"copy" request lifetime: 14400000
	debug=true
	gsissl=true
	gridftp buffer_size=1048576
	gridftp tcp_buffer_size=1048576
	gridftp parallel_streams=10
	gsiftpclinet=globus-url-copy
	urlcopy=../scripts/urlcopy.sh
	srm_root=/
	timeout_script=../scripts/timeout.sh
	urlcopy timeout in seconds=3600
	proxies directory=../proxies
	port=8443
	srmHost=t3se01.psi.ch
	localSrmHosts=t3se01.psi.ch, 
	useUrlcopyScript=false
	useGsiftpForSrmCopy=true
	useHttpForSrmCopy=true
	useDcapForSrmCopy=false
	useFtpForSrmCopy=true
		 *** GetRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** BringOnlineRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** LsRequests Parameters **
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** PutRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		 switch to async mode delay=1000
		 *** ReserveSpaceRequests Parameters **
		 request Lifetime in milliseconds =86400000
		 max poll period in milliseconds =60000
		 *** CopyRequests Parameters **
		 request Lifetime in milliseconds =14400000
		 max poll period in milliseconds =60000
		*** Bring Online Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Get Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Ls Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Reserve Space Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Copy Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
		*** Put Store Parameters ***
		databaseEnabled=true
		storeCompletedRequestsOnly=true
		requestHistoryDatabaseEnabled=true
		cleanPendingRequestsOnRestart=false
		keepRequestHistoryPeriod=10 days
		expiredRequestRemovalPeriod=600 seconds
	storage_info_update_period=30000
	qosPluginClass=null
	qosConfigFile=null
	clientDNSLookup=false
	clientTransport=GSI

--- gridsite-credential-service (GridSite delegation service providing delegated credentials to other dCache services) ---

--- lb (Registers the door with a LoginBroker) ---
    LoginBroker      : LoginBrokerTopic@local
    Protocol Family  : srm
    Protocol Version : 1.1.1
    Port             : 8443
    Addresses        : [fe80:0:0:0:250:56ff:fe95:14dd/fe80:0:0:0:250:56ff:fe95:14dd, t3se01.psi.ch/192.33.123.24]
    Tags             : [glue, srm]
    Root             : /
    Read paths       : [/]
    Write paths      : [/]
    Update Time      : 5 SECONDS
    Update Threshold : 10 %
    Last event       : NOROUTE

--- login-strategy (Caching gPlazma client) ---
gPlazma login cache: CacheStats{hitCount=53808, missCount=1046, loadSuccessCount=976, loadExceptionCount=0, totalLoadTime=38321447732, evictionCount=956}
gPlazma map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
gPlazma reverse map cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}

--- scheduler-bringonline (Scheduler for BRING-ONLINE operations) ---
    Queued .........................   0     [Queued]
    In progress (max 10000) ........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-copy (Scheduler for COPY operations) ---
    Queued .........................   0     [Queued]
    In progress (max 100) ..........  33     [InProgress]
    -------------------------------------
    Total requests (max 400) .......  33

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-get (Scheduler for GET operations) ---
    Queued .........................   0     [Queued]
    In progress (max 100) ..........   0     [InProgress]
    Queued for transfer ............   0     [RQueued]
    Waiting for transfer (max 100) .   4     [Ready]
    -------------------------------------
    Total requests (max 400) .......   4

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-ls (Scheduler for LS operations) ---
    Queued .........................   0     [Queued]
    In progress (max 50) ...........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-put (Scheduler for PUT operations) ---
    Queued .........................   0     [Queued]
    In progress (max 50) ...........   0     [InProgress]
    Queued for transfer ............   0     [RQueued]
    Waiting for transfer (max 100) .  15     [Ready]
    -------------------------------------
    Total requests (max 100) .......  15

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- scheduler-reserve-space (Scheduler for RESERVE-SPACE operations) ---
    Queued .........................   0     [Queued]
    In progress (max 10) ...........   0     [InProgress]
    -------------------------------------
    Total requests (max 400) .......   0

    Scheduling strategy             : inprogress-fair-share
    Transfer strategy               : fair-share

--- storage (dCache plugin for SRM) ---
Custom reverse DNS lookup cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space token by owner cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
Space by token cache: CacheStats{hitCount=0, missCount=0, loadSuccessCount=0, loadExceptionCount=0, totalLoadTime=0, evictionCount=0}
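
To see at a glance how close each scheduler is to its configured limit, the info -l output above can be saved to a file and summarised. The helper below is a sketch under that assumption; the name scheduler_usage is made up, and it only reads the "Total requests (max N)" line of each scheduler section, in the format shown above.

```shell
# Report how full each SRM scheduler is, given `info -l` output on stdin.
# Lines of interest look like:
#   --- scheduler-copy (Scheduler for COPY operations) ---
#       Total requests (max 400) .......  33
scheduler_usage() {
    awk '
        /^--- scheduler-/ { name = $2 }      # remember the current section
        /Total requests \(max [0-9]+\)/ {
            max = $0
            sub(/.*\(max /, "", max)         # strip everything up to the limit
            sub(/\).*/, "", max)             # strip everything after it
            printf "%s %d/%d\n", name, $NF, max
        }'
}

# Usage, assuming the output was saved to a file named srm-info.txt:
#   scheduler_usage < srm-info.txt
```

For the output shown above this yields one line per scheduler, for example "scheduler-copy 33/400" and "scheduler-put 15/100".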
IssueForm

Affected Service: SRM
Symptom summary: SRM calls issued to t3se01.psi.ch fail or become unresponsive
Reason Understood: yes
Solution Exists: workaround
Obsolete: no
Topic revision: r2 - 2016-07-23 - FabioMartinelli
 