
Arrow left Go to previous page / next page of Tier3 site log MOVED TO...

04. 06. 2013 t3dcachedb03 frozen again

Early this morning t3dcachedb03 hung for ~20 minutes. According to the dCache logs SRM recovered, but in fact it did not: the issue was only fixed when a T3 admin restarted the SRM cell on t3se01.

probable cause

After a talk with Peter, the likely culprit is the nightly snapshots taken automatically by the NetApp used as the back end for the VMware platform. These snapshots are taken per volume, not per VM, so we can't tune them for a single VM.

Fabio's e-mail to Peter (VMware manager)

Ciao Peter
I've checked our T3 Nagios; it reports that t3dcachedb03 went down (it probably froze) during these recent time intervals:
2nd June - 01:22 - 01:27 =>  ~5 mins
4th June - 02:58 - 03:19 => ~20 mins
5th June - 01:55 - 02:14 => ~20 mins
Our T3 Nagios tries the ping command 10 times before declaring the pinged server 'down'.
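
As a quick check of the durations quoted in the e-mail, the intervals can be recomputed from the Nagios timestamps. The following is only a minimal Python sketch and not part of the original e-mail: the names OUTAGES, FMT and outage_minutes are made up for illustration, and the year 2013 is assumed from the dates in the log excerpt.

```python
from datetime import datetime

# Outage windows as reported by the T3 Nagios (copied from the e-mail).
# The year is an assumption taken from the log dates on this page.
OUTAGES = [
    ("2013-06-02 01:22", "2013-06-02 01:27"),
    ("2013-06-04 02:58", "2013-06-04 03:19"),
    ("2013-06-05 01:55", "2013-06-05 02:14"),
]

FMT = "%Y-%m-%d %H:%M"


def outage_minutes(start, end):
    """Return the length of one outage window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for start, end in OUTAGES:
        print(f"{start} -> {end}: {outage_minutes(start, end)} min")
```

All three windows fall between roughly 01:00 and 03:30 at night, which is consistent with a nightly scheduled job, although the timestamps alone do not prove the snapshot hypothesis.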

t3se01-Domain-srm.log relevant logs

04 Jun 2013 03:01:15 (SRM-t3se01) [v1:srmget:47318872] org.postgresql.util.PSQLException: The connection attempt failed.
04 Jun 2013 03:01:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:01:30 (SRM-t3se01) [v1:srmget:47318872] SRM Authorization failed: {uoid=<1370307690381:5307988>;path=[>gPlazma@local];msg=Tunnel cell >gPlazma@local< not found at >dCacheDomain<}
04 Jun 2013 03:01:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:01:53 (SrmSpaceManager) [] expireSpaceReservations failed with An I/O error occured while sending to the backend.
04 Jun 2013 03:01:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:12 (SrmSpaceManager) [] An I/O error occured while sending to the backend.

04 Jun 2013 03:02:12 (SrmSpaceManager) [] Failed to insert Link Group = ops-linkGroup This connection has been closed.  <----------------------

04 Jun 2013 03:02:12 (SrmSpaceManager) [] update failed with This connection has been closed.
04 Jun 2013 03:02:12 (SrmSpaceManager) [] update of linkGroup ops-linkGroup failed with exception: This connection has been closed.
04 Jun 2013 03:02:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:36 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:44 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:15 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:04:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:59 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates

04 Jun 2013 03:05:37 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.   <------

04 Jun 2013 03:05:55 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:24 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:37 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:07:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:09:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates

04 Jun 2013 03:09:58 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5   <------------------------------

04 Jun 2013 03:10:07 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:10:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:10:25 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:10:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:34 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:10:43 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0

04 Jun 2013 03:10:50 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001   <---------

04 Jun 2013 03:10:50 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:10:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:59 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:16 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:37 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:46 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:11:56 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:11:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:11:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates

04 Jun 2013 03:12:03 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001   <---------

04 Jun 2013 03:12:03 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:06 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:19 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:12:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:31 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:12:41 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:51 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:27 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:35 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:44 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:57 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:05 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:14 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:14:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:14:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:24 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:14:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:14:30 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1

04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB   <-------------
04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB

04 Jun 2013 03:14:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:43 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:15:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:16:23 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
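
The excerpt above shows errors continuing after the "Successfully re-established connection to DB" message, which matches the observation that SRM did not really recover. A minimal Python sketch for checking this on a log file is given below. It rests on several assumptions: every line follows the "date (cell) [context] message" layout seen in the excerpt, the list of error substrings is hand-picked from this excerpt and is not a complete list of dCache errors, and the month abbreviation parses under an English or C locale. The function names are hypothetical.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Layout assumed from the excerpt: "04 Jun 2013 03:01:15 (cell) [context] message"
LINE_RE = re.compile(
    r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) \(([^)]+)\) \[[^\]]*\] (.*)$"
)

# Substrings hand-picked from the excerpt; not an exhaustive error list.
ERROR_PATTERNS = (
    "PSQLException",
    "EOFException",
    "I/O error",
    "connection has been closed",
    "Failed to acquire connection",
    "Database access problem",
)
RECOVERY = "Successfully re-established connection to DB"


def parse_line(line):
    """Split one log line into (timestamp, cell, message), or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%d %b %Y %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def summarize(lines):
    """Find the first error, the first recovery message, and later errors."""
    per_cell = Counter()
    first_error = None
    recovered = None
    errors_after_recovery = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # blank lines or lines in another layout
        stamp, cell, message = parsed
        if any(pattern in message for pattern in ERROR_PATTERNS):
            per_cell[cell] += 1
            if first_error is None:
                first_error = stamp
            if recovered is not None:
                errors_after_recovery += 1
        if RECOVERY in message and recovered is None:
            recovered = stamp
    return {
        "first_error": first_error,
        "recovered": recovered,
        "errors_after_recovery": errors_after_recovery,
        "errors_per_cell": per_cell,
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # The log file path is passed on the command line.
    with open(sys.argv[1]) as handle:
        print(summarize(handle))
```

A non-zero errors_after_recovery value means the recovery message alone is not a reliable sign that SRM is healthy again, so a functional test against SRM is still needed.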


Topic revision: r2 - 2013-06-05 - FabioMartinelli
 