<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->
%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 04. 06. 2013 t3dcachedb03 frozen again

Early this morning t3dcachedb03 hung for ~20 minutes. According to the dCache logs SRM recovered on its own, but it actually *did not*: the issue was fixed only when a T3 admin restarted the SRM cell on =t3se01=.

---++ probable cause

After a discussion with Peter, the likely culprit is the *nightly snapshots* taken automatically by the NetApp that serves as the back end for the VMware platform. These snapshots are taken per volume, not per VM, so we cannot tune them for a single VM.

---++ Fabio's e-mail to Peter (VMware Manager)

<pre>
Ciao Peter

I've checked our T3 Nagios, it reports that t3dcachedb03 went down ( probably it frozen ) during these recent time intervals:

2nd June - 01:22 - 01:27 => ~5 mins
4th June - 02:58 - 03:19 => ~20 mins
5th June - 01:55 - 02:14 => ~20 mins

our T3 Nagios tries 10 times the ping command before to declare the server pinged 'down'.
</pre>

---++ t3se01-Domain-srm.log relevant logs

<pre>
04 Jun 2013 03:01:15 (SRM-t3se01) [v1:srmget:47318872] org.postgresql.util.PSQLException: The connection attempt failed.
04 Jun 2013 03:01:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:01:30 (SRM-t3se01) [v1:srmget:47318872] SRM Authorization failed: {uoid=<1370307690381:5307988>;path=[>gPlazma@local];msg=Tunnel cell >gPlazma@local< not found at >dCacheDomain<}
04 Jun 2013 03:01:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:01:53 (SrmSpaceManager) [] expireSpaceReservations failed with An I/O error occured while sending to the backend.
04 Jun 2013 03:01:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:12 (SrmSpaceManager) [] An I/O error occured while sending to the backend.
04 Jun 2013 03:02:12 (SrmSpaceManager) [] Failed to insert Link Group = ops-linkGroup This connection has been closed. <----------------------
04 Jun 2013 03:02:12 (SrmSpaceManager) [] update failed with This connection has been closed.
04 Jun 2013 03:02:12 (SrmSpaceManager) [] update of linkGroup ops-linkGroup failed with exception: This connection has been closed.
04 Jun 2013 03:02:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:36 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:44 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:15 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:04:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:59 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:05:37 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend. <------
04 Jun 2013 03:05:55 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:24 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:37 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:07:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:09:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:58 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5 <------------------------------
04 Jun 2013 03:10:07 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:10:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:10:25 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:10:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:34 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:10:43 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:10:50 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001 <---------
04 Jun 2013 03:10:50 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:10:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:59 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:16 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:37 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:46 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:11:56 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:11:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:11:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:12:03 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001 <---------
04 Jun 2013 03:12:03 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:06 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:19 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:12:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:31 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:12:41 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:51 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:27 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:35 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:44 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:57 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:05 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:14 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:14:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:14:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:24 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:14:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:14:30 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB <-------------
04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB
04 Jun 2013 03:14:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:43 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:15:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:16:23 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
</pre>

----------------

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%
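
To compare what the SRM log claims with the Nagios intervals, the excerpt above can be scanned for the first failed database connection and the first "re-established" message. The following is only a minimal sketch: the helper names =parse_line= and =db_outage_window= are hypothetical, the line layout is inferred from this excerpt alone, and the month abbreviation is parsed assuming an English/C locale. Note that the resulting window reflects only what the log reports; as described above, SRM still needed a manual restart.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt: "DD Mon YYYY HH:MM:SS (cell) [context] message".
# This pattern is an assumption, not a documented dCache log format.
LINE_RE = re.compile(
    r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) \(([^)]+)\) \[([^\]]*)\] (.*)$"
)


def parse_line(line):
    """Split one log line into (timestamp, cell, context, message), or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, cell, context, message = match.groups()
    # %b expects English month abbreviations under the default C locale.
    when = datetime.strptime(stamp, "%d %b %Y %H:%M:%S")
    return when, cell, context, message


def db_outage_window(lines):
    """Return (first_failure, first_recovery) as reported in the log lines."""
    first_failure = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that does not look like a log entry
        when, _cell, _context, message = parsed
        if first_failure is None:
            if "connection attempt failed" in message:
                first_failure = when
        elif "re-established connection to DB" in message:
            return first_failure, when
    return first_failure, None


# Three lines copied from the excerpt above.
sample = [
    "04 Jun 2013 03:01:15 (SRM-t3se01) [v1:srmget:47318872] "
    "org.postgresql.util.PSQLException: The connection attempt failed.",
    "04 Jun 2013 03:10:50 (SrmSpaceManager) [] Database access problem. "
    "Killing off all remaining connections in the connection pool. SQL State = 08001",
    "04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB",
]

start, end = db_outage_window(sample)
print(end - start)  # 0:13:40
```

On these lines the log-reported window runs from 03:01:15 to 03:14:55, which falls inside the 02:58 - 03:19 interval that Nagios recorded for 4th June.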
Topic revision: r2 - 2013-06-05 - FabioMartinelli