<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%

---+ 04. 06. 2013 t3dcachedb03 frozen again

Early this morning t3dcachedb03 hung for ~20 minutes. According to the dCache logs the SRM recovered afterwards, but it actually *did not*: only a T3 admin intervention, restarting the SRM cell on =t3se01=, fixed the issue.

---++ Probable cause

After a talk with Peter, the likely culprits are the *nightly snapshots* automatically taken by the NetApp used as the back end for the VMware platform. These snapshots are taken per volume, not per VM, so we can't tune them for a single VM.

---++ Fabio's e-mail to Peter (VMware manager)

<pre>
Ciao Peter

I've checked our T3 Nagios; it reports that t3dcachedb03 went down (it probably froze) during these recent time intervals:

2nd June - 01:22 - 01:27 => ~5 mins
4th June - 02:58 - 03:19 => ~20 mins
5th June - 01:55 - 02:14 => ~20 mins

Our T3 Nagios tries the ping command 10 times before declaring the pinged server 'down'.
</pre>

---++ t3se01-Domain-srm.log relevant logs

<pre>
04 Jun 2013 03:01:15 (SRM-t3se01) [v1:srmget:47318872] org.postgresql.util.PSQLException: The connection attempt failed.
04 Jun 2013 03:01:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:01:30 (SRM-t3se01) [v1:srmget:47318872] SRM Authorization failed: {uoid=<1370307690381:5307988>;path=[>gPlazma@local];msg=Tunnel cell >gPlazma@local< not found at >dCacheDomain<}
04 Jun 2013 03:01:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:01:53 (SrmSpaceManager) [] expireSpaceReservations failed with An I/O error occured while sending to the backend.
04 Jun 2013 03:01:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:12 (SrmSpaceManager) [] An I/O error occured while sending to the backend.
04 Jun 2013 03:02:12 (SrmSpaceManager) [] Failed to insert Link Group = ops-linkGroup This connection has been closed. <----------------------
04 Jun 2013 03:02:12 (SrmSpaceManager) [] update failed with This connection has been closed.
04 Jun 2013 03:02:12 (SrmSpaceManager) [] update of linkGroup ops-linkGroup failed with exception: This connection has been closed.
04 Jun 2013 03:02:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:02:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:03:36 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:44 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:03:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:15 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:04:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:04:59 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:05:37 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend. <------
04 Jun 2013 03:05:55 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:05:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:17 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:24 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:06:37 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:06:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:07:18 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:07:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:07:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:08:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:09:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:09:58 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5 <------------------------------
04 Jun 2013 03:10:07 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:10:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:10:25 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:10:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:34 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:10:43 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:10:50 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001 <---------
04 Jun 2013 03:10:50 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:10:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:10:59 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:07 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:11:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:16 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:11:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:11:37 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:11:46 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:11:56 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:11:56 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:11:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:12:03 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001 <---------
04 Jun 2013 03:12:03 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:06 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:16 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:19 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:12:23 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:12:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:12:28 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:31 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:12:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:12:41 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:12:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:12:51 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:12:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:12:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:08 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:18 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:27 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:13:28 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:13:35 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:13:38 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:13:44 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:46 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:13:47 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:13:54 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:13:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:13:57 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:01 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:05 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:11 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:14 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:14:16 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:14:21 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:24 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 0
04 Jun 2013 03:14:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:14:30 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 3
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Database access problem. Killing off all remaining connections in the connection pool. SQL State = 08001
04 Jun 2013 03:14:31 (SrmSpaceManager) [] Error in trying to obtain a connection. Retrying in 7000ms
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 2
04 Jun 2013 03:14:39 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 5
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 4
04 Jun 2013 03:14:48 (SrmSpaceManager) [] Failed to acquire connection. Sleeping for 7000ms. Attempts left: 1
04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB <-------------
04 Jun 2013 03:14:55 (SrmSpaceManager) [] Successfully re-established connection to DB
04 Jun 2013 03:14:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:06 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:26 (SRM-t3se01) [] java.io.EOFException
04 Jun 2013 03:15:27 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:15:43 (SrmSpaceManager) [info GetSpaceTokens] An I/O error occured while sending to the backend.
04 Jun 2013 03:15:57 (SRM-t3se01) [v1:srmdelete:91498978] Initializing CA certificate store from directory: /etc/grid-security/certificates
04 Jun 2013 03:16:23 (SrmSpaceManager) [info GetLinkGroups] An I/O error occured while sending to the backend.
</pre>

----------------

%ICON{arrowleft}% Go to [[CMSTier3LogXX][previous page]] / [[CMSTier3LogXX][next page]] of Tier3 site log %M%
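
The excerpt shows =SrmSpaceManager= reporting a re-established DB connection at 03:14:55 while I/O errors towards the backend still appear at 03:15:43 and 03:16:23. A minimal sketch of how such a check could be scripted is given below. It is not an existing T3 tool: the function names, the regular expression and the list of error markers are assumptions derived only from the messages quoted in the excerpt, and it assumes the log file holds one entry per line.

```python
import re
from datetime import datetime

# Assumed entry layout: "04 Jun 2013 03:01:15 (Cell) [context] message"
ENTRY = re.compile(
    r"^(\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) \(([^)]+)\) \[([^\]]*)\] (.*)$"
)

# Messages treated as signs of a lost DB connection (copied from the excerpt).
DB_ERROR_MARKERS = (
    "The connection attempt failed",
    "An I/O error occured while sending to the backend",
    "This connection has been closed",
    "Failed to acquire connection",
    "Database access problem",
)
RECOVERY_MARKER = "Successfully re-established connection to DB"


def parse_entry(line):
    """Return (timestamp, cell, message), or None if the line does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    stamp, cell, _context, message = match.groups()
    return datetime.strptime(stamp, "%d %b %Y %H:%M:%S"), cell, message


def summarize_db_outage(lines):
    """Return (first DB error, first recovery message, DB errors after recovery)."""
    first_error = None
    first_recovery = None
    errors_after_recovery = 0
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        when, _cell, message = entry
        if RECOVERY_MARKER in message:
            if first_recovery is None:
                first_recovery = when
        elif any(marker in message for marker in DB_ERROR_MARKERS):
            if first_error is None:
                first_error = when
            if first_recovery is not None:
                errors_after_recovery += 1
    return first_error, first_recovery, errors_after_recovery
```

Calling =summarize_db_outage(open(path))= on a copy of the log (=path= being wherever that copy is stored) returns the three values; a non-zero third value suggests that the recovery message alone should not be trusted and that the SRM cell may still need a restart.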
Topic revision: r2 - 2013-06-05 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.