<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log35][previous page]] / [[CMSTier3Log37][next page]] of Tier3 site log %M%

---+ 26. 12. 2012 t3fs14 reboot on Dec 25th

On Dec 25th at around 9 a.m. CET, Nagios warned that the host =t3fs14= was offline. Other checks confirm that the host was unavailable for some minutes. The OS log files show nothing suspicious:

<pre>
Dec 25 07:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6796', processed='center(received)=3399', processed='destination(d_loghost)=3399', processed='destination(d_boot)=0', processed='destination(d_auth)=2080', processed='destination(d_cron)=1259', processed='destination(d_mlal)=0', processed='destination(d_mesg)=54', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3399', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 08:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6924', processed='center(received)=3463', processed='destination(d_loghost)=3463', processed='destination(d_boot)=0', processed='destination(d_auth)=2120', processed='destination(d_cron)=1282', processed='destination(d_mlal)=0', processed='destination(d_mesg)=55', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3463', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpuset
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpu
Dec 25 09:07:32 t3fs14.psi.ch kernel: Command line: ro root=UUID=a247aed2-7b16-4306-9485-2adc3f62a6da rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=ttyS1,115200 elevator=noop irqpoll nr_cpus=1 reset_devices cgroup_disable=memory memmap=exactmap memmap=631K@4K memmap=134517K@49783K elfcorehdr=184300K memmap=4K$0K memmap=5K$635K memmap=64K$960K memmap=52K#3659964K memmap=75532K$3660020K memmap=2112K$4173824K memmap=8192K$4186112K
Dec 25 09:07:32 t3fs14.psi.ch kernel: KERNEL supported cpus:
</pre>

The low-level IPMI event log gives no clues either; its most recent entry dates from February 2012:

<pre>
[root@t3admin01 ~]# ipmitool -I lanplus -H rmfs14 -U root -f /root/private/ipmi-pw sel elist
   1 | 05/05/2011 | 08:53:33 | Power Supply Power Supply 2 | Failure detected | Asserted
   2 | 06/22/2011 | 15:33:17 | Power Supply Power Supply 1 | Failure detected | Asserted
   3 | 06/22/2011 | 15:36:55 | Power Supply Power Supply 1 | Failure detected | Asserted
   4 | 02/06/2012 | 15:21:30 | Power Supply Power Supply 2 | Failure detected | Asserted
</pre>

However, connecting to the HP service processor ( =ssh rmfs14= ) shows:

<pre>
</>hpiLO-> show /system1/log1/record15

status=0
status_tag=COMMAND COMPLETED
Wed Dec 26 14:12:30 2012

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Critical
    date=12/25/2012
    time=09:16
    description=ASR Detected by System ROM    <---------
  Verbs
    cd version exit show set
</pre>

This points to outdated [[http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/quot-ASR-Detected-by-System-ROM-quot-error-message-in-HP/td-p/4732825][RAID controller firmware]], or possibly to a broken CPU or a bug in the HP iLO3. For the time being I have:
   * updated the [[NodeTypedCachet3fs13t3fs14#HW_Raid_controller][RAID controller firmware]]; a server reboot is still needed, and I'll do it during the scheduled PSI downtime in January 2013.
   * updated the [[HPProLiantDL380G7ILO3#Firmware_update][HP iLO3]], which rebooted automatically.
   * downloaded the latest RDAC from NetApp; it still has to be compiled.
   * updated the Linux kernel to get a more recent RAID controller Linux driver; this needs a reboot, an [[NodeTypedCachet3fs13t3fs14#LSI_RDAC_Redundant_Dual_Active_C][RDAC]] compilation and another reboot, which I'll also do during the scheduled PSI downtime in January 2013.

After the automatic reboot everything seemed to work except the dCache pools (pools unavailable, automatic checks failing). After I checked the mounted file systems, issued =/opt/d-cache/bin/dcache restart= and waited 10 minutes, the SE operations went back to normal.

-- Main.DanielMeister - 2012-12-26

----------------

%ICON{arrowleft}% Go to [[CMSTier3Log35][previous page]] / [[CMSTier3Log37][next page]] of Tier3 site log %M%
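
As a side note, the manual look through the =ipmitool ... sel elist= listing in this entry could be scripted. The Python sketch below is only an illustration, not a tool used at the site: the names =SelEvent=, =parse_sel_line= and =events_near= are hypothetical, the sample data is the SEL listing quoted in this entry, and it assumes every line carries the six pipe-separated fields seen there. The reboot time is taken from the first kernel message in the syslog excerpt.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class SelEvent(NamedTuple):
    """One entry of the IPMI System Event Log (hypothetical helper type)."""
    record_id: str
    timestamp: datetime
    sensor: str
    description: str
    state: str


def parse_sel_line(line: str) -> SelEvent:
    """Parse one pipe-separated line as shown in the SEL listing above."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"unexpected SEL line: {line!r}")
    record_id, date, time, sensor, description, state = fields
    # The listing uses month/day/year dates, e.g. 06/22/2011.
    timestamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return SelEvent(record_id, timestamp, sensor, description, state)


def events_near(events: List[SelEvent], when: datetime,
                window: timedelta) -> List[SelEvent]:
    """Return the events whose timestamp lies within `window` of `when`."""
    return [event for event in events if abs(event.timestamp - when) <= window]


# Sample data: the four SEL entries quoted in this log entry.
SEL_OUTPUT = """\
   1 | 05/05/2011 | 08:53:33 | Power Supply Power Supply 2 | Failure detected | Asserted
   2 | 06/22/2011 | 15:33:17 | Power Supply Power Supply 1 | Failure detected | Asserted
   3 | 06/22/2011 | 15:36:55 | Power Supply Power Supply 1 | Failure detected | Asserted
   4 | 02/06/2012 | 15:21:30 | Power Supply Power Supply 2 | Failure detected | Asserted
"""

events = [parse_sel_line(line) for line in SEL_OUTPUT.splitlines() if line.strip()]
reboot = datetime(2012, 12, 25, 9, 7, 32)  # first kernel message after the reboot
nearby = events_near(events, reboot, timedelta(days=1))

print(f"{len(events)} SEL entries, {len(nearby)} within one day of the reboot")
print("latest entry:", max(event.timestamp for event in events))
```

With the listing from this entry the sketch reports four SEL entries, none of them within one day of the reboot, and 2012-02-06 15:21:30 as the latest one, which matches the manual reading that the IPMI log holds no clue about the incident.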
Topic revision: r6 - 2016-11-04 - FabioMartinelli
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.