Go to
previous page /
next page of Tier3 site log
26. 12. 2012 t3fs14 reboot on Dec 25th
On Dec 25th around 9 a.m. CET Nagios warned that the host t3fs14 was offline. Other checks confirm that the host was unavailable for some minutes. The OS log files show nothing suspicious before the reboot:
Dec 25 07:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6796', processed='center(received)=3399', processed='destination(d_loghost)=3399', processed='destination(d_boot)=0', processed='destination(d_auth)=2080', processed='destination(d_cron)=1259', processed='destination(d_mlal)=0', processed='destination(d_mesg)=54', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3399', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 08:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6924', processed='center(received)=3463', processed='destination(d_loghost)=3463', processed='destination(d_boot)=0', processed='destination(d_auth)=2120', processed='destination(d_cron)=1282', processed='destination(d_mlal)=0', processed='destination(d_mesg)=55', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3463', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpuset
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpu
Dec 25 09:07:32 t3fs14.psi.ch kernel: Command line: ro root=UUID=a247aed2-7b16-4306-9485-2adc3f62a6da rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=ttyS1,115200 elevator=noop irqpoll nr_cpus=1 reset_devices cgroup_disable=memory memmap=exactmap memmap=631K@4K memmap=134517K@49783K elfcorehdr=184300K memmap=4K$0K memmap=5K$635K memmap=64K$960K memmap=52K#3659964K memmap=75532K$3660020K memmap=2112K$4173824K memmap=8192K$4186112K
Dec 25 09:07:32 t3fs14.psi.ch kernel: KERNEL supported cpus:
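To put a number on the silent period in an excerpt like the one above, a small script can look for the first kernel boot line and compare its timestamp with the line just before it. This is only a sketch: the function names are mine, the boot marker string is taken from the excerpt, and the year has to be supplied because classic syslog timestamps do not carry one. Since the last message before the boot is an hourly syslog-ng statistics line, the result is an upper bound on the outage, not its actual length.

```python
from datetime import datetime

# First kernel message of a fresh boot, as seen in the excerpt above.
BOOT_MARKER = "kernel: Initializing cgroup subsys cpuset"


def parse_ts(line, year):
    """Parse the leading 'Dec 25 09:07:32' syslog timestamp of a line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def silent_gap_before_boot(lines, year):
    """Return (last_seen, boot_time) for the first boot marker, or None."""
    previous = None
    for line in lines:
        if BOOT_MARKER in line:
            if previous is None:
                return None  # the log starts with the boot, nothing to compare
            return parse_ts(previous, year), parse_ts(line, year)
        previous = line
    return None


# Two lines from the excerpt; the statistics payload is shortened here.
sample = [
    "Dec 25 08:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; (shortened)",
    "Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpuset",
]

last_seen, boot = silent_gap_before_boot(sample, 2012)
print("last message:", last_seen, "| boot:", boot, "| gap:", boot - last_seen)
```

For the two sample lines the gap is a bit over 55 minutes, which only says that nothing was logged between the last hourly statistics line and the boot.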
There are no clues in the low-level logs (the IPMI system event log) either:
[root@t3admin01 ~]# ipmitool -I lanplus -H rmfs14 -U root -f /root/private/ipmi-pw sel elist
1 | 05/05/2011 | 08:53:33 | Power Supply Power Supply 2 | Failure detected | Asserted
2 | 06/22/2011 | 15:33:17 | Power Supply Power Supply 1 | Failure detected | Asserted
3 | 06/22/2011 | 15:36:55 | Power Supply Power Supply 1 | Failure detected | Asserted
4 | 02/06/2012 | 15:21:30 | Power Supply Power Supply 2 | Failure detected | Asserted
However, connecting to the HP service processor (ssh rmfs14) shows:
</>hpiLO-> show /system1/log1/record15
status=0
status_tag=COMMAND COMPLETED
Wed Dec 26 14:12:30 2012
/system1/log1/record15
Targets
Properties
number=15
severity=Critical
date=12/25/2012
time=09:16
description=ASR Detected by System ROM <---------
Verbs
cd version exit show set
ASR is HP's Automatic Server Recovery, so the server was reset by the hardware. That points me to outdated RAID controller firmware, or maybe a faulty CPU or a bug in the HP iLO3.
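Going through the iLO log record by record by hand is tedious, so the `show` output can be turned into a dictionary and filtered on severity. The following is a minimal sketch, assuming the output has already been captured as text (for example from an SSH session); `parse_ilo_record` is a name I made up, and the sample is the record shown above without my arrow annotation.

```python
def parse_ilo_record(text):
    """Collect the key=value lines of an iLO `show` output into a dict."""
    record = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "=" in line:
            key, _, value = line.partition("=")
            record[key.strip()] = value.strip()
    return record


# Output of `show /system1/log1/record15`, copied from the session above.
sample = """\
status=0
status_tag=COMMAND COMPLETED
Wed Dec 26 14:12:30 2012
/system1/log1/record15
Targets
Properties
number=15
severity=Critical
date=12/25/2012
time=09:16
description=ASR Detected by System ROM
Verbs
cd version exit show set
"""

record = parse_ilo_record(sample)
if record.get("severity") == "Critical":
    print(record["date"], record["time"], record["description"])
```

Lines without an `=` sign (the timestamp, the target path, the section names) are simply skipped, so the same function can be applied to every record of the log.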
For the time being I have:
- updated the RAID controller firmware; a server reboot is still needed to activate it, and I'll do that during the scheduled PSI downtime in January 2013.
- updated the HP iLO3 firmware; the iLO rebooted automatically.
- downloaded the latest RDAC from NetApp; it still has to be compiled.
- updated the Linux kernel to get a more recent RAID controller driver; this needs a reboot, the RDAC compilation and a second reboot, which I'll also do during the scheduled PSI downtime in January 2013.
After the automatic reboot everything seemed to work fine except the dCache pools (pools unavailable, automatic checks failing).
I checked the mounted file systems, issued /opt/d-cache/bin/dcache restart and waited 10 minutes; after that the SE operations went back to normal.
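The "check the mounted file systems first" step can be scripted so that dCache is only restarted once all pool file systems are really there. This is just a sketch under assumptions: the mount points `/data01` and `/data02` and the device names in the sample are hypothetical placeholders, not the real t3fs14 layout, and the helper names are mine.

```python
def mounted_points(proc_mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    points = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def missing_mounts(required, proc_mounts_text):
    """Return the required mount points that are not currently mounted."""
    present = mounted_points(proc_mounts_text)
    return [point for point in required if point not in present]


# Hypothetical pool file systems; replace with the real ones.
REQUIRED = ["/data01", "/data02"]

# Hypothetical /proc/mounts content in which /data02 did not come up.
sample = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/mapper/mpatha /data01 xfs rw,noatime 0 0
"""

print(missing_mounts(REQUIRED, sample))  # → ['/data02']
```

On the server itself the text would come from reading `/proc/mounts`, and `/opt/d-cache/bin/dcache restart` would only be issued when the returned list is empty.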
--
DanielMeister - 2012-12-26