<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log35][previous page]] / [[CMSTier3Log37][next page]] of Tier3 site log %M%

---+ 26. 12. 2012 t3fs14 reboot on Dec 25th

On Dec 25th around 9 a.m. CET Nagios reported the host =t3fs14= as offline; other checks confirm that the host was unavailable for some minutes. The OS log files show nothing suspicious:

<pre>
Dec 25 07:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6796', processed='center(received)=3399', processed='destination(d_loghost)=3399', processed='destination(d_boot)=0', processed='destination(d_auth)=2080', processed='destination(d_cron)=1259', processed='destination(d_mlal)=0', processed='destination(d_mesg)=54', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3399', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 08:12:07 t3fs14.psi.ch syslog-ng[2520]: Log statistics; dropped='tcp(t3service01.psi.ch:1514)=0', processed='center(queued)=6924', processed='center(received)=3463', processed='destination(d_loghost)=3463', processed='destination(d_boot)=0', processed='destination(d_auth)=2120', processed='destination(d_cron)=1282', processed='destination(d_mlal)=0', processed='destination(d_mesg)=55', processed='destination(d_cons)=0', processed='destination(d_spol)=0', processed='destination(d_mail)=4', processed='source(s_local)=3463', suppressed='tcp(t3service01.psi.ch:1514)=0'
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpuset
Dec 25 09:07:32 t3fs14.psi.ch kernel: Initializing cgroup subsys cpu
Dec 25 09:07:32 t3fs14.psi.ch kernel: Command line: ro root=UUID=a247aed2-7b16-4306-9485-2adc3f62a6da rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us console=ttyS1,115200 elevator=noop irqpoll nr_cpus=1 reset_devices cgroup_disable=memory memmap=exactmap memmap=631K@4K memmap=134517K@49783K elfcorehdr=184300K memmap=4K$0K memmap=5K$635K memmap=64K$960K memmap=52K#3659964K memmap=75532K$3660020K memmap=2112K$4173824K memmap=8192K$4186112K
Dec 25 09:07:32 t3fs14.psi.ch kernel: KERNEL supported cpus:
</pre>

The low-level IPMI event log gives no clues either:

<pre>
[root@t3admin01 ~]# ipmitool -I lanplus -H rmfs14 -U root -f /root/private/ipmi-pw sel elist
   1 | 05/05/2011 | 08:53:33 | Power Supply Power Supply 2 | Failure detected | Asserted
   2 | 06/22/2011 | 15:33:17 | Power Supply Power Supply 1 | Failure detected | Asserted
   3 | 06/22/2011 | 15:36:55 | Power Supply Power Supply 1 | Failure detected | Asserted
   4 | 02/06/2012 | 15:21:30 | Power Supply Power Supply 2 | Failure detected | Asserted
</pre>

However, by connecting to the HP service processor ( =ssh rmfs14= ) I see:

<pre>
</>hpiLO-> show /system1/log1/record15

status=0
status_tag=COMMAND COMPLETED
Wed Dec 26 14:12:30 2012

/system1/log1/record15
  Targets
  Properties
    number=15
    severity=Critical
    date=12/25/2012
    time=09:16
    description=ASR Detected by System ROM     <---------
  Verbs
    cd version exit show set
</pre>

This points to an outdated [[http://h30499.www3.hp.com/t5/ProLiant-Servers-ML-DL-SL/quot-ASR-Detected-by-System-ROM-quot-error-message-in-HP/td-p/4732825][RAID controller FW]], or possibly a broken CPU or a bug in the HP iLO3. For the time being I have taken the following actions:
   * Updated the [[NodeTypeFileServerHP#HW_Raid_controller][RAID controller FW]]; a server reboot is still needed, which I'll do during the scheduled PSI downtime in Jan '13.
   * Updated the [[HPProLiantDL380G7ILO3#Firmware_update][HP iLO3]], which rebooted automatically.
   * Downloaded the latest RDAC from NetApp; it still has to be compiled.
   * Updated the Linux kernel to get a more recent RAID controller Linux driver; a reboot + [[NodeTypeFileServerHP#LSI_RDAC_Redundant_Dual_Active_C][RDAC]] compilation + reboot is still needed, which I'll also do during the scheduled PSI downtime in Jan '13.

After the automatic reboot everything seemed to work fine except the dCache pools (pools unavailable, automatic checks failing). I checked the mounted file systems, issued =/opt/d-cache/bin/dcache restart= and waited 10 minutes, after which the SE operations went back to normal.

-- Main.DanielMeister - 2012-12-26
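
Addendum to the IPMI check above: a minimal Python sketch of how the =ipmitool ... sel elist= output could be screened for events on the day of the incident. The function names =parse_sel_elist= and =events_on= are hypothetical helpers, not part of any existing tool, and the column layout is assumed to be exactly the one shown in the output above (ID, date, time, sensor, event, state, separated by =|=). Record IDs are parsed as hexadecimal, which is my understanding of how ipmitool prints them; for the IDs 1 to 4 seen here it makes no difference.

```python
from datetime import date, datetime

# Sample copied from the ipmitool output quoted in this log entry.
SAMPLE = """\
[root@t3admin01 ~]# ipmitool -I lanplus -H rmfs14 -U root -f /root/private/ipmi-pw sel elist
   1 | 05/05/2011 | 08:53:33 | Power Supply Power Supply 2 | Failure detected | Asserted
   2 | 06/22/2011 | 15:33:17 | Power Supply Power Supply 1 | Failure detected | Asserted
   3 | 06/22/2011 | 15:36:55 | Power Supply Power Supply 1 | Failure detected | Asserted
   4 | 02/06/2012 | 15:21:30 | Power Supply Power Supply 2 | Failure detected | Asserted
"""


def parse_sel_elist(text):
    """Parse `sel elist` style text into a list of event dicts.

    Lines that do not have six '|'-separated fields, or whose ID or
    timestamp cannot be parsed, are skipped (e.g. the shell prompt line).
    """
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 6:
            continue
        try:
            record_id = int(fields[0], 16)  # assumption: IDs are hex
            stamp = datetime.strptime(
                fields[1] + " " + fields[2], "%m/%d/%Y %H:%M:%S"
            )
        except ValueError:
            continue
        events.append(
            {
                "id": record_id,
                "time": stamp,
                "sensor": fields[3],
                "event": fields[4],
                "state": fields[5],
            }
        )
    return events


def events_on(events, day):
    """Return only the events whose timestamp falls on the given date."""
    return [e for e in events if e["time"].date() == day]


if __name__ == "__main__":
    all_events = parse_sel_elist(SAMPLE)
    incident = events_on(all_events, date(2012, 12, 25))
    print(len(all_events), "SEL entries,", len(incident), "on the incident day")
```

With the sample above the filter returns nothing for 2012-12-25, which matches the observation that the SEL holds only older power supply events and that the ASR record had to be looked up in the iLO log instead.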
Topic revision: r4 - 2012-12-27 - FabioMartinelli