
13. 03. 2015 t3wn[30-40] RAM errors

EDAC RAM errors on one server

sframe_main[156845]: segfault at 100000018 ip 0000003013e75ef5 sp 00007fffa538aab0 error 4 in libc-2.12.so[3013e00000+18a000]
sframe_main[156553]: segfault at 100000018 ip 0000003013e75ef5 sp 00007fff051f8a60 error 4 in libc-2.12.so[3013e00000+18a000]
sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 8: 8c00004e000800c0
TSC 0 ADDR 658238000 MISC 908440004001c8c PROCESSOR 0:206d7 TIME 1425414669 SOCKET 1 APIC 20
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x658238000 => socket=1, Channel=0(mask=1), rank=1

sframe_main[22853]: segfault at 100000018 ip 0000003013e75ef5 sp 00007fff6cdff7c0 error 4 in libc-2.12.so[3013e00000+18a000]
sframe_main[23825]: segfault at 100000036 ip 0000003013e75ef5 sp 00007fffb208f020 error 4 in libc-2.12.so[3013e00000+18a000]
sframe_main[23827]: segfault at 100000036 ip 0000003013e75ef5 sp 00007fff6ea0ca70 error 4 in libc-2.12.so[3013e00000+18a000]
sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 5: 8c00004000010090
TSC 0 ADDR 658238600 MISC 421efe86 PROCESSOR 0:206d7 TIME 1425716125 SOCKET 1 APIC 20
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=8 Err=0001:0090 (ch=0), addr = 0x658238600 => socket=1, Channel=0(mask=1), rank=1

sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 8: 8c00004e000800c0
TSC 0 ADDR 658238000 MISC 908440004001c8c PROCESSOR 0:206d7 TIME 1425757548 SOCKET 1 APIC 20
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x658238000 => socket=1, Channel=0(mask=1), rank=1

...

sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 5: 8c00004000010090
TSC 0 ADDR 658238600 MISC 4214f486 PROCESSOR 0:206d7 TIME 1425986107 SOCKET 1 APIC 20
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=8 Err=0001:0090 (ch=0), addr = 0x658238600 => socket=1, Channel=0(mask=1), rank=1

sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 8: 8c00004e000800c0
TSC 0 ADDR 658238000 MISC 908440004001c8c PROCESSOR 0:206d7 TIME 1426091335 SOCKET 1 APIC 20
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x658238000 => socket=1, Channel=0(mask=1), rank=1

sbridge: HANDLING MCE MEMORY ERROR
CPU 8: Machine Check Exception: 0 Bank 5: 8c00004000010090
TSC 0 ADDR 658238600 MISC 421cfc86 PROCESSOR 0:206d7 TIME 1426193819 SOCKET 1 APIC 20
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:   section_type: memory error
[Firmware Warn]: error section length is too small
EDAC MC1: CE row 0, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=8 Err=0001:0090 (ch=0), addr = 0x658238600 => socket=1, Channel=0(mask=1), rank=1
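
The status words and TIME fields in the records above can be decoded by hand to cross-check what EDAC prints. The following is a minimal sketch, not a tool used on the site. It assumes the standard Intel IA32_MCi_STATUS layout (bit 63 valid, bit 61 uncorrected, bit 58 address valid, bits 52:38 corrected-error count, bits 31:16 model-specific code, bits 15:0 MCA code), the memory-controller code pattern 0000 0000 1MMM CCCC, and that TIME is Unix epoch seconds. The names `decode_status` and `parse_records` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Two records copied from the log above (one bank 8, one bank 5).
LOG = """\
CPU 8: Machine Check Exception: 0 Bank 8: 8c00004e000800c0
TSC 0 ADDR 658238000 MISC 908440004001c8c PROCESSOR 0:206d7 TIME 1425414669 SOCKET 1 APIC 20
CPU 8: Machine Check Exception: 0 Bank 5: 8c00004000010090
TSC 0 ADDR 658238600 MISC 421efe86 PROCESSOR 0:206d7 TIME 1425716125 SOCKET 1 APIC 20
"""

# MMM field of the memory-controller error code (assumed layout, see lead-in).
MEM_OPS = {0: "generic undefined request", 1: "memory read", 2: "memory write",
           3: "address/command", 4: "memory scrubbing"}

BANK_RE = re.compile(r"Bank (\d+): ([0-9a-f]+)")
TIME_RE = re.compile(r"ADDR ([0-9a-f]+) .*TIME (\d+)")


def decode_status(status):
    """Split a 64-bit machine-check status word into the fields of interest."""
    mca = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        # Same "model-specific:MCA" form that EDAC prints as Err=....
        "err": "%04x:%04x" % ((status >> 16) & 0xFFFF, mca),
    }
    if (mca & 0xFF80) == 0x0080:  # memory-controller error pattern
        info["operation"] = MEM_OPS.get((mca >> 4) & 0x7, "reserved")
        info["channel"] = mca & 0xF
    return info


def parse_records(text):
    """Pair each 'Bank' line with the following 'TIME' line."""
    records, current = [], None
    for line in text.splitlines():
        bank = BANK_RE.search(line)
        if bank:
            current = {"bank": int(bank.group(1))}
            current.update(decode_status(int(bank.group(2), 16)))
            continue
        when = TIME_RE.search(line)
        if when and current is not None:
            current["addr"] = int(when.group(1), 16)
            stamp = datetime.fromtimestamp(int(when.group(2)), tz=timezone.utc)
            current["time"] = stamp.strftime("%Y-%m-%d %H:%M:%S")
            records.append(current)
            current = None
    return records


if __name__ == "__main__":
    for rec in parse_records(LOG):
        print(rec["time"], "UTC bank", rec["bank"], rec["err"],
              rec.get("operation"), "uncorrected=%s" % rec["uncorrected"])
```

Under these assumptions the first record decodes to 2015-03-03 20:31:09 UTC, Err 0008:00c0, memory scrubbing on channel 0, with the uncorrected bit clear and a corrected-error count of 1. This agrees with the "CE" (corrected error) tag and the Err field in the EDAC lines.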

EDAC RAM analysis

Only the relevant output is shown:
[root@t3admin01 ~]# salt 't3wn3*' cmd.run 'edac-util'
t3wn33.psi.ch:
    mc0: csrow0: CPU_SrcID#0_Channel#0_DIMM#0: 108 Corrected Errors
    mc0: csrow2: CPU_SrcID#0_Channel#2_DIMM#0: 119 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 52 Corrected Errors
    mc1: csrow2: CPU_SrcID#1_Channel#2_DIMM#0: 15 Corrected Errors
t3wn39.psi.ch:
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 10 Corrected Errors
t3wn34.psi.ch:
    mc0: csrow0: CPU_SrcID#0_Channel#0_DIMM#0: 1 Corrected Errors
    mc0: csrow2: CPU_SrcID#0_Channel#2_DIMM#0: 2 Corrected Errors

[root@t3admin01 ~]# salt 't3wn4*' cmd.run 'edac-util'
t3wn40.psi.ch:
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 10 Corrected Errors 
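
To rank the nodes by corrected-error count, the salt output above can be tallied per host. The following is a minimal sketch that assumes the output format is exactly as pasted here (a host line ending in a colon, followed by indented `edac-util` lines). The name `tally_corrected` is hypothetical.

```python
import re
from collections import defaultdict

# Output copied from the two salt runs above.
SALT_OUTPUT = """\
t3wn33.psi.ch:
    mc0: csrow0: CPU_SrcID#0_Channel#0_DIMM#0: 108 Corrected Errors
    mc0: csrow2: CPU_SrcID#0_Channel#2_DIMM#0: 119 Corrected Errors
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 52 Corrected Errors
    mc1: csrow2: CPU_SrcID#1_Channel#2_DIMM#0: 15 Corrected Errors
t3wn39.psi.ch:
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 10 Corrected Errors
t3wn34.psi.ch:
    mc0: csrow0: CPU_SrcID#0_Channel#0_DIMM#0: 1 Corrected Errors
    mc0: csrow2: CPU_SrcID#0_Channel#2_DIMM#0: 2 Corrected Errors
t3wn40.psi.ch:
    mc1: csrow0: CPU_SrcID#1_Channel#0_DIMM#0: 10 Corrected Errors
"""

# Indented edac-util line: controller, csrow, DIMM label, count.
DIMM_RE = re.compile(r"^\s+(mc\d+): (csrow\d+): (\S+): (\d+) Corrected Errors")


def tally_corrected(text):
    """Return {host: total corrected errors} from salt cmd.run output."""
    totals = defaultdict(int)
    host = None
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped and not line[0].isspace() and stripped.endswith(":"):
            host = stripped[:-1]  # unindented "hostname:" line
            continue
        match = DIMM_RE.match(line)
        if match and host is not None:
            totals[host] += int(match.group(4))
    return dict(totals)


if __name__ == "__main__":
    totals = tally_corrected(SALT_OUTPUT)
    for host, count in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%-16s %d" % (host, count))
```

On the output above this gives 294 corrected errors for t3wn33, 10 each for t3wn39 and t3wn40, and 3 for t3wn34.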




Topic revision: r2 - 2015-03-13 - FabioMartinelli
 