Error messages from the system log, for reference:
[root@hh-yun-compute-130125 ~]# cat /var/log/messages | grep -i error
Mar 1 04:58:05 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 1 04:58:06 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x16113a9000 => socket=1, Channel=2(mask=4), rank=0
Mar 1 10:27:08 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 1 10:27:09 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x15e1c49000 => socket=1, Channel=2(mask=4), rank=0
Mar 1 13:52:56 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 1 13:52:57 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x160e949000 => socket=1, Channel=2(mask=4), rank=0
Mar 2 04:16:56 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 2 04:16:56 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 2 04:16:57 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x1613a61000 => socket=1, Channel=2(mask=4), rank=0
Mar 2 04:16:57 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x1613a79000 => socket=1, Channel=2(mask=4), rank=0
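Every CE line in the log above names the same DIMM label, which is easier to see once the events are tallied. The following is a minimal sketch of that tally, assuming the line format is exactly the one quoted above (other kernel versions may word these messages differently). The names `CE_LINE`, `parse_ce_events`, `count_by_label` and `SAMPLE` are made up for this illustration.

```python
import re
from collections import Counter

# Pattern modelled on the EDAC lines quoted in this post (an assumption:
# other kernels or drivers may format the message differently).
CE_LINE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE row (?P<row>\d+), channel (?P<channel>\d+), '
    r'label "(?P<label>[^"]*)".*?addr = (?P<addr>0x[0-9a-fA-F]+)'
)


def parse_ce_events(log_text):
    """Return one dict per correctable-error (CE) line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = CE_LINE.search(line)
        if match is None:
            continue  # e.g. the "sbridge: HANDLING MCE MEMORY ERROR" lines
        event = match.groupdict()
        for key in ("mc", "row", "channel"):
            event[key] = int(event[key])
        event["addr"] = int(event["addr"], 16)
        events.append(event)
    return events


def count_by_label(events):
    """Count CE events per DIMM label."""
    return Counter(event["label"] for event in events)


# Three lines copied from the log above.
SAMPLE = '''Mar 1 04:58:05 hh-yun-compute-130125 kernel: sbridge: HANDLING MCE MEMORY ERROR
Mar 1 04:58:06 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x16113a9000 => socket=1, Channel=2(mask=4), rank=0
Mar 1 10:27:09 hh-yun-compute-130125 kernel: EDAC MC1: CE row 2, channel 0, label "CPU_SrcID#1_Channel#2_DIMM#0": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=6 Err=0008:00c2 (ch=2), addr = 0x15e1c49000 => socket=1, Channel=2(mask=4), rank=0
'''

if __name__ == "__main__":
    for label, count in count_by_label(parse_ce_events(SAMPLE)).items():
        print(label, count)
```

To run it against the real log, the contents of /var/log/messages can be passed to `parse_ce_events` in place of `SAMPLE`.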
Reference information 2:
[root@hh-yun-compute-130125 ~]# cat /sys/devices/system/edac/mc/mc?/ce*count
0
0
8
0
[root@hh-yun-compute-130125 ~]# cat /sys/devices/system/edac/mc/mc1/ce_count
8
Module information
[root@hh-yun-compute-130125 ~]# modinfo sb_edac
filename: /lib/modules/2.6.32-504.3.3.el6.x86_64/kernel/drivers/edac/sb_edac.ko
description: MC Driver for Intel Sandy Bridge and Ivy Bridge memory controllers - Ver: 1.1.0
author: Red Hat Inc. (http://www.redhat.com)
author: Mauro Carvalho Chehab
license: GPL
srcversion: 01CFEEBE911D55B6FE660BE
alias: pci:v00008086d00002FA0sv*sd*bc*sc*i*
alias: pci:v00008086d00000EA8sv*sd*bc*sc*i*
alias: pci:v00008086d00003CA8sv*sd*bc*sc*i*
depends: edac_core
vermagic: 2.6.32-504.3.3.el6.x86_64 SMP mod_unload modversions
parm: edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int)
[root@hh-yun-compute-130125 ~]# modinfo edac_core
filename: /lib/modules/2.6.32-504.3.3.el6.x86_64/kernel/drivers/edac/edac_core.ko
description: Core library routines for EDAC reporting
author: Doug Thompson www.softwarebitmaker.com, et al
license: GPL
srcversion: C21E296292A2174839A086C
depends:
vermagic: 2.6.32-504.3.3.el6.x86_64 SMP mod_unload modversions
parm: check_pci_errors:Check for PCI bus parity errors: 0=off 1=on (int)
parm: edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm: edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm: edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm: edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)
parm: edac_mc_poll_msec:Polling period in milliseconds
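The `parm:` lines are the part of the modinfo output that matters for tuning, since they list the knobs each module exposes (for example `edac_mc_log_ce` and `edac_mc_poll_msec`). As a rough sketch, they can be pulled out of the output into a dictionary. The helper names `parse_parms` and `modinfo_parms` are invented here, and the parsing assumes the `parm: name:description` layout shown above.

```python
import subprocess


def parse_parms(modinfo_text):
    """Map each module parameter name to its description.

    Assumes lines of the form "parm: <name>:<description>", as in the
    modinfo output quoted in this post.
    """
    parms = {}
    for line in modinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or key.strip() != "parm":
            continue  # filename, author, alias, ... are skipped
        name, _, description = value.strip().partition(":")
        parms[name.strip()] = description.strip()
    return parms


def modinfo_parms(module):
    """Run `modinfo <module>` and parse its parm lines (needs modinfo in PATH)."""
    result = subprocess.run(
        ["modinfo", module], capture_output=True, text=True, check=True
    )
    return parse_parms(result.stdout)


if __name__ == "__main__":
    sample = (
        "license: GPL\n"
        "parm: edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)\n"
        "parm: edac_mc_poll_msec:Polling period in milliseconds\n"
    )
    for name, description in parse_parms(sample).items():
        print(name, "->", description)
```

On the host itself, `modinfo_parms("edac_core")` would return the same information without copying the text by hand.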
Official explanation:
Total Correctable Errors count attribute file: 'ce_count' This attribute file displays the total count of correctable errors that have occurred on this csrow. This count is very important to examine. CEs provide early indications that a DIMM is beginning to fail. This count field should be monitored for non-zero values and report such information to the system administrator.
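The advice to monitor this count for non-zero values can be automated. Below is a small sketch that reads `ce_count` for every memory controller under the sysfs path used earlier in this post and reports the ones with a non-zero value. The function names are hypothetical, and the directory is a parameter so the code can also be pointed at a test tree.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller name: ce_count} for every mc*/ce_count under edac_root."""
    counts = {}
    for path in sorted(Path(edac_root).glob("mc*/ce_count")):
        try:
            counts[path.parent.name] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric file: skip this controller
    return counts


def controllers_with_errors(counts):
    """Keep only the controllers whose correctable-error count is non-zero."""
    return {name: count for name, count in counts.items() if count > 0}


if __name__ == "__main__":
    for name, count in controllers_with_errors(read_ce_counts()).items():
        print(f"{name}: {count} correctable error(s)")
```

Run from cron or a monitoring agent, a non-empty result is the signal to notify the administrator. On the host in this post it would report mc1 with a count of 8.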
Enabling mcelog
[root@hh-yun-compute-130125 ~]# service mcelogd restart
Stopping mcelog [OK]
Starting mcelog daemon [OK]
[root@hh-yun-compute-130125 ~]# mcelog
mcelog: Family 6 Model 3e CPU: only decoding architectural errors
Checking the log
[root@hh-yun-compute-130125 ~]# tail /var/log/mcelog
mcelog: failed to prefill DIMM database from DMI data
mcelog: mcelog server already running
Assessment
This is a harmless warning message. The DIMM database prefill relies on a specific non-standard format of the DIMMs in the DMI BIOS tables. If this format is not used by the BIOS, mcelog will only discover DIMMs as they get their first error (if the CPU reports DIMMs in machine check errors). Please understand for the most part, mcelog should be ignored.
We therefore decided to ignore this message.
Reposted from: http://nonni.baihongyu.com/