Memory and CPU hardware problems: mcelog

The mcelog package is no longer supported on kernels 4.12 and later. rasdaemon can be used as a replacement.
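The version cutoff above can be turned into a quick check. This is a minimal sketch, assuming the 4.12 threshold stated on this page is the only criterion; the function name suggested_mce_tool is hypothetical and simply parses a kernel release string such as the output of uname -r.

```shell
#!/bin/sh
# Print which tool this page suggests for a given kernel release string.
# Assumption: 4.12 is the cutoff, as stated in the note above.
suggested_mce_tool() {
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%.*}             # second version component
    minor=${minor%%[!0-9]*}       # drop suffixes such as "-rc1"
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 12 ]; }; then
        echo rasdaemon
    else
        echo mcelog
    fi
}

# Apply it to the running kernel.
suggested_mce_tool "$(uname -r)"
```

Distribution kernels sometimes backport or drop features independently of the version number, so treat the result as a hint rather than a guarantee.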

rasdaemon

utility to receive RAS error tracings
 rasdaemon is a RAS (Reliability, Availability and Serviceability) logging
 tool.  It currently records memory errors, using the EDAC tracing events.
 EDAC are drivers in the Linux kernel that handle detection of ECC errors
 from memory controllers for most chipsets on x86 and ARM architectures.
 This userspace component consists of an init script which makes sure EDAC
 drivers and DIMM labels are loaded at system startup, as well as a utility
 for reporting current error counts from the EDAC sysfs files
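The reporting utility mentioned in the description is ras-mc-ctl. The sketch below assumes the rasdaemon package is already installed and its service is running (for example via the distribution's package manager and service manager, which are not shown here); the wrapper function ras_report is hypothetical and only guards against the tool being absent.

```shell
#!/bin/sh
# Show EDAC driver status and the error summary recorded by rasdaemon.
# Assumption: ras-mc-ctl is shipped with the rasdaemon package.
ras_report() {
    if ! command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "ras-mc-ctl not found; install the rasdaemon package first"
        return 1
    fi
    ras-mc-ctl --status     # reports whether the EDAC drivers are loaded
    ras-mc-ctl --summary    # summarizes the errors rasdaemon has recorded
}

ras_report || true
```

Reading the summary may require root privileges, depending on where the distribution stores the rasdaemon database.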

I. Enable memory error reporting

http://www.mcelog.org/faq.html

chkconfig mcelog on
rcmcelog start
Contents of /etc/cron.hourly/mcelog.cron:
#!/bin/bash

# Stop here if mcelog does not support this CPU.
if ! /usr/sbin/mcelog --supported > /dev/null 2>&1; then
       exit 1
fi

# Decode pending machine check events and append them to the log.
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

http://askubuntu.com/questions/605369/mce-hardware-error-machine-check-events-logged-appears-in-syslog-what-sho

sudo apt-get install mcelog

The events will be logged to /var/log/mcelog. You can also run:

sudo mcelog --client 
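To see at a glance whether anything has been logged, the events in the log file can be counted. This is a rough sketch: it assumes each decoded event starts with a "HARDWARE ERROR" banner, as in the sample output later on this page, or with "Hardware event", which some other mcelog versions print. The function name count_mce_events is hypothetical.

```shell
#!/bin/sh
# Count decoded machine check events in an mcelog log file.
# Assumption: every event begins with one of the banner lines matched below.
count_mce_events() {
    log=${1:-/var/log/mcelog}
    if [ ! -r "$log" ]; then
        echo 0                  # missing or unreadable log: report no events
        return 0
    fi
    # -c counts matching lines, -i ignores case, -E enables alternation
    grep -ciE '^(hardware error|hardware event)' "$log" || true
}

echo "events logged: $(count_mce_events)"
```

A non-zero count only says that events were recorded; the entries themselves still need to be read to tell corrected errors from uncorrected ones.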

II

# mcelog
mcelog: AMD Processor family 18: Please use the edac_mce_amd module instead.
: Success
CPU is unsupported
lsmod |grep edac_mce_amd
modprobe edac_mce_amd
echo edac_mce_amd >> /etc/modules
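Running the echo line above more than once appends the module name to /etc/modules each time. A possible guard is sketched below; the helper name ensure_module_listed is hypothetical, and the demonstration writes to a scratch file so that it can be tried without root.

```shell
#!/bin/sh
# Append a module name to a modules file only if it is not already listed.
ensure_module_listed() {
    module=$1
    file=${2:-/etc/modules}
    # -x matches whole lines, -F treats the name literally, -q stays quiet
    if ! grep -qxF "$module" "$file" 2>/dev/null; then
        echo "$module" >> "$file"
    fi
}

# Demonstration against a scratch file instead of the real /etc/modules.
demo_file=$(mktemp)
ensure_module_listed edac_mce_amd "$demo_file"
ensure_module_listed edac_mce_amd "$demo_file"   # second call changes nothing
cat "$demo_file"
rm -f "$demo_file"
```

On the real system the call would be ensure_module_listed edac_mce_amd, run as root.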

III

http://www.advancedclustering.com/act-kb/what-are-machine-check-exceptions-or-mce/

Paste or type the error message into a file, then run it through mcelog, for example:

/usr/sbin/mcelog --k8 --ascii < myerror

Use the --k8 option if you are using an AMD Opteron or Athlon 64 processor, or replace it with --p4 for a Pentium 4 or Xeon. Here is an example of the decoded output:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC b0ce27165dd3
 Northbridge Chipkill ECC error
 Chipkill ECC syndrome = 3700
 bit32 = err cpu0
 bit45 = uncorrected ecc error
 bit57 = processor context corrupt
 bit61 = error uncorrected
 bit62 = error overflow (multiple errors)
 bus error 'local node origin, request didn't time out
 generic read mem transaction
 memory access, level generic'
STATUS f600200137080813 MCGSTATUS 4

This output shows that an uncorrected ECC error occurred, which means that one of the memory modules has failed. For further analysis, submit a support ticket to your hardware vendor with the complete MCE error message and the output of mcelog.
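The bit lines in the decoded output can be cross-checked against the raw STATUS value. The sketch below uses only the bit meanings printed by mcelog in the sample above (bits 45, 57, 61 and 62); the helper name status_bit_set is hypothetical, and it assumes STATUS is given as exactly 16 hexadecimal digits. The value is split into two 32-bit halves so the arithmetic stays within what a plain POSIX shell handles.

```shell
#!/bin/sh
# Return 0 if bit $2 (0-63) is set in the 16-digit hex value $1.
status_bit_set() {
    _hex=$1
    _bit=$2
    _hi=${_hex%????????}        # upper 32 bits (first 8 hex digits)
    _lo=${_hex#"$_hi"}          # lower 32 bits (last 8 hex digits)
    if [ "$_bit" -ge 32 ]; then
        _word=$((0x$_hi))
        _bit=$((_bit - 32))
    else
        _word=$((0x$_lo))
    fi
    [ $(( (_word >> _bit) & 1 )) -eq 1 ]
}

# STATUS value and bit labels taken from the sample output above.
status=f600200137080813
for entry in "45:uncorrected ecc error" \
             "57:processor context corrupt" \
             "61:error uncorrected" \
             "62:error overflow (multiple errors)"; do
    n=${entry%%:*}
    label=${entry#*:}
    if status_bit_set "$status" "$n"; then
        echo "bit$n = $label"
    fi
done
```

Bit 61 is the one that separates this case from a corrected error: when it is set, the hardware could not repair the data.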
