{{tag>Brouillon Matériel CPU Mémoire Arch}}
= Memory / CPU hardware problems: mcelog
** The mcelog package is no longer supported on kernels 4.12 and later. rasdaemon can be used as a replacement. **
rasdaemon: utility to receive RAS error tracings.
rasdaemon is a RAS (Reliability, Availability and Serviceability) logging
tool. It currently records memory errors, using the EDAC tracing events.
EDAC are drivers in the Linux kernel that handle detection of ECC errors
from memory controllers for most chipsets on x86 and ARM architectures.
This userspace component consists of an init script which makes sure EDAC
drivers and DIMM labels are loaded at system startup, as well as a utility
for reporting current error counts from the EDAC sysfs files.
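The error counts mentioned above can also be read by hand. The following is a minimal sketch, assuming the usual EDAC sysfs layout (''/sys/devices/system/edac/mc/mcN/ce_count'' and ''ue_count''); the function name ''edac_counts'' is made up for this page, and the optional argument exists only so the function can be pointed at a copy of the tree.

```shell
# Print corrected (ce_count) and uncorrected (ue_count) error totals
# for every EDAC memory controller found under the given sysfs root.
# edac_counts is a hypothetical helper name, not part of rasdaemon or mcelog.
edac_counts() {
    local root="${1:-/sys/devices/system/edac/mc}"
    local mc ce ue
    for mc in "$root"/mc[0-9]*; do
        # Skip the unexpanded glob when no controller is registered.
        [ -d "$mc" ] || continue
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "?")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "${mc##*/}: corrected=$ce uncorrected=$ue"
    done
}

edac_counts
```

If no EDAC driver is loaded, the directory is empty or missing and nothing is printed.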
== I. Enable memory error reporting
http://www.mcelog.org/faq.html
chkconfig mcelog on
rcmcelog start
''/etc/cron.hourly/mcelog.cron''
#!/bin/bash
# Exit early if mcelog does not support this CPU/kernel.
/usr/sbin/mcelog --supported > /dev/null 2>&1
if [ $? -eq 1 ]; then
    exit 1
fi
# Append newly decoded machine check events to the log.
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
http://askubuntu.com/questions/605369/mce-hardware-error-machine-check-events-logged-appears-in-syslog-what-sho
sudo apt-get install mcelog
The events will be logged to /var/log/mcelog. You can also run:
sudo mcelog --client
== II
# mcelog
mcelog: AMD Processor family 18: Please use the edac_mce_amd module instead.
: Success
CPU is unsupported
lsmod |grep edac_mce_amd
modprobe edac_mce_amd
echo edac_mce_amd >> /etc/modules
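Running the ''echo ... >> /etc/modules'' line more than once leaves duplicate entries. A possible guard, sketched here with a made-up helper name (''ensure_line''), appends the line only when it is not already present:

```shell
# Append a line to a file only if that exact line is not already there.
# ensure_line is a hypothetical helper, shown only as an illustration.
ensure_line() {
    local line="$1" file="$2"
    # -x: match the whole line, -F: fixed string, -q: quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || echo "$line" >> "$file"
}

# Usage (as root):
#   ensure_line edac_mce_amd /etc/modules
```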
== III
http://www.advancedclustering.com/act-kb/what-are-machine-check-exceptions-or-mce/
Paste or type the error message into a file, then run it through mcelog, for example:
/usr/sbin/mcelog --k8 --ascii < myerror
Use the ''--k8'' option if you are using an AMD Opteron or Athlon 64 processor, or replace it with ''--p4'' for a Pentium 4 or Xeon. Here is the output from the previous MCE error:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC b0ce27165dd3
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 3700
bit32 = err cpu0
bit45 = uncorrected ecc error
bit57 = processor context corrupt
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS f600200137080813 MCGSTATUS 4
This output shows that an uncorrected ECC error occurred, which means one of your memory modules has failed. For further analysis, please submit a support ticket with the complete MCE error message and the output of mcelog.
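The ''bitNN'' lines in the decoded output can be cross-checked against the raw STATUS value. The sketch below is only an illustration: the function names ''mce_bit_set'' and ''describe_status'' are made up, and the bit labels are copied from the output above rather than from any processor manual, so they apply to this example only. The 64-bit value is split into two 32-bit halves to avoid relying on how the shell handles signed 64-bit overflow.

```shell
# Succeed if the given bit (0-63) is set in a 16-digit hex STATUS value.
mce_bit_set() {
    local status="$1" bit="$2" word
    if [ "$bit" -ge 32 ]; then
        word=$(printf '%s' "$status" | cut -c1-8)    # bits 63..32
        bit=$((bit - 32))
    else
        word=$(printf '%s' "$status" | cut -c9-16)   # bits 31..0
    fi
    [ $(( (0x$word >> bit) & 1 )) -eq 1 ]
}

# Print the labels (taken from the mcelog output above) for each set bit.
describe_status() {
    local status="$1" entry
    for entry in "32:err cpu0" "45:uncorrected ecc error" \
                 "57:processor context corrupt" "61:error uncorrected" \
                 "62:error overflow (multiple errors)"; do
        if mce_bit_set "$status" "${entry%%:*}"; then
            echo "bit${entry%%:*} = ${entry#*:}"
        fi
    done
}

describe_status f600200137080813
```

For the STATUS value shown above, all five bits are set, so the sketch prints the same five ''bitNN'' lines that mcelog reported.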