Symptoms
Your server reboots frequently (once a day or even more often) with a "Kernel panic - not syncing" message. If there is no apparent reason for this behavior, collect a kdump report and check the corresponding vmcore-dmesg.txt file. You may find errors like these there:
mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 0: b200000000030005
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffbcd313bb> {intel_idle+0xeb/0x205}
mce: [Hardware Error]: TSC 32472304351e0
mce: [Hardware Error]: PROCESSOR 0:906ed TIME 1580407612 SOCKET 0 APIC a microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
It can also report something like this:
mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 2: b200000000000014
mce: [Hardware Error]: RIP !INEXACT! 33:<00002b92644395bc>
mce: [Hardware Error]: TSC 1cfd79f052daee
mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1585837096 SOCKET 0 APIC e microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check
or simply
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
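The check above can be sketched as a small shell helper that pulls the relevant lines out of a vmcore-dmesg.txt file. This is only an illustration: the function name find_mce_lines is hypothetical, and /var/crash is assumed to be the kdump dump directory (it is a common default, but your /etc/kdump.conf may point elsewhere).

```shell
# Hypothetical helper: print the machine-check related lines from one
# vmcore-dmesg.txt file passed as the first argument.
find_mce_lines() {
    grep -E 'Machine Check Exception|Fatal machine check|Not all CPUs entered broadcast exception handler' "$1"
}

# Assumption: kdump stores its reports under /var/crash/<dump-dir>/.
# Adjust the path if /etc/kdump.conf uses a different location.
for f in /var/crash/*/vmcore-dmesg.txt; do
    [ -r "$f" ] || continue      # skip if nothing matched or file is unreadable
    echo "== $f"
    find_mce_lines "$f"
done
```

If the loop prints nothing, either no dumps were collected or none of them contain these messages.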
Running mcelog --ascii gives similar reports:
# mcelog --ascii
[40584.877188] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200000000030005
[40584.877206] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffb513035b> {intel_idle+0xeb/0x205}
[40584.877224] mce: [Hardware Error]: TSC 84fd8ad9dd24
[40584.877234] mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1567895049 SOCKET 0 APIC 6 microcode ae
[40584.877188] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200000000030005
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 84fd8ad9dd24
RIP !INEXACT! 10:ffffffffb513035b
TIME 1567895049 Sat Sep 7 23:24:09 2019
MCG status:
MCi status:
Machine check not valid
Corrected error
MCA: No Error
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 158
RIP: intel_idle+0xeb/0x205}
SOCKET 0 APIC 6 microcode ae
The server also uses the intel_idle driver:
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
The CPU models on which we see these problems most often include:
- Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
- Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
- Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Root Cause
Your CPU is likely using the more sophisticated C-state (idle power state) logic of modern CPUs. These panics can occur when the intel_idle driver does not work properly with the hardware C-states. Similar problems have been reported on different distributions.
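To see which C-states the idle driver currently exposes, you can read the standard cpuidle sysfs entries. The sketch below is illustrative only: the function name list_cstates is hypothetical, and it inspects cpu0 on the assumption that the other CPUs expose the same states.

```shell
# Hypothetical helper: list the idle states found under a cpuidle directory.
# Defaults to cpu0; pass another directory as the first argument if needed.
list_cstates() {
    cpuidle_dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$cpuidle_dir"/state*; do
        [ -r "$state/name" ] || continue   # no states exposed, or not readable
        printf '%s %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

list_cstates
```

Each output line is the sysfs state directory followed by the state name. Seeing several deep states listed is consistent with the situation described above, though it does not by itself prove that C-states are the cause.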
Solution
To work around this, configure the intel_idle driver to use simplified C-state logic. To do that, add the intel_idle.max_cstate=1 entry to the kernel command line in /etc/default/grub.
For instance: GRUB_CMDLINE_LINUX="biosdevname=0 crashkernel=auto nomodeset rd.auto=1 consoleblank=0 intel_idle.max_cstate=1"
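If you prefer to script the change instead of editing the file by hand, something like the following could work. Treat it as a sketch under assumptions: add_max_cstate is a hypothetical helper, it expects a single-line double-quoted GRUB_CMDLINE_LINUX="..." entry as in the example above, and it relies on GNU sed for the -i option. Back up the file first.

```shell
# Hypothetical helper: append intel_idle.max_cstate=1 to GRUB_CMDLINE_LINUX
# in the file given as the first argument. Does nothing if the file already
# mentions intel_idle.max_cstate= anywhere (including in comments).
add_max_cstate() {
    grub_file="$1"
    if grep -q 'intel_idle\.max_cstate=' "$grub_file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote of the value.
    sed -i 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 intel_idle.max_cstate=1"/' "$grub_file"
}

# Example usage (run as root; left commented out on purpose):
# cp -a /etc/default/grub /etc/default/grub.bak
# add_max_cstate /etc/default/grub
```

Review /etc/default/grub afterwards to confirm the line looks the way you expect.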
Then regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg
You will need to reboot your server to apply the new configuration.
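After the reboot, you may want to confirm that the parameter actually took effect. The following is a minimal sketch: the function name cmdline_has_max_cstate is hypothetical, while /proc/cmdline and /sys/module/intel_idle/parameters/max_cstate are the standard kernel interfaces it reads.

```shell
# Hypothetical helper: succeed if the kernel command line passed as the
# first argument contains intel_idle.max_cstate=1 as a separate word.
cmdline_has_max_cstate() {
    case " $1 " in
        *" intel_idle.max_cstate=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if cmdline_has_max_cstate "$(cat /proc/cmdline)"; then
    echo "intel_idle.max_cstate=1 is on the kernel command line"
else
    echo "intel_idle.max_cstate=1 is NOT on the kernel command line"
fi

# The driver also reports the value it is using, if the file is present.
param=/sys/module/intel_idle/parameters/max_cstate
[ -r "$param" ] && echo "intel_idle max_cstate: $(cat "$param")"
```

If the parameter is missing, check that the GRUB configuration was regenerated at the path your system actually boots from.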