Kernel panic - not syncing – CloudLinux

Symptoms

Your server reboots frequently (once per day or more often) with a "Kernel panic - not syncing" message. If there is no apparent reason for this behavior, collect a kdump report and check the corresponding vmcore-dmesg.txt file. You may find errors like these there:

mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 0: b200000000030005
mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffbcd313bb> {intel_idle+0xeb/0x205}
mce: [Hardware Error]: TSC 32472304351e0
mce: [Hardware Error]: PROCESSOR 0:906ed TIME 1580407612 SOCKET 0 APIC a microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check

It can also report something like this:

mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 2: b200000000000014
mce: [Hardware Error]: RIP !INEXACT! 33:<00002b92644395bc>
mce: [Hardware Error]: TSC 1cfd79f052daee
mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1585837096 SOCKET 0 APIC e microcode ca
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal machine check

Or simply:

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
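To check a dump for the panic messages shown above without reading the whole log, a small helper along these lines may be useful. It is only a sketch: the function name has_fatal_mce is hypothetical, and the location of vmcore-dmesg.txt depends on your kdump configuration (it is commonly somewhere under /var/crash, but that is an assumption).

```shell
# Hypothetical helper (not part of kdump or mcelog): succeed if the given
# vmcore-dmesg.txt contains one of the machine-check panic messages above.
# Returns 0 on a match, 1 on no match, 2 if the file cannot be read.
has_fatal_mce() {
    [ -r "$1" ] || return 2
    grep -Eq 'Kernel panic - not syncing: (Fatal machine check|Timeout: Not all CPUs entered broadcast exception handler)' "$1"
}
```

For example, has_fatal_mce /path/to/vmcore-dmesg.txt && echo "machine check panic found", where the path is a placeholder for your own dump directory.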

Running mcelog --ascii gives similar reports:

# mcelog --ascii
[40584.877188] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200000000030005
[40584.877206] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffb513035b> {intel_idle+0xeb/0x205}
[40584.877224] mce: [Hardware Error]: TSC 84fd8ad9dd24
[40584.877234] mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1567895049 SOCKET 0 APIC 6 microcode ae
[40584.877188] mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 0: b200000000030005
Hardware event. This is not a software error.
CPU 0 BANK 0 TSC 84fd8ad9dd24 
RIP !INEXACT! 10:ffffffffb513035b
TIME 1567895049 Sat Sep  7 23:24:09 2019
MCG status:
MCi status:
Machine check not valid
Corrected error
MCA: No Error
STATUS 0 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 158
RIP: intel_idle+0xeb/0x205}
SOCKET 0 APIC 6 microcode ae

The server also uses the intel_idle driver:

# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle

The CPU models on which we see these problems most often include:

Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

Root Cause

This means that your CPU is likely using the sophisticated C-state logic of modern CPUs. Issues like these can occur when the intel_idle driver does not work properly with the hardware C-states. Similar problems have been reported on other distributions.

Solution

To work around this, configure the intel_idle driver to use simplified C-state logic. To do that, add the intel_idle.max_cstate=1 entry to the kernel command line in /etc/default/grub.

For instance: GRUB_CMDLINE_LINUX="biosdevname=0 crashkernel=auto nomodeset rd.auto=1 consoleblank=0 intel_idle.max_cstate=1"
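If you prefer to script the edit instead of changing the file by hand, the following is a minimal sketch. The function name add_grub_param is hypothetical, and it assumes that GRUB_CMDLINE_LINUX is defined on a single double-quoted line, as in the example above, and that the parameter contains no "|" or "&" characters. It keeps a backup copy with a .bak suffix and does nothing if the parameter is already present.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX once.
# Usage: add_grub_param PARAM [FILE]   (FILE defaults to /etc/default/grub)
add_grub_param() {
    param="$1"
    grub_file="${2:-/etc/default/grub}"

    # Fail if there is no double-quoted GRUB_CMDLINE_LINUX line to edit.
    grep -q '^GRUB_CMDLINE_LINUX="' "$grub_file" || return 1

    # Extract the current value between the quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)".*$/\1/p' "$grub_file")

    # Already present as a whole word: nothing to do.
    case " $current " in
        *" $param "*) return 0 ;;
    esac

    # Append the parameter inside the closing quote; keep a .bak backup.
    sed -i.bak "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 $param\"|" "$grub_file"
}
```

Running add_grub_param intel_idle.max_cstate=1 as root would then update /etc/default/grub; review the file afterwards before continuing.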

Then regenerate the GRUB configuration: grub2-mkconfig -o /boot/grub2/grub.cfg

You will need to reboot your server to apply the new configuration.
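After the reboot, it is worth confirming that the running kernel actually received the parameter. The sketch below checks the kernel command line for an exact token match; the function name cmdline_has_param is hypothetical, and the optional second argument exists only so the check can be run against a file other than /proc/cmdline.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole token on the
# kernel command line. Usage: cmdline_has_param PARAM [FILE]
cmdline_has_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each space-separated token on its own line, then match it exactly.
    tr -s ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}
```

For example, cmdline_has_param intel_idle.max_cstate=1 && echo "parameter is active". On kernels that expose the driver's parameters in sysfs, /sys/module/intel_idle/parameters/max_cstate should also report the configured limit.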
