Issue
What is the Load Average value in Linux and how is it calculated?
Is my load high enough to worry about?
How to find the root cause of a high load?
Environment
- CloudLinux (all versions)
Resolution
This article's purpose is to explain how a load average value is calculated, how to make use of it, and how to determine what's making load average values grow.
What is the Load Average value in Linux and how is it calculated?
According to a kernel developer comment in loadavg.c in the Linux kernel source, the load average is: "an exponentially decaying average of nr_running + nr_uninterruptible".
Where:
- nr_running - the number of processes in the running state
- nr_uninterruptible - the number of processes in the uninterruptible sleep state
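The kernel samples these counters every few seconds and blends them into the previous value with an exponential decay over 1-, 5-, and 15-minute windows. The current values can be read directly from /proc/loadavg; the numbers below are illustrative:
cat /proc/loadavg
0.85 1.24 2.10 3/612 48291
The first three fields are the 1-, 5-, and 15-minute load averages, the fourth is the number of currently runnable scheduling entities over the total number of entities, and the last one is the PID of the most recently created process.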
In Linux, these process states are marked as follows:
- R - Process in the running state. A process in this state is either executing on a CPU or has all the data it needs to execute and is waiting for CPU time.
- D - Process in the uninterruptible sleep state. In simple terms, such a process is waiting for data to be delivered into memory so the CPU can work with it. It's safe to assume that these processes are waiting for input/output operations to complete before they can transition back to the running state.
In the screenshot below, you can see the output of the 'top' command-line utility with the state column ('S') highlighted. In this example, the 'dd' process is in the 'R' state, which means it is running and consuming CPU at that moment.
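The same information can be obtained without an interactive 'top' session. Below is a minimal 'ps' sketch that lists only the processes currently in the 'R' or 'D' state (the exact column set is just one reasonable choice):
ps -eo state,pid,user,wchan,cmd | grep '^[RD]'
The 'wchan' column is especially useful for 'D'-state processes: it shows the kernel function the process is sleeping in, which often hints at the device or subsystem it is waiting on.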
Is my load high enough to worry about?
What can be considered "high"? When you constantly see a load average higher than the number of cores on the server (for example, a load average above 8 on an 8-core system), most likely there is an issue on the server or it simply can't handle the load.
However, you shouldn't look at the load average value in isolation unless something running on the server is noticeably slow or the value is very high.
Overall, treat the load average as an indicator that helps with troubleshooting server or application slowness; the value by itself is not a problem. A server with a very high load average can still be responsive and work without issues. Still, for most workloads you'll want the load average to stay at or below the number of CPU cores most of the time.
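As a quick check, compare the number of cores reported by 'nproc' with the three load averages printed by 'uptime' (the same values 'top' shows). The sample output below reuses the figures from the 'top' example later in this article:
nproc
12
uptime
 09:42:28 up 10 min,  1 user,  load average: 62.00, 53.86, 28.90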
How to find the root cause of a high load?
Since we've already determined that only two types of processes affect the load average, the key to finding the root cause is determining which resource is lacking - CPU time or I/O - and then finding out what is consuming it.
Processes in the running state consume CPU time, so the higher the CPU usage, the higher the load will be.
Processes in the 'D' state are waiting for input/output, usually from a storage device. If the storage is too slow and processes spend a lot of time waiting for data to arrive, the load average will grow.
The 'top' utility is an excellent tool that can help you find what your server lacks - CPU, I/O, or both.
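Besides 'top', the 'vmstat' utility from the procps package summarizes both sides of the picture in one line per interval: the 'r' column is the number of runnable tasks, 'b' is the number of tasks blocked in uninterruptible sleep, and 'us'/'sy'/'id'/'wa' show where CPU time goes. A sample run printing one line per second for five seconds (the numbers are illustrative and roughly match the example below):
vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
98  2      0 7550156 102400 39649840    0    0    12    40 9100 15200 82 14  0  0  0
If 'wa' were high instead of 'us', the next step would be a disk-level view, for example 'iostat -x 1' from the sysstat package, which shows per-device utilization.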
Let's check an example of the 'top' command output below:
top - 09:42:28 up 10 min, 1 user, load average: 62.00, 53.86, 28.90
Tasks: 1185 total, 116 running, 1065 sleeping, 1 stopped, 3 zombie
%Cpu(s): 81.9 us, 13.6 sy, 3.6 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
KiB Mem : 65367024 total, 7550156 free, 18067028 used, 39749840 buff/cache
KiB Swap: 8285180 total, 8285180 free, 0 used. 36767016 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2983 mysql 20 0 2569424 667244 15544 S 216.7 1.0 24:28.86 /usr/sbin+
362876 mindcube 20 0 432536 50112 4536 R 87.5 0.1 0:00.56 lsphp:be/+
363002 printing 20 0 478240 46248 25780 S 87.5 0.1 0:00.39 lsphp:/ho+
351237 thegandh 20 0 588316 108748 66180 R 62.5 0.2 0:03.57 lsphp:/ho+
355710 sanskrit 20 0 1578340 711328 38276 R 62.5 1.1 0:07.46 lsphp:ans+
This output is from a system with 12 cores. The 1-minute load average (62) is roughly five times the number of CPU cores, so it can be considered high.
Out of the lines above, we need to understand the meaning of these two:
Tasks: 1185 total, 116 running, 1065 sleeping, 1 stopped, 3 zombie
%Cpu(s): 81.9 us, 13.6 sy, 3.6 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.8 si, 0.0 st
The Tasks line is simple: '116 running' means there are 116 tasks in the 'R' state, far more than 12 cores can execute at once. We can conclude that processes are piling up and being processed too slowly.
To understand why, let's examine the second line, which shows CPU utilization. It is divided into categories:
- us, user: percentage of CPU time running user processes (including root)
- sy, system: time spent running kernel code. This work is needed to keep the system alive and operational
- ni, nice: time running niced user processes. Niced processes are the ones that have custom priority set with the 'nice' utility.
- id, idle: time spent doing nothing
- wa, IO-wait: time waiting for I/O completion
- hi: time spent servicing hardware interrupts
- si: time spent servicing software interrupts
- st: time stolen from this VM by the hypervisor
In this case, we can see that idle time is zero and more than 80% of the CPU is occupied with user tasks, so the next step is to determine which processes are consuming those resources.
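'top' already sorts by CPU usage by default, but a one-shot, script-friendly alternative is 'ps' with an explicit sort key (the column list here is just one reasonable choice):
ps -eo pid,user,state,%cpu,%mem,time,cmd --sort=-%cpu | head -n 15
On CloudLinux servers, the 'lvetop' utility, if available, shows a similar breakdown per LVE/user, which is convenient for finding the tenant responsible for the load.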