This article explains how to do performance analysis on Linux.
Performance analysis and bottleneck determination in Linux is not rocket science. It requires some basic knowledge of the hardware and kernel architecture, and the use of some standard tools. Using a hands-on approach, we'll walk through the different subsystems and their key indicators to understand which component constitutes the current bottleneck of a system.
Sometimes, sysadmins know that a system needs to be upgraded, but they cannot pinpoint why, or which part needs to be upgraded. In many cases, some of the bottlenecks can be resolved just by tuning the system, without any hardware upgrade at all.
First of all, let's review the basics:
For performance analysis, there are three general types of critical resources: processing (execution), memory, and I/O (transmission).
Memory consists of real memory and virtual memory (real memory plus swapping space).
I/O can be split into two big categories: disk I/O and network I/O.
In a Unix system (as in almost any timesharing system), processes are in one of two states: running or sleeping. Processes that are sleeping are in that state either because they exhausted all their CPU time for the current round (i.e. the kernel moved them preemptively to sleeping mode until all other runnable processes complete their CPU timeslice as well), or because they are blocked (i.e. they are waiting for a resource that is currently unavailable, like disk I/O, network I/O, terminal I/O, etc.). If a sleeping process has been moved away from main memory to swapping space (paged out), it can be in a third mode, which is just "ready to run but waiting to be paged in".
The kernel normally keeps a list of all running processes and marks them according to their state. If a process is tagged as blocked because it is waiting for some resource, it won't be runnable again until that resource becomes available. If the process is just sleeping because it used all its CPU quota, it will remain in that state until all other runnable processes have had the opportunity to take their CPU share as well; after that, the process becomes runnable again and will be scheduled for execution based on its priority.
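These states can be observed directly with 'ps'. A minimal sketch (the state codes are standard: R is runnable, S is interruptible sleep, D is uninterruptible/blocked sleep; the counts shown are illustrative):

    # Count processes per state: D = blocked (uninterruptible), R = runnable, S = sleeping
    $ ps -eo state --no-headers | sort | uniq -c
         12 D
          3 R
        243 S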
Priorities define how much relative CPU time is assigned to a given process. Processes are assigned two values: a static nice value, a number between -20 and 19 (where lower means less nice, which translates to higher priority) that can be set by the user at runtime (using the 'nice' and 'renice' commands), and a real-time priority value between 0 and 99 (higher means more priority), which is a dynamic value that the scheduler assigns depending on many factors.
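As a quick illustration of the nice/renice interface (the tar command and the PID 12345 below are just placeholders):

    # Start a CPU-heavy job with a nice value of 10 (lower priority)
    $ nice -n 10 tar czf /tmp/backup.tar.gz /home
    # Lower the priority of an already running process (hypothetical PID)
    $ renice -n 15 -p 12345
    # Raising priority (negative nice values) requires root
    $ sudo renice -n -5 -p 12345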
Virtual memory is defined as the total amount of memory in the system, and it includes real memory (RAM) as well as swapping space. Real memory is split into four pools: free memory, cache (free memory that has been dynamically assigned to filesystem cache, and can be deallocated and reused as needed), I/O buffers (which include in-kernel network and I/O buffers) and used memory (memory allocated to processes).
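These pools can be inspected with the 'free' command; the numbers below are purely illustrative, and newer procps versions merge buffers and cache into a single buff/cache column:

    $ free -m
                  total        used        free      shared  buff/cache   available
    Mem:           7968        2410         512         180        5045        5288
    Swap:          2047         130        1917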
The Kernel Virtual Memory Manager can move processes out of real memory and into swap space if there is pressure for free memory in the system (and according to the current settings in /proc/sys/vm/). If a page residing in swap space needs to be executed, it is brought back into real memory. In the rare case of running out of usable virtual memory, the Out of Memory Killer (OOM Killer) is awakened, and it uses heuristics to identify processes to be killed in order to reclaim some free real memory for the system.
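A minimal sketch of inspecting and tuning those settings with sysctl (vm.swappiness is just one example knob; 60 is a common default, but your distribution may differ), and of checking the kernel log for OOM killer activity:

    # How aggressively the kernel swaps process memory out (0-100)
    $ sysctl vm.swappiness
    vm.swappiness = 60
    # Change it at runtime (add to /etc/sysctl.conf to make it persistent)
    $ sudo sysctl -w vm.swappiness=10
    # Look for traces of the OOM killer in the kernel log
    $ dmesg | grep -i "out of memory"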
Now let's turn to the tools we will use to identify which component is the culprit for the low performance of our system.
The best tool to analyze the state of the system as a whole is 'vmstat'. When executed with a numeric parameter as an interval in seconds (as in 'vmstat 5'), vmstat first shows averages since the last boot, and then refreshes with the deltas for each interval. A typical output looks like this:
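(The sample below is illustrative output from a lightly loaded machine, included only to show the column layout; depending on your procps version you may also see an additional st, steal time, column.)

    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     1  0      0 512340  98012 204520    0    0    12     8  150  320  5  2 92  1
     0  0      0 512128  98020 204560    0    0     0    24  180  410  3  1 95  1
     0  0      0 512128  98020 204580    0    0     0     0  160  380  2  1 97  0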
The top headers define which data you're looking at:
* Procs: number of processes in runnable ( r ) state and in uninterruptible (b - blocking) sleep
* Memory: amount of memory swapped (swpd), free, used for buffers (buff) and used for filesystem cache (cache)
* Swap: amount of memory swapped in (si) and swapped out (so) per second
* IO: blocks received from a block device (bi) and blocks sent to a block device (bo) per second
* System: number of interrupts per second (in) and number of context switches per second (cs)
* CPU: percentages of total CPU time spent in user space (us), kernel space (sy), idle (id) and waiting for I/O (wa)
Interpretation of these values is not too difficult. Let's go through a few real world examples.
If you have a system that consistently has a high number of processes in runnable state ( r ), it usually means that you need to either add CPUs or replace the current CPUs with faster ones, as processes are starving for CPU. It's important to note that the length of the queue of processes in runnable state should be judged in proportion to the number of CPUs in the system (i.e. 5 or 6 processes in runnable state on a quad-CPU system may be considered similar to having 1 or 2 processes in runnable state on a single-CPU system).
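A rough sanity check is to compare the r column against the number of online CPUs; the rule of thumb below is just that, not a hard limit:

    # Number of online CPUs
    $ nproc
    4
    # Sample the run queue every 5 seconds; sustained r values well above
    # the CPU count suggest processes are starving for CPU
    $ vmstat 5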
If you're looking for places to optimize, look into the CPU utilization numbers as well, as they will tell you whether the high utilization is in kernel or user space (the sy and us numbers). Running 'top' on such a system will also give you an idea of which processes the CPU spends more time on, thus pinpointing the primary candidate for optimization.
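A quick, non-interactive way to get the same ranking is sketched below; the exact column selection is just one possibility:

    # Top CPU consumers, sorted in descending order of CPU usage
    $ ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -5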
In the special case where there is more than one CPU in the system, the CPU section of vmstat shows the averages across all of them, so it may not be very accurate (i.e. when there is only one heavy, single-threaded process running). 'mpstat -P ALL 5' can be used to show the CPU statistics both aggregated and on a per-CPU basis.
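An illustrative fragment of 'mpstat -P ALL 5' output is shown below (timestamps and the exact column set depend on your sysstat version); note how a single saturated, single-threaded process shows up as one CPU near 100% %usr while the 'all' line still looks moderate:

    12:00:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal   %idle
    12:00:11     all   26.05    0.00    1.05    0.10    0.00    0.05    0.00   72.75
    12:00:11       0   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00
    12:00:11       1    2.00    0.00    1.20    0.20    0.00    0.10    0.00   96.50
    12:00:11       2    1.80    0.00    1.00    0.10    0.00    0.05    0.00   97.05
    12:00:11       3    1.40    0.00    1.00    0.10    0.00    0.05    0.00   97.45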