
Thursday, December 31, 2009

Why does top output for CPU utilization differ from mpstat on SMP systems

This can be reproduced by running the following commands:

# top
# mpstat

Notice: after you run the top command, press "1" to get a per-core breakdown. Without pressing "1", the Cpu(s) summary line shows utilization averaged across all cores, which corresponds to mpstat's "all" row; after pressing "1", top shows the actual CPU utilization of each core, matching the per-CPU rows of mpstat -P ALL.

Example:

Top command run without pressing "1"
Cpu(s): 9.5%us, 12.0%sy, 0.0%ni, 77.9%id, 0.4%wa, 0.0%hi, 0.0%si, 0.2%st

Top command run after pressing "1"
Cpu0 : 2.0%us, 8.0%sy, 0.0%ni, 77.1%id, 12.6%wa, 0.0%hi, 0.0%si, 0.3%st
Cpu1 : 3.3%us, 7.3%sy, 0.0%ni, 89.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Cpu2 : 18.2%us, 13.9%sy, 0.0%ni, 66.9%id, 0.0%wa, 0.0%hi, 0.0%si, 1.0%st
Cpu3 : 7.3%us, 12.6%sy, 0.0%ni, 80.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Output of mpstat command

[root@xen ~]# mpstat -P ALL 1
Linux 2.6.18-128.1.10.el5xen (xen.pnq.redhat.com) 05/23/2009

04:56:19 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
04:56:20 AM all 9.09 0.00 10.32 0.74 0.00 0.00 0.49 79.36 4929.70
04:56:20 AM 0 10.89 0.00 8.91 0.00 0.00 0.00 0.99 79.21 1289.11
04:56:20 AM 1 2.97 0.00 11.88 2.97 0.00 0.00 0.99 81.19 2083.17
04:56:20 AM 2 13.86 0.00 10.89 0.00 0.00 0.00 0.99 74.26 817.82
04:56:20 AM 3 8.49 0.00 10.38 0.00 0.00 0.00 0.00 81.13 727.72

Wednesday, December 30, 2009

Installing a Kernel RPM

# yum install kernel
# rpm -ivh kernel-

Installing a new kernel is one of the few things you will do on your system that requires a reboot of the system. It also requires a little more thought and caution, as it is quite simple to render a system temporarily inoperable if you are careless when updating the kernel. Unlike just about any other upgrade you might do, you should NOT upgrade the kernel using rpm -U or -F.


Recall how rpm functions when upgrading: it determines whether a version already exists on the system, and if so, whether the version to be installed is newer. If it is newer, rpm first installs the new version, thereby replacing old files -- except those marked as configuration files. It then removes the old package, but only deletes files that do not exist in the new version.


Because -U and -F remove the previous kernel version, if your newly-installed kernel proves unstable, you could be left with an unbootable system. You would have to resort to alternate boot media such as a boot floppy or the CD-ROM. When you run an install instead of an upgrade, the old version of the kernel is still available and can be selected from the bootloader. yum knows to perform an “install” rather than an “upgrade” when working with a kernel package, whether you use yum update or yum install.


Because all of the kernel files are version-specific (that is, they either include version details in their names or are stored in version-specific paths), it is possible to install multiple versions of the kernel package. In addition, kernel modules are version-specific, and installation of a new kernel will require re-installation of any add-on kernel modules (not contained in the base package). Missing kernel modules can leave the system unable to dynamically load device drivers or other functions.
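
Because multiple kernel packages can coexist, you can list what is currently installed with rpm; the version numbers shown here are only illustrative:

# rpm -q kernel
kernel-2.6.18-128.el5
kernel-2.6.18-164.el5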


By default, the new kernel is automatically added to GRUB and made the default. You can change this behavior by editing /etc/sysconfig/kernel.
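
On a Red Hat style system that file looks something like the following (the exact contents may differ on your machine); setting UPDATEDEFAULT=no keeps your current default boot entry when a new kernel is installed:

# cat /etc/sysconfig/kernel
UPDATEDEFAULT=yes
DEFAULTKERNEL=kernel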

Virtual Memory Behavior in Linux

Introductory Terms:

Virtual Memory: A system that combines physical memory with some secondary storage device to give the appearance that a computer system has more physical memory than is actually installed. A virtual memory system tries to efficiently allocate physical memory (RAM) among many competing demands, including: kernel code, kernel global data, dynamically allocated kernel memory, kernel caches (buffer, page, swap, and slab), application code, application stack space, static application memory, and application heap.

Page: Kernel memory management works on top of the computer system hardware
which manipulates memory in units called pages. Page size is determined solely by
the underlying hardware. The page size is 4096 bytes on IA32 hardware platforms.
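
You can confirm the page size on a running system with getconf:

# getconf PAGESIZE
4096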

Buffer Cache: The buffer cache holds filesystem metadata, such as the inode tables,
direct blocks, indirect blocks, journals, superblocks and filesystem bitmaps. Buffer
replacement is triggered by new buffer allocations which cause the VM to evict old
buffers from the system’s inactive list.

Page Cache: This is a read/write cache for files stored on a filesystem. It is reported as Cached when the file /proc/meminfo is consulted. It includes regular files opened for reading and writing, along with mmaped files, and pages of executables currently in memory. (In 2.4.18 and later kernels, the page cache also contains filesystem directories.) In addition, objects which appear in the file system space but have no associated backing store (such as /proc files, pipes and FIFOs) use memory in the page cache.

Swap Cache: This is a read/write cache for process data and stack pages that have
been written to the swap device. It is reported as SwapCached when the file
/proc/meminfo is consulted. The swap cache should be considered a virtual
cache, since there is no separate bookkeeping for it in the kernel. It is, however, a
convenient concept, and we shall refer to it in subsequent sections.

Active List: This is a collection of pages which the kernel has determined to be in
active use. The size of this list is reported as Active when the file /proc/meminfo
is consulted.

Inactive List: This set of pages resides in memory, but the kernel has determined
that they are not in active use. These pages are candidates for eviction should the
system come under memory pressure.
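
The Cached, SwapCached, Active and Inactive figures mentioned above can be read straight out of /proc/meminfo; for example (the numbers below are purely illustrative):

# grep -E 'Cached|Active|Inactive' /proc/meminfo
Cached:          812340 kB
SwapCached:         120 kB
Active:          643208 kB
Inactive:        401552 kB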

Out of Memory Killer (OOM): The OOM is an algorithm which is invoked when the
system senses a potentially fatal shortage of memory. The kernel attempts to kill
processes on a ’most good for least harm’ basis. This algorithm is only invoked
when the system is truly out of memory and swap, and more memory is needed
by the kernel to avoid a deadlock situation.
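
On 2.6 kernels you can get a feel for how the OOM killer ranks a process by looking at its oom_score, and bias (or disable) the selection through oom_adj; the PID and score below are purely illustrative:

# cat /proc/1234/oom_score
152
# echo -17 > /proc/1234/oom_adj

A value of -17 in oom_adj exempts that process from OOM killing.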

VM Killer: This algorithm is invoked after a critical shortage of memory has been detected. It indiscriminately kills the process whose request for memory was immediate and critical to the process, and where the request was impossible to satisfy. In 2.4.9 and later kernels, this algorithm should never trigger. Such a trigger is considered a kernel bug. (In experimental kernels, the most commonly seen causes of the VM killer invocation are poorly written filesystems and pathological VM implementations.)

Proc file system

Read-only files:
/proc/<pid>/ Information on running processes (ps, top)
/proc/cmdline Boot time options
/proc/cpuinfo processor information
/proc/mdstat software RAID information (mdadm)
/proc/meminfo system memory usage (free, vmstat)
/proc/swaps swap areas currently in use (swapon -s)
/proc/modules dynamically loaded modules (lsmod)
/proc/mounts mounted filesystems (mount)
/proc/net network activity and configuration (ifconfig, netstat)
/proc/partitions block devices known to the kernel
/proc/version version of the Linux kernel (uname)
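
For example, several of these files can simply be read with cat or grep:

# grep "model name" /proc/cpuinfo
# cat /proc/partitions
# cat /proc/version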

Read-Write entries:
/proc/sys/kernel/hostname System hostname
/proc/sys/net/ipv4/ip_forward IP Forwarding (on or off)
/proc/sys/vm/drop_caches Writing a 1 forces the kernel to free up some memory from caches.
/proc/sys/vm/swappiness Indicates how aggressively memory will be swapped out to the swap device (number between 0 and 100).
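
The read-write entries can be inspected with cat and changed with echo; changes made this way last only until the next reboot (the value 60 shown below is the typical swappiness default):

# cat /proc/sys/vm/swappiness
60
# echo 10 > /proc/sys/vm/swappiness
# echo 1 > /proc/sys/vm/drop_caches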

Kernel Initialization

Kernel boot time functions
• Device detection
• Device driver initialization
• Mounts root filesystem read only
• Loads initial process (init)

The kernel initialization files generate good output, but scroll by quickly. A good way to examine this output is to view /var/log/dmesg, which contains a snapshot of these kernel messages taken just after control is passed to init. Review of this output will reveal the basic initialization steps of the Linux kernel.

Device drivers compiled into the kernel are called, and will attempt to locate their corresponding devices. If successful in locating the device, the driver will initialize and usually log output to the kernel message buffer.
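
For example, to see what a particular driver reported while probing for its hardware, you can grep the saved boot messages; the driver name here is just an example:

# grep -i e1000 /var/log/dmesg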

If drivers that are necessary for boot have been compiled as modules instead of into the kernel, then they must be included in an initrd image, which the kernel temporarily mounts on a RAM disk to make the modules available for the initialization process.
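
On Red Hat based systems the initrd image is normally generated automatically when a kernel package is installed, but it can also be rebuilt by hand with mkinitrd; a typical invocation (-f overwrites an existing image) is:

# mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)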

After all the necessary drivers are loaded, the kernel will mount the root filesystem read-only.

The first process is then loaded (init) and control is passed from the kernel to that process.

What is sysctl, and what is its relation to /proc/sys?

Introduction

Sysctl and the /proc/sys filesystem give the user the ability to tune parameters on the actively running kernel.

Definition

Sysctl is a mechanism for configuring kernel parameters at run time. At boot, the system reads the /etc/sysctl.conf file. Any parameter settings that are in the file become active once the file is read. The /proc/sys filesystem contains all of the actual parameters. These parameters can be changed at any time.

Examples

The kernel has a parameter that controls the TCP keepalive interval. This parameter exists in:
/proc/sys/net/ipv4/tcp_keepalive_intvl

If you desire to look at the current value, you can do the following:
[root@station1 ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75

If you wish to set this value to 100, you can do the following:
[root@station1 ~]# echo 100 > /proc/sys/net/ipv4/tcp_keepalive_intvl
[root@station1 ~]# cat /proc/sys/net/ipv4/tcp_keepalive_intvl
100
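
The same value can also be read through the sysctl command itself:

[root@station1 ~]# sysctl net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_intvl = 100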

When you reboot, the value that you changed will be lost. In order to have this change made persistent, add the following line to /etc/sysctl.conf:
net.ipv4.tcp_keepalive_intvl=100

Alternatively, the running value can be changed with sysctl directly (like echo, this does not survive a reboot):
# sysctl -w net.ipv4.tcp_keepalive_intvl=100

Once you have saved the file, apply its settings with:
[root@station1 ~]# sysctl -p
net.ipv4.ip_forward = 0
kernel.sysrq = 0
net.ipv4.tcp_keepalive_intvl = 100

You should see the line you added (the last line above) somewhere in your own output.

To search for kernel parameters related to pages, you can run the following:
# sysctl -a | grep page
vm.flush_mmap_pages = 1
vm.pagecache = 100
vm.percpu_pagelist_fraction = 0
vm.nr_hugepages = 0
vm.page-cluster = 3
fs.nfs.fscache_from_pages = 0
fs.nfs.fscache_to_pages = 0
fs.nfs.fscache_uncache_page = 0

Why could my kernel panic?

A kernel panic is the most serious form of computer crash. It occurs when the OS kernel detects an internal inconsistency or error that cannot be automatically resolved. Since there is no way for the kernel to reliably request human intervention in these matters, it `panics' and immediately interrupts the normal operation of the computer. What happens next depends on the kernel. Linux throws debugging information onto the screen and waits for a reboot. BeOS and NetBSD dump the user into the kernel debugger, which for the average user is equivalent to the Linux behavior. Windows NT displays the dreaded Blue Screen of Death. The only thing all kernel panics have in common is that the only way to restore normal functionality is to reboot. The actual term `kernel panic' is normally only heard in the context of Unix and Unix-like systems; other systems have other terms for what is essentially the same thing.


There are several ways a kernel panic may be caused. One of the most common is bad hardware. If kernel memory is corrupted by a hardware fault, the kernel will likely panic. Kernel bugs can also cause panics, but bugs that are severe enough to cause panics should not be found in stable kernel versions, and even in beta kernels they should be rare. Panics can also occur during the boot sequence, if the conditions for a successful boot are not met. Under *nix systems the most common of these cases is if the root filesystem is for some reason not mountable. In particular, if the root filesystem is remote and the connection times out, the kernel will panic. Also, if init is not found on the root filesystem, the kernel will panic.


The Linux kernel defines the mechanism for a panic as a function panic() in kernel/panic.c. 416 other source files in 2.4.19 then contain calls to panic(). Here's why kernel subsystems panic:
* Failure of a memory allocation or structure creation which should always succeed
* Unrecoverable filesystem errors. Some filesystems can have their error recovery mechanisms set to panic when a normally non-fatal but possibly serious error occurs (see the example after this list).
* Task exit during an interrupt handler (the infamous "Aiee, killing interrupt handler!"; see rescdsk's writeup there for more info)
* Complete memory exhaustion (different from above, though related)
* Locking failure
* Missing hardware features or serious hardware exceptions
* Interrupt glitches, including sleeping interrupt handlers
* Premature and unexpected destruction of kernel structures
* Failure to load essential drivers
* SMP concurrency errors
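
As an example of the filesystem case mentioned above, an ext3 filesystem can be told to panic the kernel when it hits an error, rather than continuing or remounting read-only (the device name is only an example):

# tune2fs -e panic /dev/sda1

Relatedly, /proc/sys/kernel/panic controls how many seconds the kernel waits after a panic before rebooting itself; 0 (the default) means it waits forever.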

Load Average

In Linux, apart from currently running processes, processes that are waiting (runnable processes in the run queue, and processes in uninterruptible sleep, e.g. waiting on disk I/O) also contribute to the load-average calculation.
If at any time the number of waiting processes increases, the load average will also increase.
Thus, a high load average does not always indicate an issue.
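
The figures themselves are the 1, 5, and 15 minute averages reported by uptime, which come from /proc/loadavg (the numbers below are purely illustrative):

# cat /proc/loadavg
2.15 1.80 1.53 3/312 20485
# uptime
 14:02:11 up 12 days,  3:41,  2 users,  load average: 2.15, 1.80, 1.53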