
Thursday, May 27, 2010

How Many Open Files?

We knew the answer to this question once, back when the world was young and full of truth. Without hesitation, we'd have spouted "Just take the output of lsof | wc -l!" And it's true enough, in a general sort of way. But if you asked me the same question now, my answer would be: "Why do you want to know?"

Are you, for instance, attempting to determine whether the number of open files is exceeding the limit set in the kernel? Then the output of lsof will be practically useless.


On *nix systems, there is a limit set in the kernel on how many open file descriptors are allowed on the system. This may be compiled in, or it may be tunable on the fly. In Linux, the value of this parameter can be read from and written to the proc filesystem.
[root@srv-4 proc]# cat /proc/sys/fs/file-max
52427

On this system 52,427 open file descriptors are permitted. We are unlikely to run out. If we wanted to change it, we'd do something like this:
[root@srv-4 proc]# echo "104854" > /proc/sys/fs/file-max
[root@srv-4 proc]# cat /proc/sys/fs/file-max
104854

(Warning: change kernel parameters on the fly only with great caution and good cause.) The same change can be made persistent by setting the fs.file-max parameter in /etc/sysctl.conf.

But how do we know how many file descriptors are being used?
[root@srv-4 proc]# cat /proc/sys/fs/file-nr
3391 969 52427

The three columns are:
  3391  - total allocated file descriptors (the number of file descriptors allocated since boot)
  969   - total free allocated file descriptors
  52427 - maximum open file descriptors

The number of open file descriptors is column 1 minus column 2; 2422 in this case. (Note: we have read contradictory definitions of the second column in newsgroups. Some people say it is the number of used allocated file descriptors - just the opposite of what we've stated here. Luckily, we can prove that the second number is free descriptors. Just launch an xterm or any new process and watch the second number go down.)
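That subtraction is easy to script. A minimal sketch, assuming the three-column file-nr format shown above:

#!/bin/bash
# Report allocated, free and in-use file descriptors from /proc/sys/fs/file-nr
read alloc free max < /proc/sys/fs/file-nr
echo "allocated: $alloc   free: $free   max: $max"
echo "in use:    $((alloc - free))"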

What about the method we mentioned earlier, running lsof to get the number of open files?
[root@srv-4 proc]# lsof | wc -l
4140

Oh my. There is quite a discrepancy here. The lsof utility reports 4140 open files, while /proc/sys/fs/file-nr says, if our math is correct, 2422. So how many open files are there?

What is an open file?

Is an open file a file that is being used, or is it an open file descriptor? A file descriptor is a data structure used by a program to get a handle on a file, the best known being 0, 1, and 2 for standard input, standard output, and standard error. The file-max kernel parameter refers to open file descriptors, and file-nr gives us the current number of open file descriptors. But lsof lists all open files, including files which are not using file descriptors - such as current working directories, memory-mapped library files, and executable text files. To illustrate, let's examine the difference between the output of lsof for a given pid and the file descriptors listed for that pid in /proc.

Pick a PID, any PID

Let's look at this process:
usr-4 2381 0.0 0.5 5168 2748 pts/14 S 14:42 0:01 vim openfiles.html

[root@srv-4 usr-4]# lsof | grep 2381
vim 2381 usr-4 cwd DIR 3,8 4096 2621517 /n
vim 2381 usr-4 rtd DIR 3,5 4096 2 /
vim 2381 usr-4 txt REG 3,2 2160520 34239 /usr/bin/vim
vim 2381 usr-4 mem REG 3,5 85420 144496 /lib/ld-2.2.5.so
vim 2381 usr-4 mem REG 3,2 371 20974 /usr/lib/locale/LC_IDENTIFICATION
vim 2381 usr-4 mem REG 3,2 20666 192622 /usr/lib/gconv/gconv-modules.cache
vim 2381 usr-4 mem REG 3,2 29 20975 /usr/lib/locale/LC_MEASUREMENT
vim 2381 usr-4 mem REG 3,2 65 20979 /usr/lib/locale/LC_TELEPHONE
vim 2381 usr-4 mem REG 3,2 161 19742 /usr/lib/locale/LC_ADDRESS
vim 2381 usr-4 mem REG 3,2 83 20977 /usr/lib/locale/LC_NAME
vim 2381 usr-4 mem REG 3,2 40 20978 /usr/lib/locale/LC_PAPER
vim 2381 usr-4 mem REG 3,2 58 51851 /usr/lib/locale/LC_MESSAGES/SYSL
vim 2381 usr-4 mem REG 3,2 292 20976 /usr/lib/locale/LC_MONETARY
vim 2381 usr-4 mem REG 3,2 22592 99819 /usr/lib/locale/LC_COLLATE
vim 2381 usr-4 mem REG 3,2 2457 20980 /usr/lib/locale/LC_TIME
vim 2381 usr-4 mem REG 3,2 60 35062 /usr/lib/locale/LC_NUMERIC
vim 2381 usr-4 mem REG 3,2 290511 64237 /usr/lib/libncurses.so.5.2
vim 2381 usr-4 mem REG 3,2 24565 64273 /usr/lib/libgpm.so.1.18.0
vim 2381 usr-4 mem REG 3,5 11728 144511 /lib/libdl-2.2.5.so
vim 2381 usr-4 mem REG 3,5 22645 144299 /lib/libcrypt-2.2.5.so
vim 2381 usr-4 mem REG 3,5 10982 144339 /lib/libutil-2.2.5.so
vim 2381 usr-4 mem REG 3,5 105945 144516 /lib/libpthread-0.9.so
vim 2381 usr-4 mem REG 3,5 169581 144512 /lib/libm-2.2.5.so
vim 2381 usr-4 mem REG 3,5 1344152 144297 /lib/libc-2.2.5.so
vim 2381 usr-4 mem REG 3,2 173680 112269 /usr/lib/locale/LC_CTYPE
vim 2381 usr-4 mem REG 3,5 42897 144321 /lib/libnss_files-2.2.5.so
vim 2381 usr-4 0u CHR 136,14 16 /dev/pts/14
vim 2381 usr-4 1u CHR 136,14 16 /dev/pts/14
vim 2381 usr-4 2u CHR 136,14 16 /dev/pts/14
vim 2381 usr-4 4u REG 3,8 12288 2621444 /n/.openfiles.html.swp
[root@srv-4 usr-4]# lsof | grep 2381 | wc -l
30
[root@srv-4 fd]# ls -l /proc/2381/fd/
total 0
lrwx------ 1 usr-4 usr-4 64 Jul 30 15:16 0 -> /dev/pts/14
lrwx------ 1 usr-4 usr-4 64 Jul 30 15:16 1 -> /dev/pts/14
lrwx------ 1 usr-4 usr-4 64 Jul 30 15:16 2 -> /dev/pts/14
lrwx------ 1 usr-4 usr-4 64 Jul 30 15:16 4 -> /n/.openfiles.html.swp

Quite a difference. This process has only four open file descriptors, but there are thirty open files associated with it. Some of the open files which are not using file descriptors: library files, the program itself (executable text), and so on as listed above. These files are accounted for elsewhere in the kernel data structures (cat /proc/PID/maps to see the libraries, for instance), but they are not using file descriptors and therefore do not exhaust the kernel's file descriptor maximum.
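To repeat this comparison for any process, a small sketch like the following works (lsof -p restricts the output to one PID, and tail strips lsof's header line):

#!/bin/bash
# Compare lsof's view of a process with its actual open file descriptors
pid=$1
echo "lsof entries:     $(lsof -p "$pid" | tail -n +2 | wc -l)"
echo "file descriptors: $(ls /proc/"$pid"/fd | wc -l)"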

Linux Tune Network Stack (Buffers Size) To Increase Networking Performance

I have two servers located in two different data centers. Both servers handle a lot of concurrent large file transfers, but network performance is very poor for large files and degrades as the files get larger. How do I tune TCP under Linux to solve this problem?

By default, the Linux network stack is not configured for high-speed, large file transfers across WAN links; this is done to save memory resources. You can easily tune the Linux network stack by increasing the network buffer sizes for high-speed networks, so that the servers they connect can keep more data in flight.

The default maximum Linux TCP buffer sizes are way too small. TCP memory is calculated automatically based on system memory; you can find the actual values by typing the following commands:
$ cat /proc/sys/net/ipv4/tcp_mem

The default and maximum amount for the receive socket memory:
$ cat /proc/sys/net/core/rmem_default
$ cat /proc/sys/net/core/rmem_max

The default and maximum amount for the send socket memory:
$ cat /proc/sys/net/core/wmem_default
$ cat /proc/sys/net/core/wmem_max

The maximum amount of option memory buffers:
$ cat /proc/sys/net/core/optmem_max


Tune values

Set the max OS send buffer size (wmem) and receive buffer size (rmem) to 12 MB for queues on all protocols. In other words, set the amount of memory that is allocated for each TCP socket when it is opened or created while transferring files:

WARNING! The default value of rmem_max and wmem_max is about 128 KB in most Linux distributions, which may be enough for a low-latency, general-purpose network environment or for apps such as a DNS or web server. However, if the latency is large, the default size might be too small. Please note that the following settings are going to increase memory usage on your server.
# echo 'net.core.wmem_max=12582912' >> /etc/sysctl.conf
# echo 'net.core.rmem_max=12582912' >> /etc/sysctl.conf

You also need to set minimum size, initial size, and maximum size in bytes:
# echo 'net.ipv4.tcp_rmem= 10240 87380 12582912' >> /etc/sysctl.conf
# echo 'net.ipv4.tcp_wmem= 10240 87380 12582912' >> /etc/sysctl.conf

Turn on window scaling which can be an option to enlarge the transfer window:
# echo 'net.ipv4.tcp_window_scaling = 1' >> /etc/sysctl.conf

Enable timestamps as defined in RFC 1323:
# echo 'net.ipv4.tcp_timestamps = 1' >> /etc/sysctl.conf

By default, TCP saves various connection metrics in the route cache when the connection closes, so that connections established in the near future can use these to set initial conditions. Usually, this increases overall performance, but may sometimes cause performance degradation. If set, TCP will not cache metrics on closing connections.
# echo 'net.ipv4.tcp_no_metrics_save = 1' >> /etc/sysctl.conf

Set the maximum number of packets queued on the INPUT side when the interface receives packets faster than the kernel can process them:
# echo 'net.core.netdev_max_backlog = 5000' >> /etc/sysctl.conf

Now reload the changes:
# sysctl -p

Use tcpdump to view changes for eth0:
# tcpdump -ni eth0
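To confirm that the new buffer sizes actually took effect, you can also read the values back with sysctl (adjust the parameter list to whatever you changed):

# sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem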

How do I set up remote logging?

Introduction

Remote logging allows you to send your logs from one machine to another. This enables you to keep a backup of your logs. You could also have all of your machines log to the same place, which gives you a single location to check your logs.

In this lab, we will set up a machine as the logging server, and then we will direct our client to log to the log server.


1) For the log server, (we’ll use 192.168.0.1 as the example), we need to enable remote logging. Log on to the log server and edit /etc/sysconfig/syslog. Make sure that SYSLOGD_OPTIONS looks like:
SYSLOGD_OPTIONS="-r -m 0"

2) Once you have made the above change, restart syslogd on the log server:
# service syslog restart

3) On the client, edit /etc/syslog.conf. For any messages that you want to log to the log server, you will need to change the entries. For example, I have:
*.info;mail.none;authpriv.none;cron.none /var/log/messages

If I want these messages to log to the log server, I would change this line to:
*.info;mail.none;authpriv.none;cron.none @192.168.0.1

Replace 192.168.0.1 with whatever your log server's IP address is.

4) Let’s also add a line to log user.* to our log server:
user.* @192.168.0.1

Restart syslogd on the client:
# service syslog restart

5) If you would like to test this setup, you can do the following:
logger -i -t user1 "Testing Logging"

At this point, you should see messages from your client in your server’s /var/log/messages.
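If the test message never shows up, check that syslogd on the server is actually listening on the network; with the -r option it binds UDP port 514. A quick check, assuming netstat is available:

# netstat -ulnp | grep 514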

How do I set up a Cron Job?

Introduction

In these two examples, we will set up cron jobs to output text to files.


Procedure 1

In this procedure, we will set up a cron job for the user joe that runs at 20 minutes past the hour every Wednesday in January.


1) To edit the crontab for joe, we first need to be logged in as joe. Once we are logged in as joe, we need to type the following:
$ crontab -e

2) Now we need to type in the following cron job (notice our use of full paths - this is highly recommended with cron jobs):
20 * * 1 3 /bin/echo $(/bin/date) >> /home/joe/crondates

3) Save the file and exit the editor; that will set the cron job as active.


Procedure 2

In this procedure, we will set up a cron job that runs as root every hour and sends the date to /var/log/messages. This job will be set up in the /etc/cron.* structure as opposed to being set up through crontab.


1) As root, create the script /etc/cron.hourly/printdate:
#!/bin/bash
# A cron job to send the date to /var/log/messages
/bin/echo $(/bin/date) >> /var/log/messages

2) Set the script as executable:
chmod +x /etc/cron.hourly/printdate

3) At this point, the cron job is set.
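To double-check either job, a couple of quick commands (paths are the ones used above):

# crontab -l -u joe              # list joe's cron jobs
# ls -l /etc/cron.hourly/        # confirm printdate exists and is executable
# tail /home/joe/crondates       # watch for new entries at 20 past the hour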

Basic Command-line Date Tools

The date command can be used as follows to display the time and date:
$ date
Fri Mar 28 16:01:50 CST 2003


To see UTC/GMT, you can do this:
$ date --utc
Fri Mar 28 08:04:32 UTC 2003


The date command also can be used to set the time and date. To set the time manually, do this:
# date -s "16:15:00"
Fri Mar 28 16:15:00 CST 2003

If you also need to adjust the date, and not just the time, you can do it like this:
# date -s "16:55:30 July 7, 1986"
Mon Jul 7 16:55:30 PDT 1986

There is also another way to set the date and time, which is not very pretty:
# date 033121422003.55
Mon Mar 31 21:42:55 PST 2003

The above command does not use the -s option, and the fields are arranged like this: MMDDhhmmCCYY.ss where MM = month, DD = day, hh = hour, mm = minute, CCYY = 4 digit year, and ss = seconds.

Please note that setting the clock with the date command must be done as root. This is a "savage" way to adjust the time. It adjusts the Linux kernel system time.

There is also a hardware clock (CMOS clock). You can look at the current hardware clock time with:
hwclock --show

I always keep my hardware clocks set to UTC/GMT. This maintains my clocks uniformly without any worries about "Daylight Savings Time". This is important, because when you set the hardware clock from the system clock (kept by the Linux kernel), you need to know if this is the case. To set the hardware clock from the system clock, leaving the hardware clock in UTC, enter the following:
# hwclock --systohc --utc
# hwclock --show
Fri 28 Mar 2003 04:23:52 PM CST -0.864036 seconds


Another interesting item is that the Linux system clock stores time in seconds since midnight on January 1st, 1970 (UTC). This is called UNIX time. Unfortunately, because this is a 32-bit value, there is a year-2038 problem. Hopefully, everyone will have moved to 64-bit architectures by then. In order to see the UNIX time, you can use the following command:
# date +%s
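GNU date can also convert in the other direction: given a count of seconds since the epoch, the -d flag with an @ prefix prints the human-readable time. For example:

$ date -d @1234567890
$ date -u -d @1234567890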

There are many useful formatting options for the date command. See the date manpage for details.

Of course, there is another useful tool available related to date and time: cal


$ cal -3
    February 2003         March 2003            April 2003
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
                   1                     1         1  2  3  4  5
 2  3  4  5  6  7  8   2  3  4  5  6  7  8   6  7  8  9 10 11 12
 9 10 11 12 13 14 15   9 10 11 12 13 14 15  13 14 15 16 17 18 19
16 17 18 19 20 21 22  16 17 18 19 20 21 22  20 21 22 23 24 25 26
23 24 25 26 27 28     23 24 25 26 27 28 29  27 28 29 30
                      30 31

You can also specify "cal -y" for the entire year, "cal" by itself for the current month, or "cal 12 2005" to see the calendar for December, 2005.


Time Zone Configuration:

Background - The Earth is divided into time zones that are 15 degrees of longitude each, for this corresponds to the amount of angular distance the Sun appears to travel in 1 hour. 0 degrees longitude runs through the Royal Observatory in Greenwich, England. This is the origin of Greenwich Mean Time, or GMT. For all practical purposes, GMT and UTC are the same. To complicate matters, some countries observe Daylight Savings Time (DST), while others do not. Even within some countries, some states or districts do not observe DST while the rest of the country does! DST can also begin and end on different days in different countries! What a mess...

There are several files and directories that are used for time zones, and several tools:
/etc/sysconfig/clock - a short text file that defines the timezone, whether or not the hardware clock is using UTC, and an ARC option that is only relevant to DEC systems.
/etc/localtime - a symbolic link to the appropriate time zone file in /usr/share/zoneinfo.
/usr/share/zoneinfo - this directory contains the time zone files that were compiled by zic. These are binary files and cannot be viewed with a text viewer. The files contain information such as rules about DST. They allow the kernel to convert UTC UNIX time into appropriate local dates and times.
/etc/rc.d/rc.sysinit - this script runs once, at boot time. A section of this script sets the system time from the hardware clock and applies the local time zone information.
/etc/init.d/halt - this script runs during system shutdown. A section of this script synchronizes the hardware clock from the system clock.
/etc/adjtime - this file is used by the adjtimex function, which can smoothly adjust system time while the system runs. settimeofday is a related function.


redhat-config-date or dateconfig - These commands start the Red Hat date/time/time zone configuration GUI. Both commands failed to change the timezone in two different stock Red Hat 8.0 systems. They also failed to create a working ntp.conf file for the NTP server. The timezone problem went away after upgrading from the installed RPM, redhat-config-date-1.5.2-10, to a newer RPM from a Red Hat beta release, redhat-config-date-1.5.9-6.
zic - (The time zone compiler) Zic creates the time conversion information files.
zdump - This utility prints the current time and date in the specified time zone. Example:


# zdump Japan
Japan Sat Mar 29 00:47:57 2003 JST

# zdump Iceland
Iceland Fri Mar 28 15:48:02 2003 GMT

# zdump /usr/share/zoneinfo/Asia/Calcutta
/usr/share/zoneinfo/Asia/Calcutta Thu May 21 13:27:25 2009 IST

In order to manually change the timezone, you can edit the /etc/sysconfig/clock file and then make a new soft link to /etc/localtime. Here is an example of changing the timezone manually to "America/Denver":

1. Select the appropriate time zone from the /usr/share/zoneinfo directory. Time zone names are relative to that directory. In this case, we will select "America/Denver"

2. Edit the /etc/sysconfig/clock text file so that it looks like this:
ZONE="America/Denver"
UTC=true
ARC=false

Of course, this assumes that your hardware clock is running UTC time...

3. Delete the following file: /etc/localtime

4. Create a new soft link for /etc/localtime. Here is an example of step 3 and step 4:
# cd /etc
# ls -al localtime
lrwxrwxrwx 1 root root 39 Mar 28 07:00 localtime -> /usr/share/zoneinfo/America/Los_Angeles
# rm /etc/localtime
# ln -s /usr/share/zoneinfo/America/Denver /etc/localtime
# ls -al localtime
lrwxrwxrwx 1 root root 34 Mar 28 08:59 localtime -> /usr/share/zoneinfo/America/Denver
# date
Fri Mar 28 09:00:04 MST 2003

The chroot command

In each of the boot problems, you will need to boot from some kind of rescue media, then work at the command line to repair the damage. If you boot from the Red Hat installation CD in rescue mode, you will need to change the root directory so that the various system directories and filesystems are in the correct locations:
chroot /mnt/sysimage

The chroot command is extremely useful for both system security and for system repair. Its basic syntax is:
chroot new-root-dir [command ...]

and its purpose is to run the specified command with the root directory changed to new-root-dir. If no command is specified, the default behaviour is to run an interactive shell (usually a bash shell). For example, the command:
chroot /var/ftp

will run a command shell in /var/ftp. However, note that the behaviour is to change the root directory first, and then try to invoke the command or shell, so that there had better be a file /var/ftp/bin/bash (which there would be, on many systems). In addition, the command will usually need to be statically linked, as otherwise it would attempt to load libraries from /lib, which is now /var/ftp/lib.

The chroot command is often used to start network daemons on servers - this is so that if an attacker manages to compromise the daemon, perhaps through a buffer overflow, he is unable to navigate around the entire system directory tree, but is instead constrained within a 'chroot jail'.

A major use of the chroot command is to change the root directory of the system after booting from a repair floppy or CD. For example, if you boot a Red Hat installation CD with the command 'linux rescue', the root file system is actually a RAM disk, and the root filesystem on your hard drive is mounted as /mnt/sysimage. Commands you give will load programs from /bin and /sbin on the RAM disk, which is obviously limited. To get access to those directories on the hard drive, you will need to change your root directory with the command
chroot /mnt/sysimage
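If you boot from some other rescue medium that does not set up /mnt/sysimage for you, the same effect can be achieved by hand. A rough sketch, assuming the root filesystem lives on /dev/sda2 (adjust the device to suit your system):

# mkdir -p /mnt/sysimage
# mount /dev/sda2 /mnt/sysimage
# mount --bind /proc /mnt/sysimage/proc
# mount --bind /dev  /mnt/sysimage/dev
# mount --bind /sys  /mnt/sysimage/sys
# chroot /mnt/sysimage /bin/bash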

How do I install GRUB to the MBR?

Introduction

This exercise will show you how to install GRUB to the MBR assuming that the GRUB software is installed on your computer. If GRUB is not on your computer for some reason (rpm -q grub), then you will need to install the appropriate RPM for GRUB. However, be aware that GRUB is included in the default RHEL install. If you need to install the GRUB rpm, this could be a sign of bigger problems.


Procedure

1) Our boot partition is /dev/sda1. We want to install GRUB to the master boot record of /dev/sda. This is because our BIOS passes control to /dev/sda once it has done its work. In order to do this, we need to first determine the GRUB device name for /dev/sda by looking in /boot/grub/device.map:
(fd0) /dev/fd0
(hd0) /dev/sda
(hd1) /dev/sdb

We can tell that our GRUB device is (hd0).

2) We now need to enter the GRUB shell. To do this, we simply type grub while logged in as root:
[root@station1 ~]# grub
Probing devices to guess BIOS drives. This may take a long time.

3) After we are presented with a GRUB prompt (grub> ), we will need to specify where the boot partition is with the following command:
grub> root (hd0,0)

Notice how we used (hd0,0). In GRUB language, this identifies /dev/sda1.

4) We now need to tell GRUB which disk’s master boot record it needs to install to. This is done by typing the following command:
grub> setup (hd0)

This says that we want GRUB to install to the master boot record of /dev/sda.

5) To get out of the GRUB shell, we simply type:
grub> quit

6) At this point, GRUB is now installed on /dev/sda and knows that the boot partition is /dev/sda1. It is safe to reboot.
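As an aside, the grub package also ships a grub-install script that wraps these steps into a single command; on most systems the following should be equivalent, provided /boot/grub/device.map is correct:

# grub-install /dev/sda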

Common boot problems

Can't Boot?

Watch the system closely as it boots, and take note of any error messages that appear. If the system complains that it is unable to mount the root filesystem, for example, this can be for any of several reasons:


* The BIOS cannot find the boot loader. This sometimes happens after you've installed Linux to dual-boot with Windows, but - out of concern
to not misconfigure the system - have asked the install program to place the boot loader in the Linux root (or /boot) filesystem.
The problem is that the BIOS can't see it there, unless you make that the active partition. The simplest fix is to reinstall
Linux and this time, let it place the LILO or GRUB boot loader into the Master Boot Record - don't worry, the Linux boot loaders are
automatically set up to let you choose Linux or Windows at boot time. It is possible to perform a more complex fix, for example by
copying the Linux boot loader sector into a file, and setting up the Windows NT/2K/XP boot loader to chain to it - but that is too
complex to describe here
(see http://www.lesbell.com.au/Home.nsf/web/Using+the+NT+Boot+Loader+to+Boot+Linux?OpenDocument where you'll find a longer article
describing how to use the NT boot loader to boot Linux).
* The kernel doesn't have a device driver to access the hard drive (e.g. a SCSI drive). Fix this by using the mkinitrd script to build a new
initrd file that contains the correct drivers, or recompile the kernel to include the driver code. This usually happens because you've
built a new kernel and slightly messed up the configuration.
* The kernel doesn't have a filesystem driver to access the root partition. For example, if the root filesystem is formatted with ext3,
then you will need the ext3 and jbd modules in the initrd or compiled into the kernel. Fix as for the previous problem. Again, this
usually happens after building a new kernel.
* The partition table has been modified, for example, by the installation of another operating system. In this case, edit the kernel
command line (in /etc/lilo.conf or /boot/grub/menu.lst) and the contents of /etc/fstab to contain the correct entries.
* Filesystems are corrupted, due to a power failure or system crash. Generally, after a system crash or power outage (what? No UPS?),
the system will come up and repair itself. If you are using a journalling filesystem like ext3fs, jfs, xfs or reiserfs, it will usually
perform a roll-forward recovery from its journal file and carry on. Even with the older ext2fs, the system usually runs an fsck
(file system check) on the various file systems and repairs them automatically. However, just occasionally manual intervention is
required; you might have to answer 'Y' to a string of questions (answering 'N' will get you nowhere unless you intend to
perform really low-level repairs yourself in a last-ditch attempt to avoid data loss). In the worst case, you might have to reboot from
rescue media and manually run the e2fsck (or similar) command against each filesystem in turn. For example:

# e2fsck -p /dev/hda7

If the program complains that the superblock - the master block that links to everything else - is corrupted, it is useful to
remember that the superblock is so critical that it is duplicated every 8192 blocks through the filesystem and you can tell e2fsck to
use one of the backups:

# e2fsck -b 8193 /dev/hda7
* One or more filesystems cannot be found and mounted: Check the contents of /etc/fstab - in making quick alterations here, typographical
errors are common. You can use the e2label command to view the label of each filesystem: some distributions set these to the mount point
so you can figure out what is what.
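For example, to view the label on a single ext2/ext3 partition, or to list labels for everything at once with blkid:

# e2label /dev/hda7
# blkid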

Exploring Hardware Devices

A snapshot of all connected devices is maintained by HAL: Hardware Abstraction Layer
hal-device lists all devices in text mode
hal-device-manager displays all devices on a graphical window
lspci and lsusb list devices connected to the PCI and USB buses, respectively
The /proc and /sys filesystems also contain bus and device specific information


Hardware devices can be monitored in a number of different ways. The /proc filesystem has historically been the main method, with files such as /proc/devices, /proc/interrupts, /proc/iomem and /proc/ioports. Buses such as the PCI and USB bus are also exposed through the /proc/bus/ directory.

To make the /proc entries more readable, utilities such as lspci and lsusb are also provided.

More recently, however, a new layer has been provided to expose hardware information: HAL (Hardware Abstraction Layer). HAL continuously maintains a snapshot of all hardware devices currently connected to the system. This snapshot may be monitored in text mode using the hal-device command, or in graphical mode with the hal-device-manager application.
# hal-device
# lspci
# lsusb

Much of the information provided through HAL can also be accessed from the /sys filesystem.

Possible commands for troubleshooting sound on Linux

1. Could you please try disabling the onboard sound from the BIOS settings? If disabling the onboard sound doesn't help, please provide the output of the following commands:
# aplay -l
# cat /proc/asound/cards
# cat /proc/asound/modules
# cat /proc/asound/devices

2. Please run the following command,
# alsactl -f alsactl.out store 0

This will generate alsactl.out file for card ID 0. Provide us the alsactl.out file from your system to check for different settings.

3. Also, provide a screenshot of the output of the following command:
# alsamixer -V all

4. This is to test the speakers
# speaker-test -Dhw:0,0 -c2 -t wav

Why is the output of du and df inconsistent for some partitions?

The output of the du and df commands may differ in some situations. For example:
# du -sh /opt
1.5G /opt

# df -hP /opt
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol05 6.0G 5.1G 594M 90% /opt

Here du shows 1.5G of utilization, whereas df shows 5.1G.

This behaviour can happen when a big file is deleted from the /opt partition while a running process still holds it open. The deleted file's blocks are not freed until the last process closes it, so df (which reports filesystem usage) still counts the space, while du (which walks the directory tree) no longer sees the file. Restarting the related service usually resolves the discrepancy.

For confirming, check the output of:
# lsof | grep '(deleted)'

In the leftmost columns of the matching entries, you can see the program holding the file and its PID.

You can restart the program if possible. If there is any problem with restarting the program, you may have to restart the system.
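A small sketch that lists the command, PID and path for each deleted-but-open file, so you know what to restart (the awk field positions assume lsof's usual column layout and paths without spaces, so treat the output as approximate):

#!/bin/bash
# Show command, PID and path of files that are deleted but still held open
lsof | grep '(deleted)' | awk '{print $1, $2, $(NF-1)}' | sort -u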

Why does my file system become read-only?

The kernel will change a filesystem to read-only when it detects an I/O error against that file system. This can happen for a number of reasons:

- Your connection to your SAN dropped;

- Your hard drive(s) are dying;

- You have significant data corruption;

- and on and on...


Except for the first reason I listed, all of the other reasons are Real Bad.

If you're lucky, you've got some minor data corruption that caused the kernel to try to write beyond the end of the drive or something like that; you should try running fsck on the filesystem first. Be warned, though, that if you have significant data corruption, fsck may completely hose the filesystem, so get as good a backup as you can first.

You should check /var/log/messages for kernel messages about this. If it happens again, dmesg will also have useful information (at least, it will until you reboot).

If the problem is transient, a simple userspace mount call will fix it:
# mount -o remount,rw,usrquota /home

How to manually clean up shared memory / semaphores

This procedure may be useful for something like an Oracle database, SAP or PeopleSoft when the application has been shut down but its shared memory hasn't been cleaned up for some reason.

Get the list of all semaphores for a user:
# ipcs | grep [username]

Remove (clean up) a semaphore:
# ipcrm -s [semaphore id]

You can also use the command below to generate a script:
# ipcs | grep [username] | awk '{printf("ipcrm -s %s\n", $2);}'
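A slightly fuller sketch that prints removal commands for both semaphores and shared memory segments owned by a given user (it assumes the owner is the third column of ipcs output; review the generated commands before piping them to sh):

#!/bin/bash
# Print ipcrm commands for all IPC semaphores and shared memory owned by $1
user=$1
ipcs -s | awk -v u="$user" '$3 == u {print "ipcrm -s " $2}'
ipcs -m | awk -v u="$user" '$3 == u {print "ipcrm -m " $2}'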

How do you find out whether an application leaks memory?

You have a powerful tool called valgrind. Get the customer to install the appropriate debug packages; if it is net-snmp, install the correct net-snmp debuginfo packages.

I will show two examples here, one for net-snmp and one for httpd.

net-snmp
# valgrind -v --leak-check=full --show-reachable=yes --log-file=snmpd-memchk /usr/sbin/snmpd -f -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a


httpd
# valgrind -v --leak-check=full --show-reachable=yes --log-file=httpd-memchk /usr/sbin/httpd -X


Allow the process/daemons to run for a day or more. Stop the valgrind process using CTRL-C and then attach the resultant file. The information will be collected in the file given to --log-file (httpd-memchk in the second example). What you see there is basically the leak summary.


Valgrind output
==3623== LEAK SUMMARY:
==3623== definitely lost: 1,754,671 bytes in 4,074 blocks.
==3623== indirectly lost: 1,321,068,200 bytes in 38,549 blocks.
==3623== possibly lost: 12,860,528 bytes in 251 blocks.
==3623== still reachable: 18,011,981 bytes in 133,360 blocks.
==3623== suppressed: 0 bytes in 0 blocks.

Before you take any decision, please consult with a senior engineer.

Huge Page Setup On Linux

What is hugepages and what are its advantages?

Hugepages is a mechanism that allows the Linux kernel to utilise the multiple page size capabilities of modern hardware architectures. Linux uses pages as the basic unit of memory – physical memory is partitioned and accessed using the basic page unit. The default page size is 4096 Bytes in the x86 architecture.


Hugepages allow large amounts of memory to be utilised with a reduced overhead. Linux uses a mechanism in the CPU architecture called the Translation Lookaside Buffer (TLB). This buffer contains mappings of virtual memory to actual physical memory addresses. The TLB is a limited hardware resource, so utilising a huge amount of physical memory with the default page size consumes the TLB and adds processing overhead.


The Linux kernel is able to set aside a portion of physical memory to be addressed using a larger page size. Since the page size is larger, there is less overhead managing the pages with the TLB.


In the Linux 2.6 series of kernels, hugepages is enabled using the CONFIG_HUGETLB_PAGE feature when the kernel is built. All kernels supplied by Red Hat for the Red Hat Enterprise Linux 4 release have the feature enabled.


Systems with large amounts of memory can be configured to utilise the memory more efficiently by setting aside a portion dedicated to hugepages. The actual size of the page is dependent on the system architecture. A typical x86 system will have a huge page size of 2048 kBytes. The huge page size may be found by looking at /proc/meminfo:
# cat /proc/meminfo |grep Hugepagesize
Hugepagesize: 2048 kB


What is required for an application or program to be able to take advantage of hugepages?

To take advantage of hugepages, an application needs to use the shmget() or shmat() type system calls. Examples of the use of the system call are available in the kernel-doc package in the file /usr/share/doc/kernel-doc-2.6.9/Documentation/vm/hugetlbpage.txt


An alternative to using shmget()-type system calls is to mount a hugetlbfs file system at a mount point. With hugetlbfs mounted, an application then uses the mmap() system call to map memory at the mount point. Details on how this is done, including the specific mount commands and code that illustrates the system calls in use, are also in the file /usr/share/doc/kernel-doc-2.6.9/Documentation/vm/hugetlbpage.txt


Follow these steps to enable Oracle to start using hugepages.

1. Hugepage counting formula:
(SGA_SIZE / 2M) + 100

So for a 10G SGA, the hugepage count should be set to:
((10 * 1024)M / 2M) + 100 = 5220

You can find the SGA size with the show sga command at the sqlplus prompt.

2. Put the following kernel parameter into the /etc/sysctl.conf file:
vm.nr_hugepages = 5220

3. Run the following command:
#sysctl -p

4. Modify the /etc/security/limits.conf file to have the following entries:
oracle soft memlock 20086560
oracle hard memlock 20086560

Then reboot the machine.

5. Check HugePages_Total and HugePages_Free using the following command:
#cat /proc/meminfo | grep -i huge

This should return something like the following:
HugePages_Total: 5220
HugePages_Free: 5220
HugePages_Rsvd: 00
Hugepagesize: 2048 kB


Also check the memlock setting using the ulimit -a command.

6. Start the Oracle DB and check whether hugepages are being allocated by using the following command:
#cat /proc/meminfo | grep -i huge
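The sizing formula from step 1 is easy to script. A minimal sketch that takes the SGA size in megabytes and prints the matching vm.nr_hugepages value (it assumes the 2 MB hugepage size shown above):

#!/bin/bash
# Compute vm.nr_hugepages for a given SGA size in MB (2 MB hugepages assumed)
sga_mb=$1
echo "vm.nr_hugepages = $(( sga_mb / 2 + 100 ))"

Running it with 10240 (a 10G SGA) reproduces the 5220 figure used above.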


How do I set up hugepages in Red Hat Enterprise Linux 4?

Hugepages can be allocated using the /proc/sys/vm/nr_hugepages entry, or by using the sysctl command.

To view the current setting using the /proc entry:
# cat /proc/sys/vm/nr_hugepages
0

To view the current setting using the sysctl command:
# sysctl vm.nr_hugepages
vm.nr_hugepages = 0

To set the number of huge pages using /proc entry:
# echo 5 > /proc/sys/vm/nr_hugepages

To set the number of hugepages using sysctl :
# sysctl -w vm.nr_hugepages=5
vm.nr_hugepages = 5

It may be necessary to reboot to be able to allocate all the hugepages that are needed. This is because hugepages require large areas of contiguous physical memory. Over time, physical memory may be mapped and allocated to pages, thus the physical memory can become fragmented. If the hugepages are allocated early in the boot process, fragmentation is unlikely to have occurred.

It is recommended that the /etc/sysctl.conf file be used to allocate hugepages at boot time. For example, to allocate 5 hugepages at boot time, add the line below to the sysctl.conf file:
vm.nr_hugepages = 5

Can I free memory by flushing the pagecache and the slabcache in Red Hat Enterprise Linux?

http://kbase.redhat.com/faq/docs/DOC-5411


http://kbase.redhat.com/faq/docs/DOC-7055

http://kbase.redhat.com/faq/docs/DOC-7986

http://kbase.redhat.com/faq/docs/DOC-4302

http://kbase.redhat.com/faq/docs/DOC-3368



> For example, with 8GB of total memory the cache is using 6GB. I would like to restrict the cache to 2 or 3 GB.

There is no direct method by which you can restrict the cache to 2 or 3 GB, since it is entirely managed by the Linux kernel.

You can still tune the following parameters in /etc/sysctl.conf in such a way that the kernel does less page caching:

vm.pagecache, vm.vfs_cache_pressure, vm.drop_caches, vm.swappiness

1) The following command can be used to free the pagecache:
echo 1 > /proc/sys/vm/drop_caches

You can see the effect by comparing the output of "free -m" before and after running the above command.

For more details: - http://kbase.redhat.com/faq/docs/DOC-5411
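A minimal before/after sketch (run sync first so dirty pages are written out and can actually be dropped):

# free -m
# sync
# echo 1 > /proc/sys/vm/drop_caches
# free -m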

2) Setting the vfs_cache_pressure value higher than 100 increases the kernel's willingness to reap caches of filesystem metadata. The kernel then frees lowmem more frequently, thus reducing fragmentation.

vm.vfs_cache_pressure=200

For more details: - http://kbase.redhat.com/faq/docs/DOC-4302

3) Lowering vm.swappiness will make the system reclaim page cache first, before swapping. A value of 20 should be fine, maybe even lower.
# echo '20' > /proc/sys/vm/swappiness


For more details check : - http://kbase.redhat.com/faq/docs/DOC-3368

Let me know if you need further explanation.

Understanding memory usage on Linux

This entry is for those people who have ever wondered, "Why the hell is a simple KDE text editor taking up 25 megabytes of memory?" Many people are led to believe that many Linux applications, especially KDE or Gnome programs, are "bloated" based solely upon what tools like ps report. While this may or may not be true, depending on the program, it is not generally true -- many programs are much more memory efficient than they seem.

What ps reports

The ps tool can output various pieces of information about a process, such as its process id, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual set size" and "resident set size", which are commonly used by geeks around the world to see how much memory processes are taking up.

For example, here is the output of ps aux for KEdit on my computer:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dbunker 3468 0.0 2.7 25400 14452 ? S 20:19 0:00 kdeinit: kedit

According to ps, KEdit has a virtual size of about 25 megabytes and a resident size of about 14 megabytes (both numbers above are reported in kilobytes). It seems that most people like to randomly choose to accept one number or the other as representing the real memory usage of a process. I'm not going to explain the difference between VSZ and RSS right now but, needless to say, this is the wrong approach; neither number is an accurate picture of what the memory cost of running KEdit is.


Why ps is "wrong"

Depending on how you look at it, ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory each process would take up if it were the only process running. Of course, a typical Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost definitely "wrong". In order to understand why, it is necessary to learn how Linux handles shared libraries in programs.

Most major programs on Linux use shared libraries to facilitate certain functionality. For example, a KDE text editing program will use several KDE shared libraries (to allow for interaction with other KDE components), several X libraries (to allow it to display images and copy and pasting), and several general system libraries (to allow it to perform basic operations). Many of these shared libraries, especially commonly used ones like libc, are used by many of the programs running on a Linux system. Due to this sharing, Linux is able to use a great trick: it will load a single copy of the shared libraries into memory and use that one copy for every program that references it.

For better or worse, many tools don't care very much about this very common trick; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes as well. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library is being double-counted, which can be very misleading if you don't know what is going on.

Unfortunately, a perfect representation of process memory usage isn't easy to obtain. Not only do you need to understand how the system really works, but you need to decide how you want to deal with some hard questions. Should a shared library that is only needed for one process be counted in that process's memory usage? If a shared library is used by multiple processes, should its memory usage be evenly distributed among the different processes, or just ignored? There isn't a hard and fast rule here; you might have different answers depending on the situation you're facing. It's easy to see why ps doesn't try harder to report "correct" memory usage totals, given the ambiguity.


Seeing a process's memory map

Enough talk; let's see what the situation is with that "huge" KEdit process. To see what KEdit's memory looks like, we'll use the pmap program (with the -d flag):
# pmap -d 489
Address Kbytes Mode Offset Device Mapping
08048000 40 r-x-- 0000000000000000 0fe:00000 kdeinit
08052000 4 rw--- 0000000000009000 0fe:00000 kdeinit
08053000 1164 rw--- 0000000008053000 000:00000 [ anon ]
40000000 84 r-x-- 0000000000000000 0fe:00000 ld-2.3.5.so
40015000 8 rw--- 0000000000014000 0fe:00000 ld-2.3.5.so
40017000 4 rw--- 0000000040017000 000:00000 [ anon ]
40018000 4 r-x-- 0000000000000000 0fe:00000 kedit.so
40019000 4 rw--- 0000000000000000 0fe:00000 kedit.so
40027000 252 r-x-- 0000000000000000 0fe:00000 libkparts.so.2.1.0
40066000 20 rw--- 000000000003e000 0fe:00000 libkparts.so.2.1.0
... (trimmed) ...
mapped: 25404K writeable/private: 2432K shared: 0K

I cut out a lot of the output; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note about the output is that each shared library is listed twice; once for its code segment and once for its data segment. The code segments have a mode of "r-x--", while the data is set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we will care about, as the rest are unimportant to the discussion.

If you go through the output, you will find that the lines with the largest Kbytes number are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). What is great about that is that they are the ones that can be shared between processes. If you factor out all of the parts that are shared between processes, you end up with the "writeable/private" total, which is shown at the bottom of the output. This is what can be considered the incremental cost of this process, factoring out the shared libraries. Therefore, the cost to run this instance of KEdit (assuming that all of the shared libraries were already loaded) is around 2 megabytes. That is quite a different story from the 14 or 25 megabytes that ps reported.
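To apply the same reasoning to a whole group of processes, you can sum the writeable/private figure from the last line of pmap's output. A rough sketch, assuming the process name is passed as an argument and pmap prints its totals in the format shown above:

#!/bin/bash
# Sum the writeable/private memory of every process whose name matches $1
total=0
for pid in $(pgrep "$1"); do
    kb=$(pmap -d "$pid" | awk '/writeable\/private/ {print $4}' | tr -d 'K')
    total=$((total + ${kb:-0}))
done
echo "writeable/private total for '$1': ${total}K"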


What does it all mean?

The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle (although see one of my past postings for another way of increasing Apache's performance).

It also shows that it pays to stick with one desktop's software as much as possible. If you run KDE for your desktop, but mostly use Gnome applications, then you are paying a large price for a lot of redundant (but different) shared libraries. By sticking to just KDE or just Gnome apps as much as possible, you reduce your overall memory usage due to the reduced marginal memory cost of running new KDE or Gnome applications, which allows Linux to use more memory for other interesting things (like the file cache, which speeds up file accesses immensely).

reference : http://virtualthreads.blogspot.com/2006/02/understanding-memory-usage-on-linux.html

Why doesn't free memory go down

Why does Linux always seem to have so little free memory? Does this indicate some kind of problem in Linux or the application? No.

Someone at work asked (paraphrasing):
I have a process that uses a lot of memory while it's running, so the free memory (shown by free or top) goes right down to 60MB out of 8100MB.
But when the process exits, the free memory doesn't go back up. Why isn't memory released when the process exits?

The short answer is that you should never worry about the amount of free memory on Linux. The kernel attempts to keep this slightly above zero by keeping the cache as large as possible. This is a feature not a bug.

If you are concerned about VM performance then the most useful thing to watch is the page in/out rate, shown by the "bi" and "bo" columns in vmstat. Another useful measure (2.6 only) is the "wa" column, showing the amount of CPU time spent waiting for I/O. "wa" is probably the one you have to worry about most, because it shows CPU cycles that are essentially wasted because VM is too slow.
[root@sbandodk ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 201800 144024 974952 0 0 41 31 1098 896 14 3 83 1 0
As you said, Linux is keeping the free memory in the buffer cache, but when there is no process running, how come the buffer cache is holding 4GB, and how does it release 3GB back to free memory?

Disk cache is maintained globally, not per-process. Files can remain in cache even after the process that was using them exited, because they might be used by another process. Freeing the cache would mean discarding cached data. There's no reason to do that until the data is obsolete (e.g. files are deleted) or the memory is needed for some other purpose.

After a while the free memory goes back up again.

Pages only become free when they're evicted to build up the free pool (see below), or when nothing useful can be stored in them. If there are gigabytes of free memory then the main cause is that the kernel doesn't have anything to cache in them.

This can happen when, for example, a file that was cached was deleted, or a filesystem is unmounted: there's no point keeping those pages cached because they can't be accessed. (Note that the kernel can still cache a file which is just unlinked, but still in use by applications.)

A similar case is that an application has allocated a lot of anonymous memory and then either exited or freed the memory. That data is discarded, so the pages are free.

Note that flushing the data to disk makes the pages clean, but not free. They can still be kept in memory in case they're read in the future. ("Clean" means the in-memory page is the same as the on-disk page.)

The guy in the second row asks:
So if Linux tries to keep the cache as large as possible, why is there 60MB free rather than zero?
Wouldn't it be better to cache an additional 60MB?

Linux keeps a little bit of memory free so that it is ready as soon as it needs to allocate more memory. If the extra 60MB was used for cache too then when a new allocation was required, the kernel would have to go through the cache and work out what to evict. Possibly it would need to wait for a page to be written out. This would make allocation slower and more complex. So there is a tradeoff where the page cache is made slightly slower so that allocation can be faster and simpler. The kernel keeps just a few free pages prepared in advance.

How can I dedicate one or more CPUs on a multi-processor system for exclusive use by one or more specific applications?

http://kbase.redhat.com/faq/docs/DOC-15596

Why does top output for CPU utilization differ from mpstat on SMP systems?

This can be reproduced by running the following commands:
# rpm -Vq > /tmp/rpm.txt &
# top
# mpstat

Notice that after you run the top command, pressing "1" shows a listing of all cores. This will show the actual CPU utilization per core.

Example:

Top command run without pressing "1"
Cpu(s): 9.5%us, 12.0%sy, 0.0%ni, 77.9%id, 0.4%wa, 0.0%hi, 0.0%si, 0.2%st

Top command run after pressing "1"
Cpu0 : 2.0%us, 8.0%sy, 0.0%ni, 77.1%id, 12.6%wa, 0.0%hi, 0.0%si, 0.3%st
Cpu1 : 3.3%us, 7.3%sy, 0.0%ni, 89.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Cpu2 : 18.2%us, 13.9%sy, 0.0%ni, 66.9%id, 0.0%wa, 0.0%hi, 0.0%si, 1.0%st
Cpu3 : 7.3%us, 12.6%sy, 0.0%ni, 80.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Output of mpstat command

[root@xen ~]# mpstat -P ALL 1
Linux 2.6.18-128.1.10.el5xen (xen.pnq.redhat.com) 05/23/2009

04:56:19 AM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
04:56:20 AM all 9.09 0.00 10.32 0.74 0.00 0.00 0.49 79.36 4929.70
04:56:20 AM 0 10.89 0.00 8.91 0.00 0.00 0.00 0.99 79.21 1289.11
04:56:20 AM 1 2.97 0.00 11.88 2.97 0.00 0.00 0.99 81.19 2083.17
04:56:20 AM 2 13.86 0.00 10.89 0.00 0.00 0.00 0.99 74.26 817.82
04:56:20 AM 3 8.49 0.00 10.38 0.00 0.00 0.00 0.00 81.13 727.72

How do I Find Out Linux CPU Utilization?

Whenever a Linux system's CPU is occupied by a process, it is unavailable for processing other requests; the rest of the pending requests must wait until the CPU is free, and this becomes a bottleneck in the system. The following commands will help you identify CPU utilization, so that you can troubleshoot CPU-related performance problems.

Finding CPU utilization is one of the important tasks. Linux comes with various utilities to report CPU utilization. With these commands, you will be able to find out:
CPU utilization
Display the utilization of each CPU individually (SMP cpu)
Find out your system's average CPU utilization since the last reboot etc
Determine which process is eating the CPU(s)


Good old top command to find out Linux CPU load

To find out the current Linux CPU usage, type the top command:
$ top


Find Linux CPU utilization using mpstat and other tools

Please note that you need to install a special package called sysstat to take advantage of the following commands. This package includes system performance tools for Linux (Red Hat Linux / RHEL includes these tools by default).
apt-get install sysstat

Use the up2date command if you are using RHEL:
up2date sysstat
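Once sysstat is installed, the two commands most often used for this are mpstat (per-CPU utilization, as shown earlier) and sar (historical utilization collected by the sysstat cron job). For example:

# mpstat -P ALL 2 5            # per-CPU statistics, 5 samples at 2-second intervals
# sar -u 2 5                   # overall CPU utilization, same sampling
# sar -u -f /var/log/sa/sa15   # CPU history for the 15th of the month, if collected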
