April 25, 2019

Troubleshooting Linux Servers

top and Other Basic System Tools

  • November 24, 2010
  • By Akkana Peck
You thought you had it all working, didn't you? But then your users report slowdowns, or your logfiles are empty, or jobs don't run. So how do you find out what's going on?
You thought you had it all working, didn't you? And then you find out that the process you thought was running and collecting data hasn't reported anything for two hours. Or maybe it's something on the desktop -- your browser has frozen and isn't responding. Or suddenly everything's gotten really slow and you're not sure why. And this happens every few days, and you're tired of it.

How do you find out what's going on in your running processes?

top and other basic system tools

All the techniques discussed here require a process ID. If you know the name of the process that's stuck or running wild, you can get its PID with ps aux | grep processname. Otherwise, you can usually find high-CPU processes with top:
Tasks: 114 total,   1 running, 113 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.2%us,  0.6%sy,  0.6%ni, 96.0%id,  1.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4053756k total,  1059196k used,  2994560k free,   305236k buffers
Swap:  2249060k total,        0k used,  2249060k free,   465112k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 3055 akkana    20   0  160m  39m  18m S   39  1.0   0:02.83 plugin-containe    
 2223 akkana    20   0  330m 107m  26m S   16  2.7   0:51.33 firefox-bin        
   65 root      20   0     0    0    0 S    2  0.0   0:00.34 kondemand/0        
 1586 root      20   0 71712  22m 8244 S    2  0.6   0:24.87 Xorg               
    1 root      20   0  2748 1612 1216 S    0  0.0   0:00.37 init               
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd           
    3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0        
...and so on

By default, top starts with the processes that are eating the most CPU. In this case, Firefox isn't stuck, but it is running Flash, so the browser and its helper app are together taking up 55% of CPU (39% plus 16%). That's not a killer, but if the system is slow and you see a process using around 99% CPU, you've found the culprit.
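
If you'd rather get a similar ranking from a script, something like the following can work. It's a minimal sketch, not a replacement for top: it assumes a procps-style ps that accepts -eo pid,pcpu,comm (as on most Linux distributions), and the function names parse_ps and busiest are made up for illustration. Note that ps reports CPU use averaged over the life of each process, so the numbers won't match top's moment-to-moment figures.

```python
#!/usr/bin/env python3
"""List the processes using the most CPU, roughly like the top of top's list."""

import subprocess


def parse_ps(text):
    """Turn `ps -eo pid,pcpu,comm` output into (pid, cpu, command) tuples."""
    rows = []
    for line in text.splitlines()[1:]:        # skip the header line
        fields = line.split(None, 2)          # command names may contain spaces
        if len(fields) == 3:
            pid, cpu, command = fields
            rows.append((int(pid), float(cpu), command.strip()))
    return rows


def busiest(rows, count=5):
    """Return the `count` rows with the highest CPU percentage."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]


if __name__ == "__main__":
    try:
        # Assumes a procps-style ps; other ps variants may reject these options.
        output = subprocess.check_output(["ps", "-eo", "pid,pcpu,comm"], text=True)
    except (OSError, subprocess.CalledProcessError) as err:
        print("could not run ps:", err)
    else:
        for pid, cpu, command in busiest(parse_ps(output)):
            print("%6d %5.1f%% %s" % (pid, cpu, command))
```

The parsing is kept separate from the call to ps so you can try it on saved output as well as on a live system.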

Once you've identified the process, how do you find out more about what it's doing?


strace is a useful program that shows system calls as they happen.

System calls include file operations like read, write and open, timeouts and signals, network operations, and assorted other ways to get or set system information. You can read a general overview with man 2 intro or a long list of all available calls with man 2 syscalls.

This all may sound a bit arcane, but sometimes watching strace output will tell you why a program is failing -- maybe it's waiting for the network, or repeatedly trying to open a file that doesn't exist.
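
To see what that second case looks like, you can write a small program that fails on purpose. This is only a sketch for practice: the path is a hypothetical one picked because it shouldn't exist, and poll_for_file is an illustrative name. Run under strace, each attempt should show up as an open or openat call that fails with ENOENT.

```python
#!/usr/bin/env python3
"""Repeatedly try to open a file that isn't there, for practicing with strace."""

import time

# Hypothetical path, chosen only because it should not exist.
MISSING = "/tmp/no-such-file-for-strace-demo"


def poll_for_file(path, attempts=5, delay=1.0):
    """Try to open `path` up to `attempts` times; return how many tries failed."""
    failures = 0
    for _ in range(attempts):
        try:
            with open(path) as handle:
                handle.read()
            break                      # the file finally appeared
        except FileNotFoundError:
            failures += 1
            time.sleep(delay)          # wait, then try again
    return failures


if __name__ == "__main__":
    print("failed attempts:", poll_for_file(MISSING))
```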

You can run a program under strace, e.g. strace firefox. But more often, you'll want to attach to a process that's already running. Get the process ID from ps or top, then pass it to strace -p.

Suppose I have a process that seems to have hung: top says it's not using any CPU, but it's stuck and hasn't done anything in half an hour.

$ strace -p 3672
Process 3672 attached - interrupt to quit
... strace just stops there, with the cursor in the middle of a line. What's up?

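strace prints each system call as it starts, so that half-finished line is a call that hasn't returned yet: here, recv. The program is waiting in the recv system call. Hit Ctrl-C to exit strace, then use apropos: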

$ apropos recv
recv (2)             - receive a message from a socket
recvfrom (2)         - receive a message from a socket
recvmsg (2)          - receive a message from a socket

So the process is waiting to read something from a network socket. That's some progress, anyway.
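
From the program's side, a hang like this is just a read on a socket whose other end has gone quiet. The sketch below reproduces that locally with a connected pair of sockets; it is not the hung program from the example, and wait_for_message is an illustrative name. The timeout is there only so the script finishes. A program that calls recv with no timeout simply waits, as process 3672 is doing. (Because of the timeout, strace on this sketch may show a poll call before the recv.)

```python
#!/usr/bin/env python3
"""Show a read on a socket whose peer never sends anything."""

import socket


def wait_for_message(sock, timeout):
    """Wait up to `timeout` seconds for data; return it, or None if none came."""
    sock.settimeout(timeout)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None


if __name__ == "__main__":
    # socketpair() gives two connected sockets, so no network is needed.
    ours, theirs = socket.socketpair()
    print("silent peer: ", wait_for_message(ours, 2.0))
    theirs.sendall(b"hello")
    print("after a send:", wait_for_message(ours, 2.0))
    ours.close()
    theirs.close()
```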

Wait, how do you test this stuff?

As you build up a library of diagnostic tools, you may sometimes wish you had an easier way to experiment with them. It's also handy if you're writing articles! Naturally, when I wanted a program to misbehave so I could show how to debug it, everything on my system worked perfectly. What's a poor girl to do?

Write a misbehaving program! It's easy to simulate a network hang if you have a web server handy. On the server side, write a script like this one:

#! /usr/bin/env python

import time

# CGI header, a blank line, then the start of the page.
print("""Content-Type: text/html

Hello, world. Now we'll hang for a bit ...
""")

for i in range(50):    # Don't run forever and clog up the server
    time.sleep(300)    # sleep for 5 minutes
    print("\nAnother line")

You can test it with wget or curl, or write a Python script:

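#!/usr/bin/env python

try:
    from urllib.request import urlopen    # Python 3
except ImportError:
    from urllib2 import urlopen           # Python 2

# Blocks until the server sends its response headers.
response = urlopen("http://example.com/testcgi/index.cgi")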

Of course, if all you want is a program that takes up all available CPU, type something like this into your bash shell, or write the equivalent in any other programming language:

while /bin/true; do
  echo x
done
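
Here is one way the Python equivalent might look. It's a sketch rather than a drop-in tool: unlike the shell loop it gives up after a fixed time (the 10 seconds is an arbitrary choice), so a forgotten test doesn't keep a core pinned. The name burn_cpu is made up for illustration.

```python
#!/usr/bin/env python3
"""Keep one CPU core busy for a fixed time, then stop."""

import time


def burn_cpu(seconds):
    """Spin in a tight loop for `seconds`; return the number of iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


if __name__ == "__main__":
    # 10 seconds is an arbitrary choice: long enough to spot the process in top.
    print("iterations:", burn_cpu(10))
```

While it runs, the process should climb to the top of top's list, which makes it a convenient target for practicing with the tools above.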
