Troubleshooting Linux Servers
top and Other Basic System Tools
You thought you had it all working, didn't you? But then your users report slowdowns, or your logfiles are empty, or jobs don't run. Or you find out that the process you thought was running and collecting data hasn't reported anything for two hours. Or maybe it's something on the desktop: your browser has frozen and isn't responding. Or suddenly everything's gotten really slow and you're not sure why. And this happens every few days, and you're tired of it.
How do you find out what's going on in your running processes?
top and other basic system tools
All the techniques discussed here require a process ID. If you know the name of the process that's stuck or running wild, you can get its PID with ps aux | grep processname. Otherwise, you can usually find high-CPU processes with top:
Tasks: 114 total, 1 running, 113 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.2%us, 0.6%sy, 0.6%ni, 96.0%id, 1.6%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4053756k total, 1059196k used, 2994560k free, 305236k buffers
Swap: 2249060k total, 0k used, 2249060k free, 465112k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3055 akkana    20   0  160m  39m  18m S   39  1.0   0:02.83 plugin-containe
 2223 akkana    20   0  330m 107m  26m S   16  2.7   0:51.33 firefox-bin
   65 root      20   0     0    0    0 S    2  0.0   0:00.34 kondemand/0
 1586 root      20   0 71712  22m 8244 S    2  0.6   0:24.87 Xorg
    1 root      20   0  2748 1612 1216 S    0  0.0   0:00.37 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
...and so on
By default, top sorts its list so that the processes eating the most CPU come first. In this case, Firefox isn't stuck, but it is running Flash, so the browser (16%) and its helper app (39%) are together taking up 55% of CPU. That's not a killer, but if the system is slow and you see a process using around 99% CPU, you've probably found the culprit.
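If you want to pick out the busy processes automatically, the listing above can be parsed by a script. What follows is a minimal sketch, not a polished tool: the function name busiest_processes and the 10% threshold are made up for illustration, and it assumes the 12-column layout shown above with %CPU in the ninth column, which can differ between top versions and configurations. The sample text is copied from the listing; on a live system you could capture similar text with top -b -n 1 (batch mode, one iteration).

```python
def busiest_processes(top_output, threshold=10.0):
    """Return (pid, command, cpu) for rows at or above threshold %CPU, busiest first."""
    rows = []
    in_table = False
    for line in top_output.splitlines():
        fields = line.split()
        if not in_table:
            # The process table starts right after the "PID USER ..." header line.
            in_table = fields[:2] == ["PID", "USER"]
            continue
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        cpu = float(fields[8])  # assumption: %CPU is the 9th column, as in the listing
        if cpu >= threshold:
            rows.append((int(fields[0]), fields[11], cpu))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# A few rows copied from the listing above.
SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3055 akkana 20 0 160m 39m 18m S 39 1.0 0:02.83 plugin-containe
2223 akkana 20 0 330m 107m 26m S 16 2.7 0:51.33 firefox-bin
65 root 20 0 0 0 0 S 2 0.0 0:00.34 kondemand/0
"""

if __name__ == "__main__":
    for pid, command, cpu in busiest_processes(SAMPLE):
        print(pid, command, cpu)
```

The PIDs it returns are what you would feed to the tools discussed next.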
Once you've identified the process, how do you find out more about what it's doing?
strace
strace is a useful program that shows system calls as they happen.
System calls include file operations like read, write and open, timeouts and signals, network operations, and assorted other ways to get or set system information. You can read a general overview with man 2 intro or a long list of all available calls with man 2 syscalls.
This all may sound a bit arcane, but sometimes watching strace output will tell you why a program is failing -- maybe it's waiting for the network, or repeatedly trying to open a file that doesn't exist.
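To see what that second case looks like, here is a small sketch of a program that keeps trying to open a file that isn't there. The function name try_open and the path /no/such/file.conf are invented for this example. Run under strace, each attempt would typically show up as an open or openat call failing with ENOENT, though the exact call name depends on your C library and kernel.

```python
import errno
import os


def try_open(path, attempts=3):
    """Try to open path several times; return the errno name of each failure."""
    failures = []
    for _ in range(attempts):
        try:
            fd = os.open(path, os.O_RDONLY)  # one open-style system call per attempt
        except OSError as err:
            failures.append(errno.errorcode.get(err.errno, str(err.errno)))
        else:
            os.close(fd)  # the file existed after all; don't leak the descriptor
    return failures


if __name__ == "__main__":
    # Hypothetical path, chosen because it should not exist.
    print(try_open("/no/such/file.conf"))
```

If you save this to a file and run it under strace, restricting the output with -e trace=open,openat should make the failing calls easier to spot among the interpreter's own startup activity.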
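You can run a program under strace, e.g. strace firefox. But more often, you'll want to attach to a process that's already running. Get the process ID from ps or top, then run strace -p followed by that PID. You'll need to own the process or be root, and on some distributions the default security settings require root to attach at all.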
Suppose I have a process that seems to have hung: top says it's not using any CPU, but it's stuck and hasn't done anything in half an hour.
$ strace -p 3672
Process 3672 attached - interrupt to quit
recv(3,...
strace just stops there, with the cursor in the middle of a line. What's up?
The program is blocked in the recv system call, waiting for it to return. Hit Ctrl-C to exit strace, then use apropos:
$ apropos recv
recv (2) - receive a message from a socket
recvfrom (2) - receive a message from a socket
recvmsg (2) - receive a message from a socket
So the process is waiting to read something from a network socket. That's some progress, anyway.
Wait, how do you test this stuff?
As you build up a library of diagnostic tools, you may sometimes wish you had an easier way to experiment with them. It's also handy if you're writing articles! Naturally, when I wanted a program to misbehave so I could show how to debug it, everything on my system worked perfectly. What's a poor girl to do?
Write a misbehaving program! It's easy to simulate a network hang if you have a web server handy. On the server side, write a script like this one:
#!/usr/bin/env python3
import time

# CGI output: a header, then a blank line, then the body.
print("""Content-Type: text/html

Hello, world. Now we'll hang for a bit ...
""")

for i in range(50):    # Don't run forever and clog up the server
    time.sleep(300)    # sleep for 5 minutes
    print("\nAnother line")
You can test it with wget or curl, or write a Python script:
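#!/usr/bin/env python3
import urllib.request

response = urllib.request.urlopen("http://example.com/testcgi/index.cgi")
# read() doesn't return until the server has finished sending
print(response.read())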
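If you don't have a web server handy, you can fake the same kind of hang on one machine. The sketch below is an illustration under stated assumptions rather than a drop-in test harness: the names silent_server and recv_from_silent_peer are made up, and it relies on loopback networking being available. A listener accepts one connection and then never sends anything, so the client's recv has nothing to return. The timeout is only there so the example finishes on its own.

```python
import socket
import threading


def silent_server(listener, hold):
    """Accept one connection, then send nothing until `hold` is set."""
    conn, _ = listener.accept()
    hold.wait()
    conn.close()


def recv_from_silent_peer(timeout=1.0):
    """Connect to a local server that never sends; return True if recv timed out."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    hold = threading.Event()
    server = threading.Thread(target=silent_server, args=(listener, hold), daemon=True)
    server.start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(timeout)  # use None instead to wait forever
    try:
        client.recv(1024)  # nothing ever arrives
        timed_out = False
    except socket.timeout:
        timed_out = True
    finally:
        hold.set()  # let the server thread close its end and exit
        client.close()
        server.join()
        listener.close()
    return timed_out


if __name__ == "__main__":
    print("recv timed out:", recv_from_silent_peer())
```

With the timeout set to None, the client should sit in a recv-family system call indefinitely, which gives you something to attach strace to. With a timeout, Python waits for the socket to become readable first, so strace is likely to show a poll or select call instead.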
Of course, if you just want a program to take up all available CPU, type something like this into your bash shell, or write the equivalent in any other programming language:
while /bin/true; do echo x; done
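For a CPU hog that stops on its own, here is one possible Python equivalent. It's a sketch: the function name burn_cpu is invented for this example, and it keeps only one core busy, since it runs in a single thread.

```python
import time


def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # busy work: no sleeping, no I/O
    return iterations
```

Calling burn_cpu(60) from a script or the interactive interpreter should give you about a minute to find the process near the top of the top listing.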