Troubleshooting Linux Servers
You thought you had it all working, didn't you? But then your users report slowdowns, or your logfiles are empty, or jobs don't run -- so how do you find out what's going on?
You thought you had it all working, didn't you? And then you find out that the process you thought was running and collecting data hasn't reported anything for two hours. Or maybe it's something on the desktop -- your browser has frozen and isn't responding. Or suddenly everything's gotten really slow and you're not sure why. And this happens every few days, and you're tired of it.
How do you find out what's going on in your running processes?
top and other basic system tools
    Tasks: 114 total,   1 running, 113 sleeping,   0 stopped,   0 zombie
    Cpu(s):  1.2%us,  0.6%sy,  0.6%ni, 96.0%id,  1.6%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   4053756k total,  1059196k used,  2994560k free,   305236k buffers
    Swap:  2249060k total,        0k used,  2249060k free,   465112k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     3055 akkana    20   0  160m  39m  18m S   39  1.0   0:02.83 plugin-containe
     2223 akkana    20   0  330m 107m  26m S   16  2.7   0:51.33 firefox-bin
       65 root      20   0     0    0    0 S    2  0.0   0:00.34 kondemand/0
     1586 root      20   0 71712  22m 8244 S    2  0.6   0:24.87 Xorg
        1 root      20   0  2748 1612 1216 S    0  0.0   0:00.37 init
        2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
        3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0
    ...and so on
By default, top starts with the processes that are eating the most CPU. In this case, Firefox isn't stuck, but it is running Flash, so the browser and its helper app are together taking up 55% of CPU (39% plus 16%). That's not a killer, but if the system is slow and you see a process using around 99% CPU, you've found the culprit.
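If you'd rather get that ranking from a script than from an interactive screen, something like the following sketch might do. It assumes the procps version of ps (for the --sort and --no-headers options), and the function names are made up for illustration. Keep in mind that ps reports CPU use averaged over each process's lifetime, so its numbers won't exactly match top's snapshot.

```python
import subprocess

def parse_ps(text):
    """Parse 'pid pcpu comm' lines (no header) into (pid, pcpu, comm) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)      # the command name may contain spaces
        if len(parts) != 3:
            continue                     # skip blank or malformed lines
        pid, pcpu, comm = parts
        rows.append((int(pid), float(pcpu), comm.strip()))
    return rows

def busiest(limit=5):
    """Return the `limit` processes with the highest %CPU according to ps.

    Assumes procps ps; other ps implementations may not accept these options.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm", "--sort=-pcpu", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)[:limit]
```

Calling busiest(3) should return a short list of tuples with the hungriest process first, ready to feed into whatever you use for logging or alerts.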
Once you've identified the process, how do you find out more about what it's doing?
strace is a useful program that shows system calls as they happen.
System calls include file operations like read, write and open, timeouts and signals, network operations, and assorted other ways to get or set system information. You can read a general overview with man 2 intro or a long list of all available calls with man 2 syscalls.
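To get a feel for what those calls look like from the program's side, here's a small Python sketch that uses the thin os wrappers around open, write, read and close. The scratch file and the function name are just for illustration. If you run it under strace, each step should show up as a matching call, though on current Linux systems the open usually appears as openat.

```python
import os
import tempfile

def roundtrip(message):
    """Write bytes to a scratch file and read them back using raw os calls."""
    fd, path = tempfile.mkstemp()          # open: creates the scratch file
    try:
        os.write(fd, message)              # write
        os.close(fd)                       # close
        fd = os.open(path, os.O_RDONLY)    # open again, read-only this time
        data = os.read(fd, len(message))   # read
        os.close(fd)                       # close
    finally:
        os.unlink(path)                    # unlink: remove the scratch file
    return data
```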
This all may sound a bit arcane, but sometimes watching strace output will tell you why a program is failing -- maybe it's waiting for the network, or repeatedly trying to open a file that doesn't exist.
You can run a program under strace, e.g. strace firefox. But more often, you'll want to attach to a process that's already running. Get the process ID from ps or top, then run strace -p followed by that process ID.
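If you do this a lot, looking up the process ID can be scripted. The sketch below is one possible approach, assuming a Linux /proc filesystem; both function names are hypothetical. It matches against /proc/PID/comm, where the kernel truncates names to 15 characters, so a long name has to be given in its truncated form.

```python
import os

def pids_named(name):
    """Return the PIDs whose command name in /proc/PID/comm equals `name`."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue                       # not a process directory
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    found.append(int(entry))
        except OSError:
            pass                           # process exited while we were looking
    return sorted(found)

def strace_command(pid):
    """Build the argument list for attaching strace to a running process."""
    return ["strace", "-p", str(pid)]
```

You could hand the result of strace_command to subprocess.run, or just print it and paste it into a terminal. Attaching to a process you don't own generally needs root.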
Suppose I have a process that seems to have hung: top says it's not using any CPU, but it's stuck and hasn't done anything in half an hour.
    $ strace -p 3672
    Process 3672 attached - interrupt to quit
    recv(3,...

strace just stops there, with the cursor in the middle of a line. What's up?
The program is waiting for the recv system call. Hit Ctrl-C to exit strace, then use apropos:
    $ apropos recv
    recv (2)             - receive a message from a socket
    recvfrom (2)         - receive a message from a socket
    recvmsg (2)          - receive a message from a socket
So the process is waiting to read something from a network socket. That's some progress, anyway.
Wait, how do you test this stuff?
As you build up a library of diagnostic tools, you may sometimes wish you had an easier way to experiment with them. It's also handy if you're writing articles! Naturally, when I wanted a program to misbehave so I could show how to debug it, everything on my system worked perfectly. What's a poor girl to do?
Write a misbehaving program! It's easy to simulate a network hang if you have a web server handy. On the server side, write a script like this one:
    #!/usr/bin/env python3

    import time

    # The blank line after the Content-Type header ends the CGI headers.
    print("""Content-Type: text/html

    Hello, world.
    Now we'll hang for a bit ...
    """)

    for i in range(50):    # Don't run forever and clog up the server
        time.sleep(300)    # sleep for 5 minutes
        print("...")
You can test it with wget or curl, or write a Python script:
    #!/usr/bin/env python3

    import urllib.request

    response = urllib.request.urlopen("http://example.com/testcgi/index.cgi")
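If you don't have a web server handy, you can fake the same kind of hang on one machine. The following is a minimal sketch, not the setup used above: a throwaway local server accepts a connection and never sends anything, so the client sits in recv. The function name and the timeout are my own additions, and the timeout is only there so the demo ends by itself.

```python
import socket
import threading

def hang_demo(timeout=1.0):
    """Connect to a local server that never replies.

    Returns True if the client's recv timed out, i.e. it was stuck waiting.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    release = threading.Event()

    def silent_server():
        conn, _ = server.accept()
        release.wait()                  # hold the connection open, send nothing
        conn.close()

    worker = threading.Thread(target=silent_server, daemon=True)
    worker.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.settimeout(timeout)          # remove this line to hang indefinitely
    try:
        client.recv(1024)               # blocks here, waiting for data
        timed_out = False
    except socket.timeout:
        timed_out = True
    finally:
        release.set()                   # let the server thread finish
        client.close()
        worker.join()
        server.close()
    return timed_out
```

With the settimeout line removed, you can attach strace to the Python process and watch it sit in the receive call. Depending on the platform, it will likely appear as recvfrom rather than recv.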
Of course, if all you want is a program that takes up all available CPU, type something like this into your bash shell, or the equivalent in any other programming language:
    while /bin/true; do
      echo x
    done
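In Python, the equivalent might look like this sketch. Unlike the shell loop, it stops by itself after a set time, a choice made here so that a forgotten test doesn't hog a core forever. The function name is just for illustration.

```python
import time

def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds` of wall-clock time.

    Returns the number of loop iterations completed.
    """
    deadline = time.monotonic() + seconds
    spins = 0
    while time.monotonic() < deadline:
        spins += 1                      # pointless work to keep one core busy
    return spins
```

Run burn_cpu(60) in one terminal and top in another, and the Python process should climb to the head of the list.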