Advanced Linux Server Troubleshooting (part 2)
All about /proc, and how to debug Python
You know the basics of how to find out what an errant process is doing -- or if not, you may want to review Troubleshooting part I first. But what do you do when the basics aren't enough?
Getting info without gdb or strace: /proc
Even if you don't have any system tools installed, you can get a lot of the same information from the /proc filesystem.
Suppose you have a process that seems frozen and you want to know why. Get its process ID with ps and grep:
$ ps au | grep getstuff
akkana    9997  0.3  0.1   8444  4920 pts/2    S+   20:12   0:00 python /home/akkana/bin/getstuff.py
Armed with the process ID, cd /proc/processID and poke around.
$ cd /proc/9997
$ ls
attr/            cpuset   io        mountinfo   pagemap      smaps    task/
auxv             cwd@     latency   mounts      personality  stack    wchan
cgroup           environ  limits    mountstats  root@        stat
clear_refs       exe@     loginuid  net/        sched        statm
cmdline          fd/      maps      oom_adj     schedstat    status
coredump_filter  fdinfo/  mem       oom_score   sessionid    syscall
There's all sorts of useful information there. For instance, status tells you the current state of the process. environ gives you the environment variables the process is using, like ps aueww -- not useful in this case, but you may need it some day.
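If you'd rather pull these values out with a script than read them by eye, both files are simple to parse. Here is a minimal Python 3 sketch; the helper names parse_status, parse_environ and read_proc are made up for illustration. It assumes only that status is a list of "Key: value" lines and that environ is a run of NUL-separated NAME=value entries.

```python
import os


def parse_status(text):
    """Turn the 'Key: value' lines of /proc/PID/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_environ(raw):
    """Turn the NUL-separated bytes of /proc/PID/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep:  # skip the empty entry after the final NUL
            env[os.fsdecode(name)] = os.fsdecode(value)
    return env


def read_proc(pid, name, binary=False):
    """Read one file under /proc/PID (Linux only)."""
    path = os.path.join("/proc", str(pid), name)
    with open(path, "rb" if binary else "r") as f:
        return f.read()
```

For the process above, parse_status(read_proc(9997, "status"))["State"] would give the process state, and parse_environ(read_proc(9997, "environ", binary=True)) its environment. Reading another user's environ needs the right permissions.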
ls -l fd will show you all the files the process has open and where they live on the file system. It also tells you about other file-like objects like network sockets.
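The same listing is available from a script, because each entry in fd is a symbolic link named after the descriptor number. A small Python 3 sketch follows; list_open_files is a hypothetical helper name, and it only works on Linux systems with /proc mounted.

```python
import os


def list_open_files(pid):
    """Map each open file descriptor of a process to what it points at.

    Reads the symlinks in /proc/PID/fd, much like `ls -l fd`.
    """
    fd_dir = os.path.join("/proc", str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Calling list_open_files(9997) returns a dictionary keyed by descriptor number. Regular files show up as paths; sockets and pipes show up as names like socket:[...] rather than as paths.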
The net directory is full of arcane information mostly of use to networking gurus, but even the rest of us can sometimes glean useful information:
$ cat net/sockstat
sockets: used 49
TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 2
UDP: inuse 2 mem 2
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
$ cat net/wireless
Inter-| sta-|   Quality        |   Discarded packets                      | Missed | WE
 face | tus | link level noise |  nwid  crypt    frag   retry     misc    | beacon | 22
 mlan0: 0001    0    197   188       0      0  9342130  506582  1258102          0
Hmm, lots of retries. That doesn't look good, and it might point to a problem with the wireless link.
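The sockstat output is regular enough to parse if you want to watch these counters over time. This Python 3 sketch assumes the layout shown above (a protocol name, a colon, then name/value pairs); parse_sockstat is a name invented for the example, and SAMPLE is just the output from this article.

```python
def parse_sockstat(text):
    """Parse sockstat-style text into {protocol: {counter: int}}."""
    stats = {}
    for line in text.splitlines():
        proto, sep, rest = line.partition(":")
        if not sep:
            continue
        words = rest.split()
        # Counters come in name/value pairs: "inuse 3 orphan 0 ..."
        stats[proto.strip()] = {
            name: int(value) for name, value in zip(words[0::2], words[1::2])
        }
    return stats


SAMPLE = """\
sockets: used 49
TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 2
UDP: inuse 2 mem 2
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"""

if __name__ == "__main__":
    stats = parse_sockstat(SAMPLE)
    print(stats["TCP"]["inuse"], stats["sockets"]["used"])  # prints: 3 49
```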
But in our current predicament, head straight to stack to get a current stack trace, similar to the information gdb might give you:
$ cat stack
[ ] sk_wait_data+0xaa/0xc0
[ ] tcp_recvmsg+0x4e9/0xbe0
[ ] sock_common_recvmsg+0x48/0x60
[ ] sock_recvmsg+0xf5/0x120
[ ] sys_recvfrom+0xbd/0x150
[ ] sys_recv+0x3b/0x40
[ ] sys_socketcall+0x1a2/0x280
[ ] syscall_call+0x7/0xb
[ ] 0xffffffff
Sadly, in this case, the culprit is a Python script. All the stack trace shows me is that the process is waiting in the kernel to receive data on a TCP socket -- not where in the Python script that's happening. What then?
Dumping the stack from Python
If the problem is in a Python script you control, it's easy to modify your script to let you request a full stack trace at any time. The key is to set up a signal handler on the user-defined signal SIGUSR1.
Python makes this easy. The signal module lets you trap signals, and the traceback module will give you a Python stack trace. Add this anywhere in your script:
import traceback, signal

# Print a current stack trace:
def dumpstack(signum, frame):
    print "Caught SIGUSR1: dumping stack!\n"
    print '\n'.join([ a + ' ' + str(b) + ': ' + d \
                      for (a, b, c, d) in traceback.extract_stack() ])

# Call dumpstack whenever we receive SIGUSR1:
signal.signal(signal.SIGUSR1, dumpstack)
Using print assumes the program is running from somewhere where you'll see the output. If this is a system daemon, you may prefer to write the stack to a log file or some other predictable place.
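For a daemon, the handler might look something like the following sketch. It is written for Python 3 rather than the Python 2 of the listing above, and the function name make_stack_dumper and the log path are placeholders I made up; substitute whatever location suits your system.

```python
import signal
import time
import traceback

LOG_PATH = "/tmp/getstuff-stack.log"  # hypothetical location; pick your own


def make_stack_dumper(log_path):
    """Build a signal handler that appends a stack trace to log_path."""
    def dumpstack(signum, frame):
        with open(log_path, "a") as log:
            log.write("Caught signal %d at %s: dumping stack\n"
                      % (signum, time.strftime("%Y-%m-%d %H:%M:%S")))
            # 'frame' is the frame that was running when the signal arrived,
            # so the handler itself does not appear in the trace.
            log.write("".join(traceback.format_stack(frame)))
    return dumpstack


# Call the dumper whenever we receive SIGUSR1:
signal.signal(signal.SIGUSR1, make_stack_dumper(LOG_PATH))
```

Opening the log in append mode means repeated signals build up a history instead of overwriting it. Python 3 also ships a faulthandler module whose register() function can dump tracebacks on a signal, which may be enough if you don't need a custom format.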
Once you've built in signal handling, use kill to send the USR1 signal to your process ID:
$ kill -USR1 9997
The program prints:
Caught SIGUSR1: dumping stack!

/home/akkana/bin/getstuff.py 14: response = urllib2.urlopen("http://example.com/testcgi/")
/usr/lib/python2.6/urllib2.py 126: return _opener.open(url, data, timeout)
/usr/lib/python2.6/urllib2.py 391: response = self._open(req, data)
/usr/lib/python2.6/urllib2.py 409: '_open', req)
/usr/lib/python2.6/urllib2.py 369: result = func(*args)
/usr/lib/python2.6/urllib2.py 1161: return self.do_open(httplib.HTTPConnection, req)
/usr/lib/python2.6/urllib2.py 1134: r = h.getresponse()
/usr/lib/python2.6/httplib.py 986: response.begin()
/usr/lib/python2.6/httplib.py 391: version, status, reason = self._read_status()
/usr/lib/python2.6/httplib.py 349: line = self.fp.readline()
/usr/lib/python2.6/socket.py 397: data = recv(1)
/home/akkana/bin/getstuff.py 9: for (a, b, c, d) in traceback.extract_stack() ])
Traceback (most recent call last):
  File "/home/akkana/bin/getstuff.py", line 14, in
    response = urllib2.urlopen("http://example.com/testcgi/")
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError:
Voilà! Now I know exactly where the program is failing: it's in Python's urllib2.urlopen, called from line 14 of my Python script. If I need to set a different timeout, or otherwise guard against a process that never returns, I know where to put the code.
Did you notice the last line?
Sending the signal interrupted what the program was doing, reading from the network, and made that call return an error. It doesn't necessarily kill the program -- but your program may exit if you don't handle exceptions.
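Here is one way the guarded call might look: a sketch in Python 3, where urllib2 has become urllib.request. The fetch helper and its 10-second default are my own illustration, not code from getstuff.py. As far as I know, Python 3.5 and later also retry a system call that a signal interrupts, as long as the handler returns normally, so on those versions the stack dump alone shouldn't abort the read.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body, or None if the request fails or stalls.

    'timeout' is in seconds and applies to blocking socket operations
    such as connecting and reading.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, socket.timeout and ConnectionError are all
        # OSError subclasses, so one clause covers them.
        print("fetch of %s failed: %s" % (url, err))
        return None
```

With this in place, a server that accepts the connection but never answers costs the script at most the timeout instead of hanging forever, and the caller can decide whether to retry or give up when fetch returns None.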