February 21, 2019

Advanced Linux Server Troubleshooting (part 2)

All about /proc, and how to debug Python

  • December 9, 2010
  • By Akkana Peck

You know the basics of how to find out what an errant process is doing -- or if not, you may want to review Troubleshooting part I first.

But sometimes those methods aren't enough. What if the failed process is on a server, or a minimal system like a SheevaPlug, and you don't have tools like gdb and strace installed? Or what if the runaway process is in Python, so your gdb stack trace isn't any help? What are your options then?

Getting info without gdb or strace: /proc

Even if you don't have any system tools installed, you can get a lot of the same information from the /proc filesystem.

Suppose you have a process that seems frozen and you want to know why. Get its process ID with ps and grep:

$  ps au | grep getstuff
akkana    9997  0.3  0.1   8444  4920 pts/2    S+   20:12   0:00 python /home/akkana/bin/getstuff.py
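
If the box is so stripped down that you'd rather not rely on ps either, the same lookup can be done by reading /proc directly. Here's a rough Python 3 sketch of that idea; find_pids and its proc_root parameter are names I made up for illustration, not anything standard:

```python
import os

def find_pids(pattern, proc_root="/proc"):
    """Return sorted (pid, command line) pairs whose command line contains pattern."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():      # only the numeric directories are processes
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:              # the process exited, or we lack permission
            continue
        # Arguments in /proc/PID/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if pattern in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)
```

Calling find_pids("getstuff") on a Linux system should list the same process that ps and grep found, as long as /proc is mounted.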

Armed with the process ID, cd /proc/processID and poke around.

$ cd  /proc/9997
$ ls
attr/            cpuset   io        mountinfo   pagemap      smaps    task/
auxv             cwd@     latency   mounts      personality  stack    wchan
cgroup           environ  limits    mountstats  root@        stat
clear_refs       exe@     loginuid  net/        sched        statm
cmdline          fd/      maps      oom_adj     schedstat    status
coredump_filter  fdinfo/  mem       oom_score   sessionid    syscall

There's all sorts of useful information there. For instance, status tells you the current state of the process. environ gives you the environment variables the process is using, like ps aueww -- not useful in this case, but you may need it some day.

ls -l fd will show you all the files the process has open and where they live on the file system. It also tells you about other file-like objects like network sockets.
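
If you'd rather collect this from a script than by hand, the status, environ and fd entries are easy to read programmatically. The following is only a sketch in Python 3; the function names (parse_status, parse_environ, open_files) and the proc_root parameter are hypothetical helpers, not part of any library:

```python
import os

def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/PID/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def parse_environ(raw):
    """Split the NUL-separated bytes of /proc/PID/environ into a dict."""
    env = {}
    for item in raw.split(b"\0"):
        name, sep, value = item.partition(b"=")
        if sep:
            env[name.decode(errors="replace")] = value.decode(errors="replace")
    return env

def open_files(pid, proc_root="/proc"):
    """Map each open file descriptor number to what it points at."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to a path, or to something like socket:[12345].
            targets[int(fd)] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:      # the descriptor was closed while we were looking
            continue
    return targets
```

For the example process, something like parse_status(open("/proc/9997/status").read())["State"] should give the same state that cat status shows, and open_files(9997) corresponds to ls -l fd. Note that environ has to be opened in binary mode ("rb") for parse_environ.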

The net directory is full of arcane information mostly of use to networking gurus, but even the rest of us can sometimes glean useful information:

$ cat net/sockstat
sockets: used 49
TCP: inuse 3 orphan 0 tw 0 alloc 4 mem 2
UDP: inuse 2 mem 2
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

$ cat net/wireless
Inter-| sta-|   Quality        |   Discarded packets               | Missed | WE
 face | tus | link level noise |  nwid  crypt   frag  retry   misc | beacon | 22
 mlan0: 0001    0   197   188        0      0 9342130 506582 1258102        0

Hmm, lots of retries. That doesn't look good! Might point to a problem.
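
Numbers like these are most useful when you can compare them over time, so it can help to pull them into a script. Here's a minimal Python 3 sketch that parses the sockstat format shown above; parse_sockstat is a hypothetical helper name, and it assumes every line alternates counter names and integer values the way this sample does:

```python
def parse_sockstat(text):
    """Parse net/sockstat output into {protocol: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        proto, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        # After the protocol name, the line alternates counter names and values.
        stats[proto.strip()] = {
            name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])
        }
    return stats
```

Fed the output above, parse_sockstat(text)["TCP"]["inuse"] would be 3, which you could log every few minutes to see whether connections are piling up.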

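But in our current predicament, head straight to stack for the process's current kernel stack trace. It shows which system call the process is sitting in, much as a gdb backtrace shows the user-space side (on newer kernels you may need to be root to read it):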

$  cat stack
[] sk_wait_data+0xaa/0xc0
[] tcp_recvmsg+0x4e9/0xbe0
[] sock_common_recvmsg+0x48/0x60
[] sock_recvmsg+0xf5/0x120
[] sys_recvfrom+0xbd/0x150
[] sys_recv+0x3b/0x40
[] sys_socketcall+0x1a2/0x280
[] syscall_call+0x7/0xb
[] 0xffffffff

Sadly, in this case, the culprit is a Python script. All the stack trace shows me is that the process is waiting in the kernel to receive data on a TCP socket -- not where in the Python script that call came from. What then?

Dumping the stack from Python

If the problem is in a Python script you control, it's easy to modify your script to let you request a full stack trace at any time. The key is to set up a signal handler on the user-defined signal SIGUSR1.

Python makes this easy. The signal module lets you trap signals, and the traceback module will give you a Python stack trace. Add this anywhere in your script:

import traceback, signal

# Print a current stack trace:
def dumpstack(signum, frame):
  print "Caught SIGUSR1: dumping stack!\n"
  print '\n'.join([ a + ' ' + str(b) + ': ' + d \
                   for (a, b, c, d) in traceback.extract_stack() ])

# Call dumpstack whenever we receive SIGUSR1:
signal.signal(signal.SIGUSR1, dumpstack)

Using print assumes the program is running from somewhere where you'll see the output. If this is a system daemon, you may prefer to write the stack to a log file or some other predictable place.
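
For the daemon case, here's one way the log-file variant might look. This is a sketch written for Python 3 rather than the Python 2 of the example above; make_stack_dumper and the log path are my own placeholders, so substitute whatever location your daemon can actually write to:

```python
import signal
import time
import traceback

def make_stack_dumper(log_path):
    """Build a signal handler that appends the current stack to log_path."""
    def dump_stack(signum, frame):
        # frame is where the program was executing when the signal arrived,
        # so the handler itself does not show up in the trace.
        with open(log_path, "a") as log:
            log.write("%s caught signal %d, stack:\n"
                      % (time.strftime("%Y-%m-%d %H:%M:%S"), signum))
            log.writelines(traceback.format_stack(frame))
    return dump_stack

if __name__ == "__main__":
    # The log location is an assumption; pick one your daemon can write to.
    signal.signal(signal.SIGUSR1, make_stack_dumper("/tmp/getstuff-stack.log"))
```

Passing the frame argument to traceback.format_stack means the trace starts from the interrupted code instead of from the handler. On Python 3 you may not need to write any of this yourself: the standard library's faulthandler module has a register() function that dumps a traceback when a given signal arrives.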

Once you've built in signal handling, use kill to send the USR1 signal to your process ID:

$  kill -USR1 9997

The program prints:

Caught SIGUSR1: dumping stack!

/home/akkana/bin/getstuff.py 14: response = urllib2.urlopen("http://example.com/testcgi/")
/usr/lib/python2.6/urllib2.py 126: return _opener.open(url, data, timeout)
/usr/lib/python2.6/urllib2.py 391: response = self._open(req, data)
/usr/lib/python2.6/urllib2.py 409: '_open', req)
/usr/lib/python2.6/urllib2.py 369: result = func(*args)
/usr/lib/python2.6/urllib2.py 1161: return self.do_open(httplib.HTTPConnection, req)
/usr/lib/python2.6/urllib2.py 1134: r = h.getresponse()
/usr/lib/python2.6/httplib.py 986: response.begin()
/usr/lib/python2.6/httplib.py 391: version, status, reason = self._read_status()
/usr/lib/python2.6/httplib.py 349: line = self.fp.readline()
/usr/lib/python2.6/socket.py 397: data = recv(1)
/home/akkana/bin/getstuff.py 9: for (a, b, c, d) in traceback.extract_stack() ])
Traceback (most recent call last):
  File "/home/akkana/bin/getstuff.py", line 14, in <module>
    response = urllib2.urlopen("http://example.com/testcgi/")
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)

Voilà! Now I know exactly where the program is failing: it's in Python's urllib2.urlopen, called from line 14 of my Python script. If I need to set a different timeout, or otherwise guard against a process that never returns, I know where to put the code.

Did you notice the traceback at the end of that output, ending in raise URLError(err)? Sending the signal interrupted what the program was doing, reading from the network, and made that call return an error. That doesn't necessarily kill the program -- but your program may exit if you don't catch the exception.
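
To make the fix concrete, here's roughly what guarding that network call could look like. It's a sketch for Python 3, where urllib.request takes the place of the urllib2 module used in this article; fetch and its default timeout value are illustrative choices of mine, not anything from the original script:

```python
import urllib.request

def fetch(url, timeout=10.0):
    """Return the response body, or None if the request fails or stalls."""
    try:
        # timeout is in seconds and applies to each blocking socket operation
        # (connecting, reading), not to the request as a whole.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError and socket.timeout are both subclasses of OSError.
        print("fetch of %s failed: %s" % (url, err))
        return None
```

With something like this in place, a server that accepts the connection and then goes silent costs you the timeout instead of hanging the script forever. One difference worth knowing: since Python 3.5, a system call interrupted by a signal is normally retried once the handler returns, so there the stack dump should not by itself make urlopen fail the way it did in the Python 2.6 run above.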
