Watch Zombie Processes on Linux

On Unix and Unix-like computer operating systems, a zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the “Terminated state”.

The term zombie process derives from the common definition of zombie - an undead person. In the term’s metaphor, the child process has “died” but has not yet been “reaped”. Also, unlike normal processes, the kill command has no effect on a zombie process.

Source: wikipedia.org, 2018-04-12

Resource Leak

After the zombie is removed, its process identifier (PID) and entry in the process table can then be reused. However, if a parent fails to call wait, the zombie will be left in the process table, causing a resource leak.

As with other resource leaks, the presence of a few zombies is not worrisome in itself, but may indicate a problem that would grow serious under heavier loads. Since there is no memory allocated to zombie processes – the only system memory usage is for the process table entry itself – the primary concern with many zombies is not running out of memory, but rather running out of process table entries, concretely process ID numbers.

Source: wikipedia.org, 2018-04-12
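To put a number on the PID exhaustion concern, the count of process table entries can be compared with the kernel's PID limit. The following is a minimal sketch, not a monitoring tool. The helper name pid_usage_percent is made up for illustration, and the commented commands assume a Linux host with /proc mounted and the procps version of ps. Threads also consume IDs, so a count of processes alone is a lower bound.

```shell
# pid_usage_percent USED MAX
# Print the integer percentage of the PID space that is in use.
# (Hypothetical helper; the name is not part of any standard tool.)
pid_usage_percent() {
    used=$1
    max=$2
    echo $(( used * 100 / max ))
}

# On a live Linux system the two inputs could be gathered like this:
#   used=$(ps -e -o pid= | wc -l)          # one line per process
#   max=$(cat /proc/sys/kernel/pid_max)    # kernel-wide PID limit
#   pid_usage_percent "$used" "$max"
```

A value that creeps upward while the zombie count grows suggests the leak described above is becoming a practical problem.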

The maximum number of processes that a single user may run can be checked with ulimit:

Check user process limit

ulimit -u

Example check all limits

tan@omega:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31775
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31775
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Example check user process limit

tan@omega:~$ ulimit -u
31775

Detect Zombie Processes

How can you detect zombies? They can be identified in the output of the ps command by a “Z” in the process state column, which is labelled “S” in the ps -el output below and “STAT” in ps aux.

vinh@omega:/etc/opt/six/fo/monit> ps -el | grep 'Z'
0 Z     0 363501   2617  0  80   0 -     0 exit   ?        00:00:00 sh <defunct>
4 Z     0 431579 130477  3  80   0 -     0 exit   ?        00:00:00 docker-current <defunct>
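A plain grep 'Z' matches any line that contains a capital Z, including commands whose name happens to contain one. A stricter variant tests only the state field. The sketch below assumes input in the form "STAT PID PPID COMMAND", as printed by procps ps with the format options shown in the comment. The function name list_zombies is hypothetical.

```shell
# list_zombies: read "STAT PID PPID COMMAND" lines on stdin and print
# only those whose state field starts with Z (zombie).
list_zombies() {
    awk '$1 ~ /^Z/ { print }'
}

# Typical use on a live system (assumes procps ps; the trailing "="
# on each column suppresses the header line):
#   ps -e -o stat= -o pid= -o ppid= -o comm= | list_zombies
```

Because the match is anchored to the first field, a running process named, for example, Zabbix is not reported.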

You can also use top, whose summary area includes a zombie count. Starting it as top -H shows the number of threads instead of processes.

top - 13:13:29 up 167 days, 21:45,  5 users,  load average: 3.98, 4.28, 4.05
Threads: 40110 total,   4 running, 40104 sleeping,   0 stopped,   2 zombie
%Cpu(s):  3.7 us,  1.6 sy,  0.1 ni, 94.4 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 13112041+total,  8070096 free, 64865496 used, 58184820 buff/cache
KiB Swap:  4194300 total,  4139660 free,    54640 used. 65108580 avail Mem

The number of zombies should be monitored. On another server:

top - 13:39:28 up 183 days,  2:56,  2 users,  load average: 3.22, 2.80, 2.45
Tasks: 2023 total,   2 running, 1622 sleeping,   0 stopped, 399 zombie
%Cpu(s):  4.1 us,  1.2 sy,  0.1 ni, 94.3 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 13112041+total,  3482208 free, 74123296 used, 53514912 buff/cache
KiB Swap:  4194300 total,  3827404 free,   366896 used. 56010040 avail Mem

To check all your servers with Ansible:

vinh@alpha:~> /bin/ansible all -m shell -a "ps -el | grep 'Z' | sed 1d | wc -l"

The sed 1d command drops the first line of output. That line is the ps header, which matches the grep because the SZ column name contains a Z, so it must not be counted.
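For monitoring, the count can also be taken from the state column alone, which avoids the header handling altogether. This is a minimal sketch under the same procps assumption. The function names count_zombies and zombie_alert are invented for illustration, and the threshold is an arbitrary example value.

```shell
# count_zombies: read one process state per line on stdin and print
# how many of them start with Z. Prints 0 when there are none.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# zombie_alert COUNT THRESHOLD: print WARN when COUNT exceeds
# THRESHOLD, and OK otherwise.
zombie_alert() {
    if [ "$1" -gt "$2" ]; then
        echo "WARN: $1 zombie processes"
    else
        echo "OK: $1 zombie processes"
    fi
}

# On a live system (assumes procps ps):
#   zombie_alert "$(ps -e -o stat= | count_zombies)" 50
```

With the figures from the two top outputs above and a threshold of 50, the first server would report OK (2 zombies) and the second WARN (399 zombies).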

Resolve Issue

The quotes are from an article by Benjamin Cane.

If the parent process of the zombie or zombies is still active (not process id 1), then this is an indication that the parent process is stalled on a certain task and has not yet read the exit status of the child processes. At this point the resolution is extremely situational; you can use the strace command to attach to the parent process and troubleshoot from there.

You may also be able to make the parent process exit cleanly, taking its zombie children with it, by gracefully stopping or restarting the process.

Example check parent process with pstree

vinh@omega:/etc/opt/six/fo/monit> ps -el | grep 'Z'
0 Z     0 363501   2617  0  80   0 -     0 exit   ?        00:00:00 sh <defunct>
4 Z     0 431579 130477  3  80   0 -     0 exit   ?        00:00:00 docker-current <defunct>

vinh@omega:/etc/opt/six/fo/monit> pstree -p -s 363501
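When there are many zombies, looking them up one by one with pstree gets tedious. Grouping them by parent PID shows which process is failing to reap its children. The sketch below again assumes "STAT PID PPID COMMAND" input from procps ps, and zombies_by_parent is a hypothetical name.

```shell
# zombies_by_parent: read "STAT PID PPID COMMAND" lines on stdin and
# print "PPID COUNT" for every parent that has zombie children,
# ordered by descending count, then by ascending PPID.
zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$3]++ } END { for (p in n) print p, n[p] }' |
        sort -k2,2nr -k1,1n
}

# On a live system (assumes procps ps):
#   ps -e -o stat= -o pid= -o ppid= -o comm= | zombies_by_parent
# A reported parent can then be inspected with, for example:
#   ps -o pid,stat,etime,args -p <PPID>
```

A parent PID of 1 in this list means the zombie has already been handed over to init. Any other PID is a candidate for the strace or restart approach described above.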

If the parent process is no longer active, then the cleanup becomes a choice; at this point you can leave the zombie processes on your system, or you can simply reboot. A zombie process whose parent is no longer active is not going to be cleaned up without rebooting the system. If the zombie processes are few in number and are not recurring or multiplying, then it may be best to leave them be until the next reboot. If, however, they are multiplying or present in large numbers, then this is an indication that there is a significant issue with your system.
