RT Watchdog

From RTwiki
Revision as of 15:20, 7 October 2006 by Tytso (Talk | contribs)


While developing realtime software, it helps to have a safety net in case an infinite loop slips into your code and runs at a realtime priority. This is especially dangerous on a uniprocessor (UP) system, where a runaway realtime task can monopolize the only CPU and leave you reaching for the reset button if you are not very careful.

The rt-watchdog program is a general userspace watchdog that prevents the system from being taken over by runaway real-time processes. It launches two threads: a SCHED_FIFO thread at priority 99 that runs periodically, and a SCHED_OTHER thread that acts as the canary. When the canary is starved, it stops singing, and the watchdog presumes that some number of processes (often as many as there are CPUs) are hogging all the CPU time. It then takes action to remedy the situation. The default action is to kill one or more processes: the watchdog examines all the real-time threads and processes currently running, determines which ones are consuming the most CPU time, and kills them systematically.
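The canary mechanism can be illustrated with a short C sketch. This is not the rt-watchdog source; the names heartbeat, canary_thread, canary_starved, watchdog_make_realtime, and the 10 ms singing interval are all assumptions made for illustration. The canary increments a shared counter from an ordinary SCHED_OTHER thread, and the watchdog checks whether the counter has moved since its last look.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical shared state: the canary "sings" by bumping this counter. */
static atomic_ulong heartbeat;
static atomic_bool canary_stop;

/* Sleep for the given number of milliseconds. */
void sleep_ms(long ms)
{
    assert(ms >= 0);
    struct timespec ts = {
        .tv_sec = ms / 1000,
        .tv_nsec = (ms % 1000) * 1000000L,
    };
    nanosleep(&ts, NULL);
}

/* Canary: runs with the default SCHED_OTHER policy, so runaway
 * real-time tasks that occupy every CPU keep it from being scheduled. */
void *canary_thread(void *arg)
{
    (void)arg;
    while (!atomic_load(&canary_stop)) {
        atomic_fetch_add(&heartbeat, 1);
        sleep_ms(10); /* assumed singing interval */
    }
    return NULL;
}

/* Raise the calling thread to SCHED_FIFO priority 99.
 * Returns 0 on success or an errno value (typically EPERM when the
 * caller lacks the privilege to use real-time policies). */
int watchdog_make_realtime(void)
{
    struct sched_param sp = { .sched_priority = 99 };
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

/* Watchdog check: true if the counter has not advanced since the
 * previous call. Updates *last_seen for the next check. */
bool canary_starved(unsigned long *last_seen)
{
    unsigned long now = atomic_load(&heartbeat);
    bool starved = (now == *last_seen);
    *last_seen = now;
    return starved;
}
```

In this sketch, the watchdog thread would call watchdog_make_realtime() once and then loop: sleep for a period comfortably longer than the canary's interval, call canary_starved(), and take action when it returns true. A real watchdog would likely require several consecutive missed beats before acting, to avoid reacting to a brief load spike.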

When the canary stops singing, the watchdog takes a predefined action selected by command-line arguments: it kills the runaway processes with a specified signal, reboots the machine, or executes a specified program to remedy the situation. Whatever the action, it also logs messages to syslog.
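The three actions could be dispatched roughly as follows. This is a minimal sketch under stated assumptions, not the actual rt-watchdog code: the wd_action enum, the wd_config struct, and wd_take_action are hypothetical names, and the real program's command-line options are not shown here because the page does not list them. The reboot path uses the Linux-specific reboot() call from glibc and needs root privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/reboot.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical configuration, as it might look after argument parsing. */
enum wd_action { WD_KILL, WD_REBOOT, WD_EXEC };

struct wd_config {
    enum wd_action action;
    int signo;            /* signal to send for WD_KILL */
    const char *program;  /* path to run for WD_EXEC */
};

/* Log to syslog, then carry out the configured action.
 * Returns 0 on success for WD_KILL, the helper's exit status for
 * WD_EXEC, and -1 on failure. WD_REBOOT does not return on success. */
int wd_take_action(const struct wd_config *cfg, pid_t runaway)
{
    assert(cfg != NULL);
    syslog(LOG_CRIT, "canary starved; suspect pid %ld", (long)runaway);

    switch (cfg->action) {
    case WD_KILL:
        return kill(runaway, cfg->signo);

    case WD_EXEC: {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execl(cfg->program, cfg->program, (char *)NULL);
            _exit(127); /* exec failed */
        }
        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    case WD_REBOOT:
        sync(); /* flush dirty buffers before restarting */
        return reboot(RB_AUTOBOOT);
    }
    return -1;
}
```

For example, a configuration of { WD_KILL, SIGKILL, NULL } would send SIGKILL to the process chosen as the worst CPU consumer. Calling wd_take_action() once per offending process, in order of CPU usage, matches the "kills them systematically" behavior described above.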

Be aware that this program is meant for debugging and developing real-time systems. If the system is simply overloaded by too many real-time processes and the canary is starved of CPU time, the watchdog will start killing processes even though they are not technically runaway. rt-watchdog is not intended for production use.

Documentation

FIXME
Add documentation from man page, and from the textual description above

Source Code
