RT Watchdog

From RTwiki
Revision as of 23:11, 6 October 2006 by Vhmauery

While developing real-time software, it is nice to have something covering your back in case, heaven forbid, an infinite loop slips into your code and runs at a real-time priority. Real-time development is especially dangerous on a uniprocessor (UP) system, where a runaway task can leave you hitting the reset button frequently if you are not very careful.

Originally, the POSIX CPU timers were disabled in the -rt tree because handling them is a high-latency procedure that requires lots of locks. John Stultz put in a lot of time to move this call out of interrupt context and into a per-CPU process. But before the POSIX CPU timers got fixed in the -rt tree, I wrote this program, which goes on a killing spree if the canary ever gets starved. It goes something like this:

rt-watchdog is a general userspace watchdog program that prevents the system from being taken over by runaway real-time processes. It launches two threads: the first is a SCHED_FIFO priority 99 thread that runs periodically; the second is a SCHED_OTHER thread that acts as the canary. When the canary is starved, it stops singing, and the watchdog presumes that some number of processes (often as many as there are CPUs) are hogging all the CPU time. When this happens, it takes some action to remedy the situation. The default action is to kill one or more processes: it looks at all the real-time threads and processes currently running, determines which ones are taking the most time, and kills them systematically.
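The canary idea can be illustrated with a small sketch. This is not the rt-watchdog source: the names Canary and canary_starved are hypothetical, and the sketch uses ordinary Python threads without setting any scheduling policy, whereas the real tool relies on a SCHED_FIFO 99 watchdog thread checking a SCHED_OTHER canary. It only shows the heartbeat logic: the canary bumps a counter whenever it gets CPU time, and the watchdog compares the counter between two of its periodic wakeups.

```python
import threading
import time


class Canary:
    """Low-priority heartbeat: bumps a counter each time it gets to run."""

    def __init__(self, interval=0.01):
        self.interval = interval
        self.beats = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sing, daemon=True)

    def _sing(self):
        # Each loop iteration is one "note"; a starved canary never gets here.
        while not self._stop.is_set():
            self.beats += 1
            self._stop.wait(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def canary_starved(beats_before, beats_after):
    """Watchdog check: no new beats in one period means the canary got no CPU."""
    return beats_after == beats_before


if __name__ == "__main__":
    canary = Canary()
    canary.start()
    seen = canary.beats
    time.sleep(0.2)  # stands in for one watchdog period
    print("starved" if canary_starved(seen, canary.beats) else "singing")
    canary.stop()
```

In the real program the check only means something because the watchdog outranks every other task while the canary is outranked by every real-time task; without those priorities this sketch merely demonstrates the bookkeeping.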

When the canary stops singing, the watchdog takes a pre-defined action based on command-line arguments; it either kills runaway processes with a specified signal, reboots the machine, or executes a specified program to remedy the situation. In addition to taking some action, it also dumps messages to syslog.
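As a rough illustration of the default kill action, the sketch below ranks tasks by the CPU time they consumed between two samples and signals the worst offenders. The page does not say how rt-watchdog gathers its numbers, so reading utime, stime, and the scheduling policy from /proc/<pid>/stat is an assumption, and parse_stat, is_realtime, pick_hogs, and kill_hogs are hypothetical names, not part of the tool.

```python
import os
import signal

# Linux scheduling policy numbers as they appear in /proc/<pid>/stat.
SCHED_FIFO, SCHED_RR = 1, 2


def parse_stat(line):
    """Return (cpu_ticks, policy) from one /proc/<pid>/stat line.

    The command name (field 2) may contain spaces or parentheses, so split
    after the last ')'. What remains starts at field 3 (state).
    """
    rest = line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    policy = int(rest[38])                       # field 41
    return utime + stime, policy


def is_realtime(policy):
    """True for the two real-time policies a runaway task could be using."""
    return policy in (SCHED_FIFO, SCHED_RR)


def pick_hogs(before, after, count):
    """Rank PIDs by CPU ticks used between two samples; return the top `count`."""
    used = {pid: after[pid] - before.get(pid, 0) for pid in after}
    ranked = sorted(used, key=used.get, reverse=True)
    return [pid for pid in ranked[:count] if used[pid] > 0]


def kill_hogs(pids, sig=signal.SIGKILL, kill=os.kill):
    """Send `sig` to each PID; `kill` is injectable so the sketch can be dry-run."""
    for pid in pids:
        kill(pid, sig)
```

A caller would sample only the tasks for which is_realtime is true, wait one watchdog period, sample again, and pass both dictionaries to pick_hogs with a count such as the number of CPUs. Passing a stand-in for `kill` makes it possible to see which PIDs would be signalled without actually killing anything.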

Be aware that this program was meant for debugging and development of a real-time system. If the system is simply overloaded by too many real-time processes and the canary is starved of CPU time, rt-watchdog will start killing processes even though they are not technically runaway. rt-watchdog was not intended for production use.
