RT Watchdog

From RTwiki

Revision as of 10:37, 10 October 2006

While developing realtime software, it helps to have a safety net in case an infinite loop slips into your code and runs at a realtime priority. This matters most on a uniprocessor (UP) system, where a runaway realtime task can monopolize the only CPU and leave you reaching for the reset button unless you are very careful.

The rt-watchdog program, written by User:Vhmauery, is a general userspace watchdog that prevents the system from being taken over by runaway real-time processes. It launches two threads: a SCHED_FIFO thread at priority 99 that runs periodically, and a SCHED_OTHER thread that acts as the canary. When the canary is starved, it stops singing, and the watchdog presumes that some number of processes (often as many as there are CPUs) are hogging all the CPU time. When this happens, it takes some action to remedy the situation. The default action is to kill one or more processes: it looks at all the real-time threads and processes currently running, determines which ones are taking the most time, and kills them systematically.
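The canary idea can be sketched in a few lines of C. This is an illustration only, not the rt-watchdog source: the names `canary_start`, `canary_halt`, `canary_starved`, and `watchdog_go_realtime`, the heartbeat counter, and the 10 ms singing interval are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Heartbeat counter: the canary "sings" by incrementing it. */
static atomic_ulong canary_beats;
static atomic_bool canary_quit;

/* Sleep for ms milliseconds (an early wakeup by a signal is ignored here). */
static void sleep_ms(unsigned ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Canary thread body. It keeps the default SCHED_OTHER policy, so any
 * runaway real-time task on the same CPU will starve it. */
static void *canary_main(void *arg)
{
    (void)arg;
    while (!atomic_load(&canary_quit)) {
        atomic_fetch_add(&canary_beats, 1);
        sleep_ms(10); /* hypothetical singing interval */
    }
    return NULL;
}

/* Start the canary thread. Returns 0 on success, an errno value otherwise. */
int canary_start(pthread_t *tid)
{
    atomic_store(&canary_quit, false);
    return pthread_create(tid, NULL, canary_main, NULL);
}

/* Ask the canary to stop and wait for it to exit. */
void canary_halt(pthread_t tid)
{
    atomic_store(&canary_quit, true);
    pthread_join(tid, NULL);
}

/* Raise the calling (watchdog) thread to SCHED_FIFO priority 99.
 * Returns 0 on success, or an errno value such as EPERM when the
 * process lacks the privilege to use real-time scheduling. */
int watchdog_go_realtime(void)
{
    struct sched_param sp;
    memset(&sp, 0, sizeof sp);
    sp.sched_priority = 99;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}

/* One watchdog period: sleep, then report whether the canary failed
 * to advance its heartbeat counter in the meantime. */
bool canary_starved(unsigned period_ms)
{
    assert(period_ms > 0);
    unsigned long before = atomic_load(&canary_beats);
    sleep_ms(period_ms);
    return atomic_load(&canary_beats) == before;
}
```

A watchdog thread built on this would call `watchdog_go_realtime()` once and then loop on `canary_starved()`, taking its configured action whenever the function returns true. The period must be comfortably longer than the canary's interval, or ordinary scheduling jitter will look like starvation.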

When the canary stops singing, the watchdog takes a pre-defined action based on command-line arguments: it kills runaway processes with a specified signal, reboots the machine, or executes a specified program to remedy the situation. Whichever action it takes, it also logs messages to syslog.
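The three remedies could be dispatched roughly as follows. This is a sketch under stated assumptions, not rt-watchdog's actual code: `struct wd_config`, `enum wd_action`, and `wd_remedy` are hypothetical names, and the real program's command-line options are not reproduced here. The system calls used (`kill`, `reboot`, `fork`, `execl`, `waitpid`, `syslog`) are the standard Linux ones.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/reboot.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical configuration, as it might look after option parsing. */
enum wd_action { WD_KILL, WD_REBOOT, WD_EXEC };

struct wd_config {
    enum wd_action action;
    int signo;           /* signal to send when action is WD_KILL */
    const char *program; /* path to run when action is WD_EXEC */
};

/* Apply the configured remedy to one offending process and log it.
 * Returns 0 on success, -1 on failure; for WD_EXEC it returns the
 * helper program's exit status. */
int wd_remedy(const struct wd_config *cfg, pid_t offender)
{
    assert(cfg != NULL);

    switch (cfg->action) {
    case WD_KILL:
        syslog(LOG_CRIT, "canary starved: sending signal %d to pid %ld",
               cfg->signo, (long)offender);
        return kill(offender, cfg->signo);

    case WD_REBOOT:
        syslog(LOG_CRIT, "canary starved: rebooting");
        sync(); /* flush filesystem buffers before restarting */
        return reboot(RB_AUTOBOOT);

    case WD_EXEC: {
        syslog(LOG_CRIT, "canary starved: running %s", cfg->program);
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execl(cfg->program, cfg->program, (char *)NULL);
            _exit(127); /* reached only if exec failed */
        }
        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }
    }
    return -1;
}
```

For example, a configuration of `{ .action = WD_KILL, .signo = SIGKILL }` passed with the PID of the busiest real-time process would terminate it and leave a record in syslog. Choosing which process is the busiest is a separate step that this sketch does not cover.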

Be aware that this program was meant for debugging and development of a real-time system. If the system is simply overloaded by too many real-time processes and the canary is starved of CPU time, the watchdog will start killing processes even though none of them is actually runaway. rt-watchdog was not intended for production use.

Documentation

FIXME
Add documentation from man page, and from the textual description above

Source Code
