Code/rt-watchdog.8

From RTwiki

.\" Process this file with
.\" groff -man -Tascii rt-watchdog.8
.\"
.\"{{{}}}
.\"{{{ Title
.TH RT\-WATCHDOG 8 "Mar 24, 2006" "" "Linux System Administrator's Manual"
.\"}}}
.\"{{{ Name
.SH NAME
rt\-watchdog \- a real-time userspace watchdog
.\"}}}
.\"{{{ Synopsis
.\" Usage: watchdog [-t n] [-v] [-d] [-i] [-n p] [-s signal_num]
.\" watchdog [-t n] [-v] [-d] [-i] -r
.\" watchdog [-t n] [-v] [-d] [-i] -e program args...
.\" watchdog [-k]
.SH SYNOPSIS
.\" Kill mode
.B rt\-watchdog
.RB [ \-t
.IR sec ]
.RB [ \-d ]
.RB [ \-v ]
.RB [ \-i
.IR msec ]
.RB [ \-n
.IR pid ]
.RB [ \-s
.IR signal_num ]
.br
.\" Reboot mode
.B rt\-watchdog
.RB [ \-t
.IR sec ]
.RB [ \-d ]
.RB [ \-v ]
.RB [ \-i
.IR msec ]
.B \-r
.br
.\" Execute mode
.B rt\-watchdog
.RB [ \-t
.IR sec ]
.RB [ \-d ]
.RB [ \-v ]
.RB [ \-i
.IR msec ]
.B \-e
.I "program args..."
.br
.\" kill daemon
.B rt\-watchdog
.B \-k
.br
.\" help
.B rt\-watchdog
.B \-h
.SH DESCRIPTION
.B rt\-watchdog
is a general-purpose userspace watchdog program designed to keep the system
from being taken over by runaway real-time processes.
It launches two threads: a SCHED_FIFO thread at priority 99 that runs
periodically, and a SCHED_OTHER thread that serves as the canary.
When the canary is starved of CPU time, it stops singing, and the watchdog
presumes that some number of real-time processes (often as many as there are
CPUs) are hogging all of the CPU time.
When this happens, the watchdog takes action to remedy the situation.
The default action is to kill one or more processes: the watchdog examines all
real-time threads and processes currently running, determines which ones are
consuming the most CPU time, and kills them one at a time.

When the canary stops singing, the watchdog takes a predefined action selected
by command-line arguments: it kills runaway processes with a specified signal,
reboots the machine, or executes a specified program.
In addition to taking action, it logs messages to syslog.

Be aware that this program was meant for debugging and development of a
real-time system.
If the system is simply overloaded by too many real-time processes and the
canary is starved of CPU time, the watchdog will start killing processes even
if they are not technically runaway.
.B rt\-watchdog
was not intended for production use.
.SH OPTIONS
.IP "-t sec"
Run for
.I sec
seconds and then terminate.
The default, \-1, means run indefinitely.
.IP -d
Run in debug mode; do not daemonize and detach from the controlling terminal.
.IP -v
Run with verbose messages.
This implies
.B \-d
so there is a terminal to write messages to.
.IP \-i
Interval, in seconds, at which the watchdog checks on the canary.
The canary runs 10 times as often, so that it has a chance to sing between
watchdog checks.
This defaults to 3 seconds; intervals less than 1 are not allowed.
If the system is expected to be under heavy load for longer than this interval,
the watchdog is in danger of going off (and taking action).
In that case, try a larger interval to give the canary a longer chance to sing
between watchdog checks.
.IP "-n pid"
Do not kill the process whose process ID is
.IR pid .
This option can be given multiple times, but should be used with caution.
The watchdog will never kill itself, but any other system-critical processes
that must not be killed can be listed here.
.IP -k
Kill the currently running
.B rt\-watchdog
daemon.
Normally, the daemon writes its process ID to a pid file; this option reads the
pid from that file and terminates that process.
Note that with this option, all other options are ignored.
.IP -h
Display a short help message listing the options.
.IP "\fBWATCHDOG ACTIONS\fR"
Only one of the following actions can be taken when the watchdog notices that
the canary has been starved.
The default is
.BR "\-s 9" ,
that is, to kill runaway real-time processes with SIGKILL.
.IP "-s signal_num"
Kill runaway real-time processes with signal
.I signal_num
when the canary is starved between watchdog checks.
This is the default action for
.B rt\-watchdog
and the default signal is SIGKILL.
A process is considered a runaway if it is the real-time thread or process
that consumed the most CPU time during a single watchdog interval.
The watchdog kills that process (or the process the thread belongs to) with the
specified signal.
After killing a process, the watchdog continues to check on the canary; if the
canary still does not sing, the watchdog kills another.
.IP -r
Reboot the system when the canary is starved between watchdog checks.
To reboot,
.B rt\-watchdog
calls
.BR sync (2)
and then
.BR reboot (2),
so be sure that those calls will successfully reboot your machine.
.IP -e
Execute
.I "program args..."
when the canary is starved between watchdog checks.
.SH BUGS
No documented bugs.
All of those have been fixed.
.SH AUTHOR
Vernon Mauery <vernux@us.ibm.com>
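The two-thread design described under DESCRIPTION (a SCHED_FIFO 99 watchdog checking on a SCHED_OTHER canary) can be sketched in C with POSIX threads. This is a minimal illustration under stated assumptions, not rt-watchdog's actual source: every name here (`run_watchdog`, `canary_song`, `struct watchdog_cfg`) is hypothetical, the interval is taken in milliseconds so the sketch finishes quickly, and the remedial action is reduced to counting starved checks. The watchdog thread requests SCHED_FIFO priority 99 with `pthread_setschedparam(3)`, which needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO; without it, the sketch keeps running at normal priority.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical names throughout; this is not rt-watchdog's source. */
static atomic_ulong canary_song;   /* bumped each time the canary runs */
static atomic_bool canary_stop;    /* tells the canary thread to exit */

struct watchdog_cfg {
    unsigned int interval_ms;  /* check period (assumption: ms, not the real -i seconds) */
    int checks;                /* how many checks to run before returning */
    int starved;               /* out: checks on which the canary was silent */
    bool realtime;             /* out: true if SCHED_FIFO 99 was granted */
};

/* Sleep for ms milliseconds, resuming if interrupted by a signal. */
static void sleep_ms(unsigned int ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (long)(ms % 1000) * 1000000L };
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}

/* Canary: an ordinary SCHED_OTHER thread. If real-time tasks keep every
 * CPU busy it is never scheduled, so the counter stops advancing. */
static void *canary_thread(void *arg)
{
    unsigned int period_ms = *(const unsigned int *)arg;
    while (!atomic_load(&canary_stop)) {
        atomic_fetch_add(&canary_song, 1);   /* "sing" */
        sleep_ms(period_ms);
    }
    return NULL;
}

/* Watchdog: raises itself to SCHED_FIFO 99 so it still runs when the
 * canary cannot, then checks once per interval whether the canary sang. */
static void *watchdog_thread(void *arg)
{
    struct watchdog_cfg *cfg = arg;
    struct sched_param sp = { .sched_priority = 99 };
    /* Fails with EPERM without privileges; the sketch then carries on
     * at normal priority instead of giving up. */
    cfg->realtime = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;

    unsigned long last = atomic_load(&canary_song);
    for (int i = 0; i < cfg->checks; i++) {
        sleep_ms(cfg->interval_ms);
        unsigned long now = atomic_load(&canary_song);
        if (now == last)
            cfg->starved++;   /* the real tool would kill, reboot or exec here */
        last = now;
    }
    return NULL;
}

/* Start both threads, wait for cfg->checks watchdog intervals, and return
 * the number of starved checks (or -1 if a thread could not be created). */
static int run_watchdog(struct watchdog_cfg *cfg)
{
    assert(cfg != NULL && cfg->interval_ms > 0);
    unsigned int canary_ms = cfg->interval_ms / 10;   /* canary sings 10x as often */
    if (canary_ms == 0)
        canary_ms = 1;
    pthread_t canary, dog;

    cfg->starved = 0;
    atomic_store(&canary_stop, false);
    if (pthread_create(&canary, NULL, canary_thread, &canary_ms) != 0)
        return -1;
    if (pthread_create(&dog, NULL, watchdog_thread, cfg) != 0) {
        atomic_store(&canary_stop, true);
        pthread_join(canary, NULL);
        return -1;
    }
    pthread_join(dog, NULL);
    atomic_store(&canary_stop, true);
    pthread_join(canary, NULL);   /* canary_ms stays valid until here */
    return cfg->starved;
}
```

On an otherwise idle system, `struct watchdog_cfg cfg = { .interval_ms = 100, .checks = 3 }; run_watchdog(&cfg);` would be expected to return 0, since the canary gets about ten chances to sing per check. Compile with `-pthread` on toolchains that still need it.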
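The runaway-selection rule for `-s` (kill the real-time thread or process that used the most CPU time during one watchdog interval) can be sketched as follows. This assumes, as an illustration rather than a description of rt-watchdog's code, that each thread is sampled from `/proc/<tgid>/task/<tid>/stat` at the start and end of an interval; per proc(5), field 14 is `utime`, field 15 is `stime`, and field 41 is the scheduling policy. The names `parse_stat`, `pick_runaway` and `struct task_sample` are hypothetical, and the caller is assumed to put the watchdog's own pid (and any `-n` pids) in the exempt list.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>       /* SCHED_FIFO, SCHED_RR */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* One sample of one thread (hypothetical type for this sketch). */
struct task_sample {
    pid_t tgid;                /* owning process, taken from the /proc path */
    pid_t tid;                 /* stat field 1 */
    int policy;                /* stat field 41: scheduling policy */
    unsigned long long ticks;  /* stat fields 14 + 15: utime + stime */
};

/* Parse one line of /proc/<tgid>/task/<tid>/stat into out (tgid is left
 * for the caller). comm (field 2) may contain spaces and ')', so parsing
 * resumes after the LAST ')'. Returns false on a malformed line. */
static bool parse_stat(const char *line, struct task_sample *out)
{
    const char *p = strrchr(line, ')');
    int tid;
    if (p == NULL || sscanf(line, "%d", &tid) != 1)
        return false;
    p++;                                   /* fields 3.. follow */

    unsigned long long utime = 0, stime = 0;
    long policy = -1;
    for (int field = 3; field <= 41; field++) {
        while (*p == ' ')
            p++;
        if (*p == '\0' || *p == '\n')
            return false;                  /* line too short */
        if (field == 14)
            utime = strtoull(p, NULL, 10);
        else if (field == 15)
            stime = strtoull(p, NULL, 10);
        else if (field == 41)
            policy = strtol(p, NULL, 10);
        while (*p != ' ' && *p != '\0' && *p != '\n')
            p++;                           /* skip the rest of this field */
    }
    out->tid = (pid_t)tid;
    out->policy = (int)policy;
    out->ticks = utime + stime;
    return true;
}

static bool is_exempt(pid_t tgid, const pid_t *exempt, size_t n_exempt)
{
    for (size_t i = 0; i < n_exempt; i++)
        if (exempt[i] == tgid)
            return true;
    return false;
}

/* before[i] and after[i] describe the same thread, sampled one watchdog
 * interval apart. Returns the tgid owning the real-time thread with the
 * largest CPU-time delta, skipping exempt processes, or -1 if no
 * real-time thread ran during the interval. */
static pid_t pick_runaway(const struct task_sample *before,
                          const struct task_sample *after, size_t n,
                          const pid_t *exempt, size_t n_exempt)
{
    pid_t victim = -1;
    unsigned long long best = 0;
    for (size_t i = 0; i < n; i++) {
        assert(before[i].tid == after[i].tid);
        bool realtime = after[i].policy == SCHED_FIFO ||
                        after[i].policy == SCHED_RR;
        if (!realtime || is_exempt(after[i].tgid, exempt, n_exempt))
            continue;
        unsigned long long delta = after[i].ticks - before[i].ticks;
        if (delta > best) {
            best = delta;
            victim = after[i].tgid;
        }
    }
    return victim;
}
```

Returning the owning process rather than the thread mirrors the man page's wording that the watchdog kills "the process the thread belongs to". A caller would then send the chosen signal to the returned pid with `kill(2)` and, if the canary is still silent at the next check, sample again and pick another.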
