Glossary term

Watchdog Timer

Engineering definition of a watchdog timer covering fault detection, refresh rules, windowed watchdogs, safe-state timing and validation evidence.

Definition

A watchdog timer is a hardware or independent supervision timer that triggers a reset, interrupt, fault latch or safe-state action if software or a supervised function fails to refresh it correctly within a defined time window.

Watchdog timers are used in embedded systems, controllers, medical devices, power electronics, aircraft equipment and industrial automation to detect stalled execution, scheduler starvation, interrupt lockout, deadlock, corrupted state and failed health supervision. A watchdog is not a complete recovery design by itself; it must be tied to safe outputs, reset diagnosis, startup rules and validation evidence.

A watchdog timer is a supervision timer used to detect that software, a scheduler, a task set or a safety-relevant function has stopped making acceptable progress. If the watchdog is not refreshed correctly, it can reset the processor, assert an interrupt, latch a fault, disable outputs or force a safe state.

The engineering value of a watchdog is not the reset itself. The value is bounded fault detection and controlled recovery. A watchdog that simply restarts a processor while leaving outputs unsafe, losing diagnostic context or repeating the same fault provides only weak evidence of safe behavior.

Basic Timeout Rule

A simple watchdog timeout must be longer than the longest valid refresh interval and shorter than the unsafe exposure time:

T_{normal,max}<T_{wd}<T_{unsafe}

where T_normal,max is the worst acceptable normal refresh interval, T_wd is the watchdog timeout and T_unsafe is the time before stale output, uncontrolled motion, missing therapy, thermal stress or another hazard becomes unacceptable.

This inequality is only a first screen. It does not prove that the right task is being supervised.
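As a rough illustration, the screen can be expressed as a small host-side check. This is a minimal sketch in Python, not firmware; the function name `timeout_screen` and the returned field names are assumptions made for this example.

```python
def timeout_screen(t_normal_max_ms, t_wd_ms, t_unsafe_ms):
    """Apply the first-screen rule T_normal,max < T_wd < T_unsafe.

    All arguments are times in milliseconds. The result reports whether
    the rule holds and how much margin exists on each side.
    """
    return {
        # Strict inequalities, matching the rule as written.
        "passes": t_normal_max_ms < t_wd_ms < t_unsafe_ms,
        # Slack between the slowest valid refresh and the timeout.
        "refresh_margin_ms": t_wd_ms - t_normal_max_ms,
        # Slack between the timeout and the hazard window.
        "hazard_margin_ms": t_unsafe_ms - t_wd_ms,
    }


# Values taken from the worked example later in this entry.
result = timeout_screen(38, 120, 220)
```

A passing result only says the timeout sits between the two bounds. It says nothing about whether the refresh is tied to the function that matters.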

Health-Gated Refresh

A robust design refreshes the watchdog only after critical health evidence is current. For example, a firmware supervisor may require:

H=H_{control}H_{comm}H_{sensor}H_{output}=1

where each H term is a Boolean health flag (1 = healthy) based on real task progress or a fresh diagnostic result, so the product equals 1 only when every flag is set. If a background task or interrupt refreshes the watchdog regardless of control-loop progress, the watchdog can remain alive while the function that matters is frozen.

Health evidence may include advancing task counters, fresh sensor data, valid output update, no queue overflow, no critical interlock fault and successful communication within a defined age limit.
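The gating rule can be sketched as a host-side model. This is an illustrative Python sketch under stated assumptions: the names `HealthSample`, `REQUIRED` and `may_refresh` and the single shared age limit are hypothetical, and a real supervisor would normally run in firmware with per-item limits.

```python
from dataclasses import dataclass

# Hypothetical item names mirroring the H terms above.
REQUIRED = ("control", "comm", "sensor", "output")


@dataclass
class HealthSample:
    ok: bool        # result of the most recent check for this item
    stamp_ms: int   # time at which that check last ran


def may_refresh(samples, now_ms, max_age_ms):
    """Return True only if every required item is present, OK and fresh."""
    for name in REQUIRED:
        sample = samples.get(name)
        if sample is None:
            return False          # a missing report counts as unhealthy
        if not sample.ok:
            return False          # the check itself failed
        if now_ms - sample.stamp_ms > max_age_ms:
            return False          # evidence is stale, the task may be frozen
    return True
```

Treating a missing or stale item as a failure is the point of the design: the refresh is withheld unless health is positively demonstrated.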

Windowed Watchdog

A windowed watchdog requires refresh inside a valid time interval:

T_{min}<T_{refresh}<T_{max}

Refreshing too late indicates a stall. Refreshing too early can indicate a runaway loop, corrupted scheduler state or a task that is bypassing normal supervision. Windowed designs are useful when a simple “kick the watchdog often enough” rule is too easy to satisfy accidentally.

The minimum window must be compatible with the health-check period. For example, if health bits update every 20 ms, a refresh window that opens after 40 ms spans two health periods, so the supervisor observes at least two health cycles before it is allowed to refresh.
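The window rule and its relation to the health-check period can be modeled in a few lines. This is a minimal Python sketch for analysis. The function names are assumptions, and real windowed watchdog hardware enforces the window itself rather than calling code like this.

```python
def classify_refresh(elapsed_ms, t_min_ms, t_max_ms):
    """Classify a refresh by the time elapsed since the previous refresh.

    Follows the strict rule T_min < T_refresh < T_max, so a refresh
    exactly on either boundary is treated as invalid.
    """
    if elapsed_ms <= t_min_ms:
        return "early"   # possible runaway loop or bypassed supervision
    if elapsed_ms >= t_max_ms:
        return "late"    # in hardware the timer would already have expired
    return "valid"


def health_cycles_before_open(t_min_ms, health_period_ms):
    """Count the whole health-update periods that fit before the window opens."""
    return t_min_ms // health_period_ms
```

With the 20 ms health period and 40 ms window opening from the text, `health_cycles_before_open(40, 20)` evaluates to 2.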

Fault-to-Safe Time

For safety-related use, the important timing number is often the time from fault occurrence to safe state:

T_{safe}=T_{detect}+T_{reset}+T_{output}

where T_detect is watchdog detection time, T_reset is reset or recovery time and T_output is time for output hardware or interlocks to reach the safe condition. A release requirement may state:

T_{safe}\leq T_{req}

The watchdog timeout alone is not enough if boot time, actuator discharge, brake engagement, valve closure, drive-disable propagation or supervisory restart time dominate the hazard.
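The budget can be checked with a short calculation. This is a sketch in Python with assumed names (`fault_to_safe` and its result fields); it simply adds the three terms and compares the sum with the requirement.

```python
def fault_to_safe(t_detect_ms, t_reset_ms, t_output_ms, t_req_ms):
    """Evaluate T_safe = T_detect + T_reset + T_output against T_req.

    All times are in milliseconds. A negative margin means the
    requirement is missed.
    """
    t_safe_ms = t_detect_ms + t_reset_ms + t_output_ms
    return {
        "t_safe_ms": t_safe_ms,
        "margin_ms": t_req_ms - t_safe_ms,
        "meets_requirement": t_safe_ms <= t_req_ms,
    }
```

Keeping the three terms separate in a review makes it visible which one dominates. A long reset or a slow output stage can consume the budget even when the watchdog timeout looks comfortable.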

Worked Example

An embedded controller has measured maximum normal refresh interval:

T_{normal,max}=38\ \text{ms}

The selected watchdog timeout is:

T_{wd}=120\ \text{ms}

The stale-output hazard window is:

T_{unsafe}=220\ \text{ms}

The basic timeout screen passes:

38<120<220

Reset and recovery require:

T_{reset}=45\ \text{ms},\quad T_{output}=12\ \text{ms}

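Taking the full watchdog timeout as the worst-case detection time (T_detect = T_wd, a fault occurring just after a refresh), the worst-case fault-to-safe time is: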

T_{safe}=120+45+12=177\ \text{ms}

If the requirement is:

T_{req}=200\ \text{ms}

then the safe-state margin is:

M_{safe}=200-177=23\ \text{ms}

Now check the window. If the watchdog opens at:

T_{min}=60\ \text{ms}

and an interrupt refreshes every:

T_{refresh}=40\ \text{ms}

then each refresh arrives before the window opens and is invalid. It is early by:

60-40=20\ \text{ms}

This is a design fault, even though the refresh is frequent: a windowed watchdog treats every early refresh as a failure. The refresh must prove system health, not only processor activity.

Boundary With Interlocks

A watchdog should not be the only protective layer. Hardware interlocks, output enables, brown-out behavior, actuator fail-safe states, current limits, emergency stops and independent protection logic may be needed outside the firmware.

The watchdog can detect loss of progress, but an interlock may be the mechanism that actually removes energy or prevents unsafe motion during reset. Engineering review should state which layer detects the fault, which layer makes the output safe and which evidence proves the sequence.

Validation Evidence

Useful evidence includes timing traces, task heartbeat logs, watchdog configuration, refresh decision table, reset-cause register capture, retained pre-reset context, boot timing, output-state measurements during reset, brown-out tests, interrupt-lockout injection, stuck-task tests, communication-deadlock tests, queue-overload tests and recovery tests from each firmware mode.

Common mistakes include refreshing the watchdog unconditionally from a timer interrupt, clearing it before checking critical health bits, choosing a timeout for convenience rather than from hazard timing, assuming reset is safe, losing the original fault cause, ignoring window constraints, skipping brown-out behavior and validating only the successful path. A strong watchdog design states the timeout, window, health evidence, safe-state action, recovery rule and fault-injection evidence.
