Embedded Watchdog and Fault Recovery Case Study
Case study of an embedded motor controller fault-recovery redesign, covering watchdog timing, scheduler blocking, flash logging, safe outputs, reset diagnosis, brown-out behavior, communication faults, and validation evidence.
This case study follows an embedded motor-controller team investigating rare field resets and unsafe recovery behavior. The controller drives a small industrial actuator, reads a quadrature encoder, closes a position loop, communicates with a supervisory controller, and stores diagnostic events in nonvolatile memory.
The first product version includes a watchdog, but the watchdog is treated as a reset button rather than a designed recovery function. Field evidence shows that the device sometimes restarts with the actuator still energized, loses the reason for the reset, and repeats the same fault when the triggering condition remains present.
The engineering question is:
Does the watchdog make the system safer and more diagnosable, or does it only hide an unbounded timing and state-management problem?
Case Summary
| Item | Engineering relevance |
|---|---|
| Product | Embedded actuator controller with motor drive, encoder, limit inputs, and fieldbus communication. |
| Main symptom | Rare watchdog resets during high bus traffic and event logging. |
| Initial risk | Motor output can remain enabled briefly during reset and recovery. |
| Root causes | Long non-preemptive flash logging, weak safe-state hardware, incomplete reset diagnosis, and ambiguous recovery mode. |
| Corrective action | Redesign watchdog timing, scheduler boundaries, logging path, output interlock, reset records, and validation tests. |
| Required evidence | Timing traces, fault-injection results, brown-out tests, communication stress tests, and safe-output verification. |
The case is realistic because watchdog failures are rarely only software failures. They usually involve firmware timing, power behavior, hardware output states, logging, diagnostics, and operator recovery.
Initial Architecture
The controller has:
- a 120 MHz microcontroller;
- a 1 kHz position-control loop;
- encoder capture interrupts;
- a fieldbus communication task;
- nonvolatile event logging;
- a motor-drive enable line;
- two limit inputs;
- a hardware watchdog with configurable timeout;
- a brown-out detector;
- a diagnostic LED and fieldbus status register.
The nominal firmware modes are boot, self-test, idle, enabled, motion, fault, recovery, and firmware update. In the first release, the watchdog is refreshed from the main loop whenever the scheduler is alive.
That design misses a key point: a scheduler can be alive while a safety-critical task is starved, and a reset can be unsafe if output states are not controlled independently.
Field Event
Field logs show a recurring pattern:
- high fieldbus traffic starts during machine commissioning;
- multiple warning events are written to flash;
- the motor command stops updating for a group of consecutive control cycles;
- the watchdog resets the microcontroller;
- the controller boots quickly;
- the drive enable line briefly returns high before the fault state is reconstructed;
- the supervisory controller reports a generic communication timeout, not the original cause.
The reset prevents a permanent freeze, but it does not provide a controlled recovery. The team stops treating the watchdog as the root solution and analyzes the timing path.
Timing Budget
The 1 kHz control loop has period:
$$T_{\text{ctrl}} = \frac{1}{1000\ \text{Hz}} = 1\ \text{ms}$$
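The measured worst-case control computation is $C_{\text{ctrl}}$, and sensor processing and actuator update require $C_{\text{io}}$. The basic loop time is:
$$C_{\text{loop}} = C_{\text{ctrl}} + C_{\text{io}}$$
Deadline margin before other interference:
$$M = 1\ \text{ms} - C_{\text{loop}}$$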
This looks acceptable, but it excludes blocking and interrupt interference. The missed deadlines appear only when flash logging and bus traffic overlap.
Blocking Fault
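The event logger writes a diagnostic page to flash. During part of the flash operation, the first-release driver disables interrupts for:
$$t_{\text{block}} = 18\ \text{ms}$$
The control loop deadline is:
$$D_{\text{ctrl}} = 1\ \text{ms}$$
Even before considering the rest of the response path:
$$t_{\text{block}} = 18\ \text{ms} > D_{\text{ctrl}} = 1\ \text{ms}$$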
The system cannot meet a 1 ms control deadline while interrupts are disabled for 18 ms. The watchdog reset is therefore a symptom of a design violation, not the primary fault.
Watchdog Timing Review
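The first release sets a watchdog timeout $T_{\text{wd}}$. The longest normal scheduler gap was assumed to be $T_{\text{gap}}$, and the actuator can tolerate stale command output for at most $T_{\text{stale}}$. The watchdog condition:
$$T_{\text{gap}} < T_{\text{wd}} < T_{\text{stale}}$$
appears satisfied with the first-release values.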
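However, the premise is wrong. The assumed scheduler gap does not account for the 18 ms non-preemptive flash section, which can combine with communication bursts and delayed recovery. Because the refresh comes from the main loop rather than from the control task, the controller can also keep refreshing the watchdog while the control loop is already unhealthy.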
Failure-Mode Reframing
The team redefines the failure mode from “watchdog reset occurred” to:
The actuator command path can stop being updated or can restart without a guaranteed safe output state.
This reframing changes the corrective action. The team must prove:
- critical control work cannot be blocked by diagnostics;
- the watchdog is refreshed only when safety-critical tasks have progressed;
- outputs go to a safe state during reset, brown-out, boot, and watchdog recovery;
- reset cause and pre-reset context are preserved;
- recovery does not automatically re-enable motion without authorization.
Corrective Architecture
The revised architecture includes:
- a windowed watchdog supervised by a health monitor task;
- per-task heartbeat counters for control, encoder, communication, and diagnostics;
- no watchdog refresh unless critical heartbeats advance inside their deadlines;
- flash logging moved to a bounded background state machine;
- interrupts disabled only for short register-critical sections;
- motor enable gated by a hardware interlock that defaults off during reset;
- brown-out threshold set above the region where outputs become ambiguous;
- retained reset record with reset cause, task state, fault code, and timestamp;
- recovery mode that requires actuator-safe confirmation before re-enable.
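The heartbeat-gated refresh from the list above could be sketched roughly as follows. This is a minimal illustration, not the team's firmware: the task identifiers, the `heartbeat_t` layout, and the `min_advance` thresholds are all assumptions, and the device-specific watchdog refresh call is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task identifiers; the names are illustrative only. */
enum { TASK_CONTROL, TASK_ENCODER, TASK_COMM, TASK_DIAG, TASK_COUNT };

typedef struct {
    volatile uint32_t counter; /* incremented by the task after each completed cycle */
    uint32_t last_seen;        /* counter value at the previous health check */
    uint32_t min_advance;      /* assumed minimum increments per check period */
    bool     critical;         /* true if a stall must block the watchdog refresh */
} heartbeat_t;

/* Called by each task at the end of a successful cycle. */
static inline void heartbeat_kick(heartbeat_t *hb)
{
    hb->counter++;
}

/*
 * Called by the health monitor once per check period.
 * Returns a bitmap of tasks that advanced enough since the last check.
 * *refresh_ok is true only if every critical task advanced enough.
 */
static uint32_t health_check(heartbeat_t hb[], int n, bool *refresh_ok)
{
    uint32_t healthy = 0;
    *refresh_ok = true;
    for (int i = 0; i < n; i++) {
        uint32_t now = hb[i].counter;
        uint32_t advance = now - hb[i].last_seen; /* unsigned math is wrap-safe */
        hb[i].last_seen = now;
        if (advance >= hb[i].min_advance) {
            healthy |= (1u << i);
        } else if (hb[i].critical) {
            *refresh_ok = false; /* starve the watchdog on purpose */
        }
    }
    return healthy;
}
```

The monitor would refresh the hardware watchdog only when `refresh_ok` is true, and the returned bitmap is a natural candidate for the "last healthy task bitmap" stored before a reset. With a 10 ms monitor period and a 1 kHz control loop, a `min_advance` slightly below 10 for the control task would tolerate jitter; the exact value is a tuning decision that this sketch does not settle.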
The watchdog is now part of a fault-management design, not a single line of defensive firmware.
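The bounded background logger from the same list might be structured as a step function that programs a small chunk per call. The sketch below assumes a chunk size of 8 bytes and a hypothetical `flash_program_fn` driver hook; `fake_program` is a RAM-backed stand-in so the logic can be exercised on a host, not a real flash driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_CHUNK_BYTES 8u /* assumed bytes programmed per step; tune to the flash part */

typedef enum { LOG_IDLE, LOG_WRITING } log_state_t;

/* Hypothetical driver hook: programs n bytes at offset, returns true on success. */
typedef bool (*flash_program_fn)(size_t offset, const uint8_t *data, size_t n);

typedef struct {
    log_state_t      state;
    const uint8_t   *src;     /* pending record; caller keeps it alive until idle */
    size_t           len;
    size_t           offset;  /* bytes already programmed */
    flash_program_fn program;
} logger_t;

/* Host-side stand-in for flash so the state machine can be tested off target. */
static uint8_t fake_flash[64];

static bool fake_program(size_t offset, const uint8_t *data, size_t n)
{
    if (offset > sizeof fake_flash || n > sizeof fake_flash - offset) return false;
    memcpy(&fake_flash[offset], data, n);
    return true;
}

/* Queue one record. Returns false if a write is already in progress or len is 0. */
static bool logger_submit(logger_t *lg, const uint8_t *rec, size_t len)
{
    if (lg->state != LOG_IDLE || len == 0) return false;
    lg->src = rec;
    lg->len = len;
    lg->offset = 0;
    lg->state = LOG_WRITING;
    return true;
}

/* Perform at most one chunk of work. Returns true while work remains. */
static bool logger_step(logger_t *lg)
{
    if (lg->state != LOG_WRITING) return false;

    size_t n = lg->len - lg->offset;
    if (n > LOG_CHUNK_BYTES) n = LOG_CHUNK_BYTES;

    if (!lg->program(lg->offset, lg->src + lg->offset, n)) {
        lg->state = LOG_IDLE; /* abandon on failure; real code would record the error */
        return false;
    }
    lg->offset += n;
    if (lg->offset == lg->len) lg->state = LOG_IDLE;
    return lg->state == LOG_WRITING;
}
```

Because each call does a fixed maximum amount of work, the logger's worst-case execution time per step can be measured and budgeted like any other periodic task, and the control loop can preempt it between steps.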
Scheduler and Utilization Check
After redesign, the periodic workload is:
| Task | Period | WCET |
|---|---|---|
| Control loop | $1\ \text{ms}$ | $520\ \mu\text{s}$ |
| Communication service | $5\ \text{ms}$ | $420\ \mu\text{s}$ |
| Health monitor | $10\ \text{ms}$ | $120\ \mu\text{s}$ |
| Diagnostic logger step | $20\ \text{ms}$ | $500\ \mu\text{s}$ |
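Utilization is:
$$U = \sum_i \frac{C_i}{T_i} = \frac{0.52}{1} + \frac{0.42}{5} + \frac{0.12}{10} + \frac{0.50}{20} = 0.52 + 0.084 + 0.012 + 0.025 = 0.641$$
The utilization screen is now:
$$U \approx 0.64 < 1$$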
This does not prove schedulability, but it gives room for interrupt overhead and measurement margin. The team then verifies response times with traces.
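One way to go a step beyond the utilization screen is a classical response-time iteration over the task table. The sketch below assumes fixed-priority preemptive scheduling with rate-monotonic priority order, deadlines equal to periods, and no blocking or interrupt load. None of these assumptions is stated in the case, so the numbers are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t period_us; /* also used as the deadline in this sketch */
    uint32_t wcet_us;
} task_t;

/* Priority order, highest first (assumed rate-monotonic: shortest period first). */
static const task_t tasks[] = {
    {  1000u, 520u }, /* control loop */
    {  5000u, 420u }, /* communication service */
    { 10000u, 120u }, /* health monitor */
    { 20000u, 500u }, /* diagnostic logger step */
};
#define TASK_N (sizeof tasks / sizeof tasks[0])

/* Total processor utilization: sum of WCET / period. */
static double utilization(const task_t *t, size_t n)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += (double)t[i].wcet_us / (double)t[i].period_us;
    return u;
}

/*
 * Worst-case response time of task i in microseconds, iterating
 * R = C_i + sum over higher-priority j of ceil(R / T_j) * C_j.
 * Returns 0 if the response time would exceed the task's period.
 */
static uint32_t response_time_us(const task_t *t, size_t i)
{
    uint32_t r = t[i].wcet_us;
    for (;;) {
        uint32_t next = t[i].wcet_us;
        for (size_t j = 0; j < i; j++) {
            uint32_t releases = (r + t[j].period_us - 1u) / t[j].period_us; /* ceil */
            next += releases * t[j].wcet_us;
        }
        if (next > t[i].period_us) return 0; /* deadline miss */
        if (next == r) return r;             /* fixed point reached */
        r = next;
    }
}
```

Under these assumptions the iteration converges to 520, 940, 1580, and 2600 µs for the four tasks in table order. Because the model ignores interrupt service time and any remaining critical sections, it complements trace measurements rather than replacing them.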
Safe Output Design
The first design relied on firmware to de-energize the motor output after reset. The revised design changes the hardware and boot sequence:
- the motor-drive enable pin has a hardware pull-down;
- the drive gate-enable signal requires a firmware enable and an independent safety interlock;
- the bootloader leaves actuator outputs disabled;
- the application enables outputs only after self-test and fault-state reconstruction;
- a watchdog reset latches a fault that prevents automatic motion restart;
- the supervisory controller must send an explicit re-enable command after the fault is acknowledged.
This makes reset behavior deterministic. A watchdog reset can no longer pass through a brief uncontrolled output state.
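The enable conditions above can be expressed as a small permission function. This is a sketch under assumptions: the `enable_state_t` fields and function names are hypothetical, and `drive_gate_enabled` only models the hardware AND of the firmware request and the independent interlock, which in the real design is enforced electrically.

```c
#include <stdbool.h>

typedef struct {
    bool self_test_passed;
    bool fault_state_reconstructed;
    bool fault_latched;       /* set at boot when the reset cause was the watchdog */
    bool fault_acknowledged;  /* set when the fault has been acknowledged */
    bool reenable_commanded;  /* explicit re-enable from the supervisory controller */
} enable_state_t;

/* Boot starts with everything denied; a watchdog reset latches a fault. */
static void boot_init(enable_state_t *s, bool reset_was_watchdog)
{
    s->self_test_passed = false;
    s->fault_state_reconstructed = false;
    s->fault_latched = reset_was_watchdog;
    s->fault_acknowledged = false;
    s->reenable_commanded = false;
}

/* True only when firmware is allowed to request the drive enable. */
static bool firmware_enable_permitted(const enable_state_t *s)
{
    if (!s->self_test_passed || !s->fault_state_reconstructed)
        return false;
    if (s->fault_latched && !(s->fault_acknowledged && s->reenable_commanded))
        return false;
    return true;
}

/* Models the hardware gate: both the firmware request and the interlock are needed. */
static bool drive_gate_enabled(bool firmware_enable, bool interlock_ok)
{
    return firmware_enable && interlock_ok;
}
```

Keeping the decision in one pure function makes it straightforward to test every combination of inputs on a host before checking the same behavior on hardware.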
Reset Record
The retained reset record includes:
| Field | Purpose |
|---|---|
| Reset cause | Watchdog, brown-out, software reset, external reset, update reset. |
| Last healthy task bitmap | Shows which critical task stopped advancing. |
| Last scheduler tick | Estimates time since last healthy state. |
| Output state before reset | Confirms whether actuator was enabled. |
| Bus error counters | Identifies communication stress or bus lockup. |
| Flash logger state | Shows whether logging was active. |
| Supply-voltage minimum | Supports brown-out diagnosis. |
| Firmware build identity | Ties the evidence to the deployed configuration. |
The record is small enough to update safely and is protected against partial writes. It turns a generic reset into a diagnosable event.
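The case does not say how the record is protected against partial writes. One common approach, shown here purely as an assumption, is a magic marker plus a CRC-32 written last, so that an interrupted update fails validation at the next boot. The field widths, the `RESET_RECORD_MAGIC` value, and the function names are hypothetical, and placement in retained memory is toolchain-specific and omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RESET_RECORD_MAGIC 0x52535452u /* arbitrary marker chosen for this sketch */

typedef enum {
    RESET_WATCHDOG, RESET_BROWNOUT, RESET_SOFTWARE, RESET_EXTERNAL, RESET_UPDATE
} reset_cause_t;

/* All fields are 32-bit so the layout has no padding bytes. */
typedef struct {
    uint32_t magic;
    uint32_t reset_cause;          /* a reset_cause_t value */
    uint32_t healthy_task_bitmap;
    uint32_t last_scheduler_tick;
    uint32_t output_state;
    uint32_t bus_error_count;
    uint32_t flash_logger_state;
    uint32_t supply_min_mv;
    uint32_t build_id;
    uint32_t crc;                  /* CRC-32 over all preceding fields; must be last */
} reset_record_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), small rather than fast. */
static uint32_t record_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Stamp the marker and checksum after all other fields have been written. */
static void reset_record_seal(reset_record_t *r)
{
    r->magic = RESET_RECORD_MAGIC;
    r->crc = record_crc32((const uint8_t *)r, offsetof(reset_record_t, crc));
}

/* A record is trusted only if both the marker and the checksum match. */
static bool reset_record_valid(const reset_record_t *r)
{
    if (r->magic != RESET_RECORD_MAGIC) return false;
    return r->crc == record_crc32((const uint8_t *)r, offsetof(reset_record_t, crc));
}
```

At boot, an invalid record would be reported as such rather than interpreted, which also covers the corrupted-record case in fault-injection testing.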
Fault-Injection Tests
The validation plan includes deliberate fault cases:
- force an infinite loop in a noncritical task;
- block the communication task while control remains active;
- block the control heartbeat;
- simulate fieldbus traffic bursts;
- inject flash-write interruption;
- force brown-out near the reset threshold;
- disconnect encoder signals during motion;
- hold a limit input active during boot;
- corrupt a retained reset record;
- power-cycle during recovery mode.
Each test has an expected safe response. A watchdog test passes only if the output state, reset record, boot path, and recovery mode are all correct.
Validation Results
The revised controller is tested under communication stress and event-logging load. Measured worst observed response times are:
| Function | Deadline | Worst observed response |
|---|---|---|
| Control loop | $1.0\ \text{ms}$ | $0.74\ \text{ms}$ |
| Encoder service | $0.5\ \text{ms}$ | $0.21\ \text{ms}$ |
| Health monitor | $10\ \text{ms}$ | $2.8\ \text{ms}$ |
| Safe output disable after fault | $5.0\ \text{ms}$ | $1.6\ \text{ms}$ |
Control-loop margin:
$$1.0\ \text{ms} - 0.74\ \text{ms} = 0.26\ \text{ms}$$
The margin is positive, but the team records the test conditions: bus load, logging rate, firmware build, compiler version, temperature, supply voltage, and instrumentation method.
Brown-Out Recovery
Brown-out testing finds a second issue: the old threshold allowed the microcontroller to continue running while the motor-drive supply was already outside its valid range. The revised design raises the brown-out threshold and verifies that the drive enable remains off during supply collapse and recovery.
This matters because a watchdog cannot solve every fault. If the processor supply, output driver, or external actuator supply is invalid, the safe state must be enforced by electrical design as well as firmware.
Engineering Outcome
The final release changes the product behavior:
- flash logging can no longer block control deadlines;
- watchdog refresh depends on critical task health, not only main-loop execution;
- reset cause and pre-reset context are preserved;
- actuator outputs default safe during reset and boot;
- watchdog recovery enters a latched fault state;
- field diagnostics distinguish watchdog, brown-out, bus overload, and encoder faults;
- validation includes negative tests instead of only nominal motion tests.
The team also updates service documentation. A watchdog reset is no longer reported as a generic communication dropout. It is reported with the associated task health and recovery state.
Lessons
The main lesson is that a watchdog is not a substitute for bounded design. It is a last-resort recovery mechanism that must be connected to timing analysis, hardware safe states, diagnostics, and validation.
Another lesson is that reset is an operating mode. Outputs, communication, calibration, logs, and actuator permissions must be specified during reset, boot, recovery, and update, not only during normal operation.
The best watchdog design does not simply restart the product. It makes unsafe states harder to reach, makes failures visible, and gives the system a controlled path back to a safe, diagnosable state.