Firmware Reliability, Watchdog, Update, and Safe-State Exercises

Solved firmware reliability exercises for watchdog windows, rollback, reset loops, flash wear, brown-out hold-up, safe states and release evidence.

Branch: Computer Engineering
Content: Exercise set
Updated: Jul 03, 2026
Revision: v1.0.0 · reviewed

These exercises treat embedded firmware reliability as a recovery and release-evidence problem. The emphasis is watchdog configuration, rollback, reset-loop escape, flash endurance, memory drift, brown-out hold-up, safe-state latency and diagnostic evidence. Scheduling and hard real-time latency calculations are covered in the companion real-time scheduling exercise set.

Assume simplified screening models unless an exercise states otherwise. Real release evidence should include fault injection, target-board power tests, persistent-state inspection, code version control, watchdog traces, boot logs, flash erase counters, safe-output measurements and acceptance criteria linked to requirements.

Release Evidence Notes

Firmware reliability is not demonstrated by showing that a board usually reboots. A credible release package should prove that the product detects the fault, reaches a defined safe or degraded state, preserves the evidence needed for diagnosis and avoids a repeated failure path.

The evidence should identify the fault mode, firmware build, bootloader version, nonvolatile-memory layout, watchdog configuration, brown-out threshold, safe-output hardware path and test method used to obtain the result.

Engineering Boundary Notes

These calculations are screening exercises. They do not replace hardware-in-the-loop testing, destructive power interruption tests, flash endurance characterization, electromagnetic immunity testing, compiler qualification, cybersecurity review or product-specific safety analysis. If a failure can harm people, property or mission availability, a numerical pass must be supported by traceable verification evidence.

Scenario Map

Scenario	Exercises	Primary check	Engineering decision
Watchdog and reset recovery	1, 2, 3, 15, 18	Window timing, fault-to-safe time, reset-loop escape and release gate	Decide whether a stall becomes a controlled recovery.
Update and persistent state	4, 5, 11, 12	Rollback timing, brown-out energy, CRC coverage and retained logs	Decide whether interrupted updates and resets remain diagnosable.
Resource endurance	6, 7, 8, 13	Flash wear, memory leak, stack margin and backpressure	Decide whether long operation remains stable.
Safety and diagnostic evidence	9, 10, 14, 16, 17	Risk reduction, debounce, interlocks, degraded mode and fault-injection coverage	Decide whether release evidence supports the claimed risk control.

Exercise 1: Watchdog Window Selection

A task should refresh the watchdog every $80\ \text{ms}$ in normal operation. Valid refresh variation is $\pm 20\ \text{ms}$. The proposed watchdog open window rejects early refreshes below $45\ \text{ms}$ and late refreshes above $130\ \text{ms}$. Check whether normal operation is inside the window.

Solution

Normal refresh range is:

T_{min}=80-20=60\ \text{ms}

T_{max}=80+20=100\ \text{ms}

The watchdog window is:

45\ \text{ms}\le T_{refresh}\le130\ \text{ms}

Because:

60>45,\quad 100<130

normal refreshes are inside the allowed window.

Engineering Comment

The window rejects very early loop-spinning refreshes and late stalled refreshes. Release evidence should measure refresh intervals during worst diagnostic, communication and storage activity.

Plausibility Check

The normal range sits well inside $45$ to $130\ \text{ms}$, with $15\ \text{ms}$ of margin at the early edge and $30\ \text{ms}$ at the late edge, so the configuration is not overly tight.
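The window check can be expressed as a small screening helper. This is a minimal sketch, not a vendor watchdog API; the function name `refresh_window_margins` and its parameters are assumptions introduced for illustration, and the window edges are treated as inclusive to match $45\ \text{ms}\le T_{refresh}\le130\ \text{ms}$.

```python
def refresh_window_margins(nominal_ms, jitter_ms, open_ms, close_ms):
    """Return (inside, early_margin_ms, late_margin_ms) for a windowed watchdog."""
    t_min = nominal_ms - jitter_ms   # earliest valid refresh
    t_max = nominal_ms + jitter_ms   # latest valid refresh
    early_margin = t_min - open_ms   # distance from the early-reject edge
    late_margin = close_ms - t_max   # distance from the late-reject edge
    inside = early_margin >= 0 and late_margin >= 0  # inclusive window edges
    return inside, early_margin, late_margin


print(refresh_window_margins(80, 20, 45, 130))  # (True, 15, 30)
```

Reporting both margins separately makes it easier to see which edge of the window would be violated first if jitter grows.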

Exercise 2: Fault-to-Safe Watchdog Time

A stalled controller is detected when the watchdog expires after $180\ \text{ms}$. Bootloader handoff takes $70\ \text{ms}$, safety initialization takes $45\ \text{ms}$ and output disable takes $15\ \text{ms}$. The requirement is a safe output within $350\ \text{ms}$. Compute the margin.

Solution

T_{safe}=180+70+45+15=310\ \text{ms}

Margin is:

M=350-310=40\ \text{ms}

Engineering Comment

The recovery path passes, but the evidence must show that outputs are actually de-energized after reset and not briefly re-enabled during boot.

Plausibility Check

The watchdog timeout is the dominant term; adding roughly $130\ \text{ms}$ of boot and output time gives a total just above $300\ \text{ms}$.
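The fault-to-safe budget is a sum of serial stages, so it can be tabulated and checked in a few lines. The sketch below is illustrative only; the dictionary keys and the helper name `budget_margin` are assumptions, and the stage values are the ones given in this exercise.

```python
# Serial recovery stages in milliseconds (values from Exercise 2)
stages_ms = {
    "watchdog timeout": 180,
    "bootloader handoff": 70,
    "safety initialization": 45,
    "output disable": 15,
}


def budget_margin(stages, limit_ms):
    """Return (total_ms, margin_ms) for stages that execute one after another."""
    total = sum(stages.values())
    return total, limit_ms - total


total_ms, margin_ms = budget_margin(stages_ms, 350)
dominant = max(stages_ms, key=stages_ms.get)  # stage with the largest share
print(total_ms, margin_ms, dominant)  # 310 40 watchdog timeout
```

Keeping the stages named makes it clear which measurement must be repeated if the margin is later challenged.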

Exercise 3: Reset-Loop Escape Counter

Firmware enters degraded mode after $4$ watchdog resets within a $10\ \text{min}$ rolling window. A unit logs resets at $0$, $90$, $210$ and $390\ \text{s}$. Does it enter degraded mode?

Solution

The span from first to fourth reset is:

\Delta t=390-0=390\ \text{s}

Convert the window:

10\ \text{min}=600\ \text{s}

Since:

390<600

four resets occur inside the window, so degraded mode is required.

Engineering Comment

Reset-loop escape prevents endless rebooting from hiding a persistent fault. The retained counter must survive resets and must not be cleared before diagnostic upload.

Plausibility Check

Four resets in six and a half minutes is clearly inside a ten-minute window.
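A rolling-window rule can be checked by sliding over the sorted reset timestamps. The following is a minimal sketch under stated assumptions: the function name `degraded_mode_required` is hypothetical, and a span exactly equal to the window is treated as outside it, mirroring the strict comparison $390<600$ used above. A real policy should state that boundary explicitly.

```python
def degraded_mode_required(reset_times_s, threshold=4, window_s=600):
    """True if any `threshold` consecutive resets span less than `window_s`."""
    times = sorted(reset_times_s)
    for i in range(len(times) - threshold + 1):
        # Span from the i-th reset to the reset `threshold - 1` entries later
        if times[i + threshold - 1] - times[i] < window_s:
            return True
    return False


print(degraded_mode_required([0, 90, 210, 390]))   # True
print(degraded_mode_required([0, 300, 500, 700]))  # False
```

The second call shows four resets spread over $700\ \text{s}$, which does not trip the rule.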

Exercise 4: Firmware Rollback Timing

An update has four stages: image validation $4.0\ \text{s}$, swap preparation $2.5\ \text{s}$, bank copy $12.0\ \text{s}$ and rollback marker write $0.8\ \text{s}$. The maintenance window allows $25\ \text{s}$. Compute the timing margin.

Solution

T_{update}=4.0+2.5+12.0+0.8=19.3\ \text{s}

M=25.0-19.3=5.7\ \text{s}

Engineering Comment

The timing passes, but the release test should interrupt power at each state boundary and prove that the bootloader can select either the new valid image or the previous image.

Plausibility Check

The bank copy dominates the update time; a total under $20\ \text{s}$ is consistent.

Exercise 5: Brown-Out Hold-Up for Safe Commit

A controller needs $18\ \text{ms}$ to complete a safe nonvolatile commit after brown-out detection. Supply current during commit is $120\ \text{mA}$ at $3.3\ \text{V}$. Available hold-up energy is $9.0\ \text{mJ}$. Check the margin.

Solution

Commit power is:

P=VI=3.3(0.120)=0.396\ \text{W}

Energy required is:

E=P t=0.396(0.018)=0.00713\ \text{J}=7.13\ \text{mJ}

Margin is:

M=9.0-7.13=1.87\ \text{mJ}

Engineering Comment

The energy screen passes, but capacitance tolerance, temperature, aging and brown-out threshold variation can consume the margin.

Plausibility Check

Roughly $0.4\ \text{W}$ for about $0.02\ \text{s}$ requires about $8\ \text{mJ}$, close to the detailed result.
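The energy screen, and the fraction of hold-up energy that can be lost before the margin disappears, can be sketched as follows. The helper name `commit_energy_mj` and the `tolerable_loss` figure are illustrative assumptions; the model is the same constant-power $E=VIt$ screen used above and ignores regulator efficiency and supply droop.

```python
def commit_energy_mj(voltage_v, current_a, commit_time_s):
    """Energy in millijoules for a constant-power commit: E = V * I * t."""
    return voltage_v * current_a * commit_time_s * 1e3


required_mj = commit_energy_mj(3.3, 0.120, 0.018)
available_mj = 9.0
margin_mj = available_mj - required_mj
# Fraction of the available energy that tolerance and aging may consume
tolerable_loss = margin_mj / available_mj

print(f"{required_mj:.2f} mJ required, {margin_mj:.2f} mJ margin, "
      f"{tolerable_loss:.1%} tolerable loss")
# 7.13 mJ required, 1.87 mJ margin, 20.8% tolerable loss
```

Under this simplified model, a loss of roughly one fifth of the hold-up energy would remove the margin, which is why the tolerance and aging assumptions matter.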

Exercise 6: Flash Wear-Leveling Endurance

A flash sector is rated for $100000$ erase cycles. Wear leveling spreads writes over $16$ sectors. The firmware records one persistent event every $30\ \text{s}$. Estimate years until the erase-cycle limit.

Solution

Assuming each record costs one sector erase, the total supported records are:

N=100000(16)=1600000

Time is:

t=N(30)=48000000\ \text{s}

Convert to years:

t_y=\dfrac{48000000}{365(24)(3600)}=1.52\ \text{years}

Engineering Comment

This endurance is weak for most products. The design needs throttling, batching, event compression or memory with higher endurance.

Plausibility Check

One write every half minute is about one million writes per year, so a $1.6$ million write capacity gives about one and a half years.
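The endurance estimate can be parameterized to show how batching changes the result. This is a screening sketch only: `flash_life_years` and the `records_per_erase` parameter are assumptions introduced here, and the default of one record per erase reproduces the worst-case model used in the solution.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def flash_life_years(erase_cycles, sectors, record_period_s, records_per_erase=1):
    """Years until the rated erase-cycle limit under ideal wear leveling."""
    total_records = erase_cycles * sectors * records_per_erase
    return total_records * record_period_s / SECONDS_PER_YEAR


print(round(flash_life_years(100_000, 16, 30), 2))     # 1.52
# Hypothetical batching: 8 records appended per erased sector
print(round(flash_life_years(100_000, 16, 30, 8), 1))  # 12.2
```

The second call is a hypothetical illustration of why batching is listed as a mitigation; the factor of eight is not taken from the exercise.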

Exercise 7: Memory-Leak Endurance

A device has $240\ \text{kB}$ free heap after startup. A guarded minimum of $80\ \text{kB}$ is required. A soak test estimates a leak of $0.35\ \text{kB/h}$. How long until the guard is reached?

Solution

Usable leak budget is:

M=240-80=160\ \text{kB}

Time is:

t=\dfrac{160}{0.35}=457.1\ \text{h}

Convert to days:

t_d=\dfrac{457.1}{24}=19.0\ \text{days}

Engineering Comment

A nineteen-day leak endurance is usually not acceptable for unattended equipment. The leak should be fixed or bounded by a controlled restart strategy.

Plausibility Check

At about one third of a kilobyte per hour, losing $160\ \text{kB}$ takes several hundred hours.
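A linear leak model gives the time to the guard directly. The sketch below assumes a constant leak rate, which a soak test may or may not confirm; the name `hours_to_guard` is hypothetical.

```python
def hours_to_guard(free_kb, guard_kb, leak_kb_per_h):
    """Hours until free heap falls to the guard level at a constant leak rate."""
    if leak_kb_per_h <= 0:
        raise ValueError("leak rate must be positive")
    return (free_kb - guard_kb) / leak_kb_per_h


hours = hours_to_guard(240, 80, 0.35)
print(f"{hours:.1f} h = {hours / 24:.1f} days")  # 457.1 h = 19.0 days
```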

Exercise 8: Stack Margin with Nested Interrupts

A task stack is $4096\ \text{bytes}$. Measured high-water use is $2650\ \text{bytes}$. Worst nested interrupt use is $420\ \text{bytes}$ and a release guard of $512\ \text{bytes}$ is required. Check the margin.

Solution

Worst guarded use is:

S=2650+420+512=3582\ \text{bytes}

Remaining margin is:

M=4096-3582=514\ \text{bytes}

Engineering Comment

The screen passes with limited margin. The result is valid only if the high-water measurement was taken on a build with the same compiler options, debug settings and interrupt nesting as the release.

Plausibility Check

The guard alone is about $0.5\ \text{kB}$, so only about another $0.5\ \text{kB}$ remains after all allowances.

Exercise 9: Diagnostic Coverage Risk Reduction

A failure mode has severity $S=8$, occurrence $O=4$ and detection $D=7$. A diagnostic reduces the detection rating to $3$. Compute RPN before and after.

Solution

Initial RPN:

RPN_0=SOD=8(4)(7)=224

After diagnostic:

RPN_1=8(4)(3)=96

Reduction is:

\Delta RPN=224-96=128

Engineering Comment

The numerical RPN improves, but release evidence must prove diagnostic coverage through fault injection or justified analysis.

Plausibility Check

Only the detection rating changes, so reducing it from $7$ to $3$ should reduce RPN by more than half.

Exercise 10: Fault-Input Debounce to Safe Output

A fault input requires $20\ \text{ms}$ debounce. The diagnostic task period is $10\ \text{ms}$, worst task response is $6\ \text{ms}$ and output driver turn-off is $4\ \text{ms}$. The requirement is a safe output within $50\ \text{ms}$. Compute the margin.

Solution

Worst detection and reaction time:

T=20+10+6+4=40\ \text{ms}

Margin is:

M=50-40=10\ \text{ms}

Engineering Comment

The path passes, but the debounce rule must not mask real faults that need faster hardware protection.

Plausibility Check

The debounce interval is half the total; the remaining terms add another $20\ \text{ms}$.

Exercise 11: Boot Image CRC Coverage Time

A bootloader checks a $768\ \text{kB}$ image. CRC throughput is $12\ \text{MB/s}$. The boot budget allows $90\ \text{ms}$ for image verification. Check the margin using $1\ \text{MB}=1024\ \text{kB}$.

Solution

Image size in MB:

S=\dfrac{768}{1024}=0.75\ \text{MB}

CRC time:

t=\dfrac{0.75}{12}=0.0625\ \text{s}=62.5\ \text{ms}

Margin:

M=90-62.5=27.5\ \text{ms}

Engineering Comment

The CRC fits the boot budget. Release evidence should also show what happens on CRC failure and how the fallback image is selected.

Plausibility Check

At $12\ \text{MB/s}$, checking less than one megabyte should take less than one tenth of a second.
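The verification time, and the largest image that would still fit the budget, follow from the same throughput relation. This sketch assumes constant CRC throughput and the $1\ \text{MB}=1024\ \text{kB}$ convention stated in the exercise; the function names are hypothetical.

```python
def crc_time_ms(image_kb, throughput_mb_s, kb_per_mb=1024):
    """Verification time in milliseconds at a constant CRC throughput."""
    return image_kb / kb_per_mb / throughput_mb_s * 1e3


def max_image_kb(budget_ms, throughput_mb_s, kb_per_mb=1024):
    """Largest image in kB that can be verified within the budget."""
    return budget_ms / 1e3 * throughput_mb_s * kb_per_mb


print(crc_time_ms(768, 12))               # 62.5
print(f"{max_image_kb(90, 12):.1f} kB")   # 1105.9 kB
```

The second figure indicates how much image growth the boot budget can absorb before the verification step alone exceeds it.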

Exercise 12: Persistent Fault Log Retention

A retained fault log has $256$ slots. A reset-loop fault can write at most $4$ entries per hour after throttling. How many days of reset-loop evidence can be retained?

Solution

Retention time:

t=\dfrac{256}{4}=64\ \text{h}

Convert to days:

t_d=\dfrac{64}{24}=2.67\ \text{days}

Engineering Comment

The log keeps several days of evidence, which may be enough for connected equipment but weak for devices inspected monthly.

Plausibility Check

Four entries per hour means $96$ per day; $256$ slots last a bit less than three days.

Exercise 13: Queue Backpressure to Degraded Mode

A safety monitor receives diagnostic events at $50\ \text{events/s}$ during a fault burst and can process $35\ \text{events/s}$. Degraded mode should trigger before a $200$-slot queue overflows. Starting from empty, how long until the queue is full?

Solution

Net growth rate is:

r=50-35=15\ \text{events/s}

Time to fill:

t=\dfrac{200}{15}=13.3\ \text{s}

Engineering Comment

The degraded-mode trigger must occur well before $13.3\ \text{s}$, or the system loses diagnostic evidence during the burst.

Plausibility Check

A net accumulation of $15$ events per second fills $150$ slots in ten seconds, so $200$ slots in about thirteen seconds is plausible.
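The fill time generalizes to any queue depth, which is useful for placing a degraded-mode watermark below the full capacity. This is a fluid-model sketch assuming constant arrival and service rates; `time_to_depth_s` is a hypothetical helper, and the $150$-slot watermark is only an example taken from the plausibility check.

```python
def time_to_depth_s(depth_slots, arrival_per_s, service_per_s):
    """Seconds for an initially empty queue to reach `depth_slots`."""
    net = arrival_per_s - service_per_s
    if net <= 0:
        return float("inf")  # the queue never grows under these rates
    return depth_slots / net


print(round(time_to_depth_s(200, 50, 35), 1))  # 13.3 (queue full)
print(time_to_depth_s(150, 50, 35))            # 10.0 (example watermark)
```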

Exercise 14: Safe-State Interlock Proof Coverage

A test matrix has $12$ required safe-state interlock cases. Ten passed, one was blocked by missing equipment and one failed. Compute completed-pass coverage and release decision if the requirement is $100\%$ pass.

Solution

Completed-pass coverage is:

C=\dfrac{10}{12}=0.833=83.3\%

Because the requirement is $100\%$ pass, and one case failed while another was never executed, release is blocked.

Engineering Comment

Blocked tests are not passes. A release exception would need formal risk acceptance and compensating evidence.

Plausibility Check

Two of twelve cases are not successful, so coverage should be clearly below $100\%$.
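The coverage rule can be made explicit so that blocked cases are never counted as passes. This is a minimal sketch; the status strings and the function name `interlock_gate` are assumptions for illustration, not a test-management tool format.

```python
from collections import Counter


def interlock_gate(results, required_pass_fraction=1.0):
    """Return (coverage, released); only 'pass' counts toward coverage."""
    counts = Counter(results)
    coverage = counts["pass"] / len(results)
    return coverage, coverage >= required_pass_fraction


# Ten passes, one blocked case and one failure, as in Exercise 14
results = ["pass"] * 10 + ["blocked", "fail"]
coverage, released = interlock_gate(results)
print(f"{coverage:.1%}", released)  # 83.3% False
```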

Exercise 15: Watchdog Service Jitter Gate

A watchdog must be serviced between $60\ \text{ms}$ and $140\ \text{ms}$. Trace data show a minimum service interval of $58\ \text{ms}$ and a maximum of $121\ \text{ms}$. Decide the gate status.

Solution

The maximum interval passes because:

121<140

The minimum interval fails because:

58<60

Therefore the service pattern fails the watchdog window gate.

Engineering Comment

Early service can hide a runaway loop. The watchdog should be refreshed only after key health checks have completed.

Plausibility Check

The violation is small, but a window watchdog intentionally treats early refresh as evidence of abnormal control flow.
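A trace-based gate can list every interval that falls outside the window rather than checking only the extremes. In the sketch below, `window_violations` is a hypothetical helper and the trace is invented for illustration; only its minimum of $58\ \text{ms}$ and maximum of $121\ \text{ms}$ come from the exercise.

```python
def window_violations(intervals_ms, open_ms=60, close_ms=140):
    """Return (early, late) lists of service intervals outside the window."""
    early = [t for t in intervals_ms if t < open_ms]
    late = [t for t in intervals_ms if t > close_ms]
    return early, late


# Hypothetical trace; only the 58 ms minimum and 121 ms maximum are given
trace_ms = [58, 80, 95, 121]
early, late = window_violations(trace_ms)
print(early, late)                 # [58] []
print(not early and not late)      # False: the gate fails
```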

Exercise 16: Degraded-Mode Capacity

In degraded mode, the device must keep essential sensing, alarm output and communication alive. Their CPU loads are $12\%$, $8\%$ and $15\%$. A $20\%$ guard is required. Check whether a $60\%$ degraded CPU budget is enough.

Solution

Essential load is:

U=0.12+0.08+0.15=0.35

With guard:

U_g=0.35+0.20=0.55

Budget margin:

M=0.60-0.55=0.05

So the budget passes with $5\%$ CPU margin.

Engineering Comment

The degraded feature set fits, but release evidence should verify that nonessential services are actually disabled and cannot starve the essential path.

Plausibility Check

The essential load is about one third of the processor; adding a large guard brings it just under the $60\%$ budget.

Exercise 17: Fault-Injection Sample Coverage

A release plan requires at least $30$ injections for each of $6$ critical firmware fault modes. The team completed $168$ injections total, evenly distributed. Does it meet the plan?

Solution

Required injections:

N_{req}=30(6)=180

Completed per mode:

N_{mode}=\dfrac{168}{6}=28

Because:

28<30

the plan is not complete.

Engineering Comment

Even distribution helps, but every critical mode is under target. Release should wait for the missing injections or a justified plan revision.

Plausibility Check

The total is only $12$ injections short, which corresponds to two missing tests per mode.
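Because the plan is per mode, the check should be made per mode rather than on the total. The sketch below is illustrative: the mode names and the uneven distribution are hypothetical, and `injection_shortfall` is an assumed helper name.

```python
def injection_shortfall(completed_per_mode, required_per_mode=30):
    """Return {mode: missing injections} for every mode below the target."""
    return {
        mode: required_per_mode - done
        for mode, done in completed_per_mode.items()
        if done < required_per_mode
    }


# Exercise 17: 168 injections spread evenly over 6 modes
even = {f"mode_{i}": 28 for i in range(1, 7)}
print(sum(injection_shortfall(even).values()))  # 12

# Hypothetical uneven case: the total reaches 180 but one mode is still short
uneven = {"mode_1": 30, "mode_2": 30, "mode_3": 30,
          "mode_4": 30, "mode_5": 40, "mode_6": 20}
print(injection_shortfall(uneven))  # {'mode_6': 10}
```

The uneven case shows why meeting the total of $180$ would not by itself satisfy a per-mode requirement.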

Exercise 18: Firmware Reliability Release Gate

A release gate requires all four checks to pass: watchdog recovery margin, rollback power-loss test, flash endurance margin and safe-state proof. Results are pass, pass, conditional pass and pass. Determine release status.

Solution

The gate is all-of:

G=R_w \land R_u \land R_f \land R_s

A conditional pass is not a full pass. Therefore:

G=\text{blocked}

Engineering Comment

Reliability release gates should not average evidence. A weak flash-endurance condition can invalidate an otherwise strong recovery package.

Plausibility Check

Because the rule requires every check to pass, one conditional result is enough to block release.
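An all-of gate maps naturally onto a logical conjunction in which only a full pass counts. This is a minimal sketch; the `Result` enumeration and the check names are assumptions that mirror the four checks listed in the exercise.

```python
from enum import Enum


class Result(Enum):
    PASS = "pass"
    CONDITIONAL = "conditional pass"
    FAIL = "fail"


def release_gate(checks):
    """All-of gate: every check must be a full PASS."""
    return all(result is Result.PASS for result in checks.values())


checks = {
    "watchdog recovery margin": Result.PASS,
    "rollback power-loss test": Result.PASS,
    "flash endurance margin": Result.CONDITIONAL,
    "safe-state proof": Result.PASS,
}
blocking = [name for name, result in checks.items() if result is not Result.PASS]
print(release_gate(checks), blocking)  # False ['flash endurance margin']
```

Listing the blocking checks alongside the decision keeps the gate output traceable to the evidence that needs rework.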

Common Release Mistakes

Refreshing the watchdog from a timer interrupt instead of after meaningful health checks.
Treating any reboot as recovery without proving safe outputs and retained diagnostics.
Testing firmware updates only under clean power and never at state transitions.
Ignoring flash wear, retained-log overflow, stack margin and heap drift.
Counting blocked or conditional safe-state tests as passed evidence.
Letting reset-loop counters clear before field diagnostics can be retrieved.

Validation Package Checklist

Watchdog configuration, service trace and fault-to-safe timing evidence.
Reset-loop escape policy, retained counters and degraded-mode entry proof.
Update rollback tests with power interruption at every critical state.
Brown-out hold-up test with tolerance, temperature and aging assumptions.
Flash erase-count, wear-leveling and write-throttling evidence.
Stack, heap, queue and retained-log high-water evidence from stress and soak tests.
Safe-state interlock tests, fault injection records and requirement traceability.
