Exercise set

Firmware Reliability, Watchdog, Update, and Safe-State Exercises

Solved firmware reliability exercises for watchdog windows, rollback, reset loops, flash wear, brown-out hold-up, safe states and release evidence.

These exercises treat embedded firmware reliability as a recovery and release-evidence problem. The emphasis is watchdog configuration, rollback, reset-loop escape, flash endurance, memory drift, brown-out hold-up, safe-state latency and diagnostic evidence. Scheduling and hard real-time latency calculations are covered in the companion real-time scheduling exercise set.

Assume simplified screening models unless an exercise states otherwise. Real release evidence should include fault injection, target-board power tests, persistent-state inspection, code version control, watchdog traces, boot logs, flash erase counters, safe-output measurements and acceptance criteria linked to requirements.

Release Evidence Notes

Firmware reliability is not demonstrated by showing that a board usually reboots. A credible release package should prove that the product detects the fault, reaches a defined safe or degraded state, preserves the evidence needed for diagnosis and avoids a repeated failure path.

The evidence should identify the fault mode, firmware build, bootloader version, nonvolatile-memory layout, watchdog configuration, brown-out threshold, safe-output hardware path and test method used to obtain the result.

Engineering Boundary Notes

These calculations are screening exercises. They do not replace hardware-in-the-loop testing, destructive power interruption tests, flash endurance characterization, electromagnetic immunity testing, compiler qualification, cybersecurity review or product-specific safety analysis. If a failure can harm people, property or mission availability, a numerical pass must be supported by traceable verification evidence.

Scenario Map

Scenario | Exercises | Primary check | Engineering decision
Watchdog and reset recovery | 1, 2, 3, 15, 18 | Window timing, fault-to-safe time, reset-loop escape and release gate | Decide whether a stall becomes a controlled recovery.
Update and persistent state | 4, 5, 11, 12 | Rollback timing, brown-out energy, CRC coverage and retained logs | Decide whether interrupted updates and resets remain diagnosable.
Resource endurance | 6, 7, 8, 13 | Flash wear, memory leak, stack margin and backpressure | Decide whether long operation remains stable.
Safety and diagnostic evidence | 9, 10, 14, 16, 17 | Risk reduction, debounce, interlocks, degraded mode and fault-injection coverage | Decide whether release evidence supports the claimed risk control.

Exercise 1: Watchdog Window Selection

A task should refresh the watchdog every 80\ \text{ms} in normal operation. Valid refresh variation is \pm 20\ \text{ms}. Select a watchdog open window that rejects early refreshes below 45\ \text{ms} and late refreshes above 130\ \text{ms}. Check whether normal operation is inside the window.

Solution

Normal refresh range is:

T_{min}=80-20=60\ \text{ms}
T_{max}=80+20=100\ \text{ms}

The watchdog window is:

45\ \text{ms}\le T_{refresh}\le130\ \text{ms}

Because:

60>45,\quad 100<130

normal refreshes are inside the allowed window.

Engineering Comment

The window rejects very early refreshes from a spinning loop and late refreshes from a stalled task. Release evidence should measure refresh intervals under worst-case diagnostic, communication and storage activity.

Plausibility Check

The normal range sits inside 45 to 130\ \text{ms} with 15\ \text{ms} of margin on the early side and 30\ \text{ms} on the late side, so the configuration is not overly tight.
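The window check above can be sketched as a small helper. This is only an illustrative screening calculation; the function name `window_margins` and its parameters are assumptions introduced here, not part of any watchdog driver API.

```python
def window_margins(nominal_ms, jitter_ms, open_ms, close_ms):
    """Return (early_margin_ms, late_margin_ms) for nominal +/- jitter.

    A negative value means the normal refresh range leaves the window
    on that side.
    """
    t_min = nominal_ms - jitter_ms  # earliest valid refresh
    t_max = nominal_ms + jitter_ms  # latest valid refresh
    return t_min - open_ms, close_ms - t_max


# Values from Exercise 1: 80 ms +/- 20 ms against a 45..130 ms window.
early, late = window_margins(80.0, 20.0, open_ms=45.0, close_ms=130.0)
print(early, late)  # 15.0 30.0
```

Both margins are positive, which matches the conclusion that normal refreshes fall inside the window.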

Exercise 2: Fault-to-Safe Watchdog Time

A stalled controller is detected when the watchdog expires after 180\ \text{ms}. Bootloader handoff takes 70\ \text{ms}, safety initialization takes 45\ \text{ms} and output disable takes 15\ \text{ms}. Requirement is safe output within 350\ \text{ms}. Compute margin.

Solution

T_{safe}=180+70+45+15=310\ \text{ms}

Margin is:

M=350-310=40\ \text{ms}

Engineering Comment

The recovery path passes, but the evidence must show that outputs are actually de-energized after reset and not briefly re-enabled during boot.

Plausibility Check

The watchdog timeout is the dominant term; adding roughly 130\ \text{ms} of boot and output time gives a total just above 300\ \text{ms}.
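A budget like this is easy to keep honest if each stage is named and summed in one place. The sketch below assumes a hypothetical helper `fault_to_safe_margin`; the stage labels are illustrative names for the terms given in the exercise.

```python
def fault_to_safe_margin(stage_ms, requirement_ms):
    """Sum the recovery stages and return (total_ms, margin_ms)."""
    total = sum(stage_ms.values())
    return total, requirement_ms - total


# Stage times from Exercise 2 (labels are illustrative).
stages = {
    "watchdog_timeout": 180,
    "bootloader_handoff": 70,
    "safety_init": 45,
    "output_disable": 15,
}
total, margin = fault_to_safe_margin(stages, requirement_ms=350)
print(total, margin)  # 310 40
```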

Exercise 3: Reset-Loop Escape Counter

Firmware enters degraded mode after 4 watchdog resets within a 10\ \text{min} rolling window. A unit logs resets at 0, 90, 210 and 390\ \text{s}. Does it enter degraded mode?

Solution

The span from first to fourth reset is:

\Delta t=390-0=390\ \text{s}

Convert the window:

10\ \text{min}=600\ \text{s}

Since:

390<600

four resets occur inside the window, so degraded mode is required.

Engineering Comment

Reset-loop escape prevents endless rebooting from hiding a persistent fault. The retained counter must survive resets and must not be cleared before diagnostic upload.

Plausibility Check

Four resets in six and a half minutes is clearly inside a ten-minute window.
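The rolling-window rule can be sketched as follows. This is a minimal model, assuming reset timestamps in seconds are available from retained storage and that the window is half-open (a reset exactly 600 s old no longer counts); both the function names and that boundary convention are assumptions, not stated in the exercise.

```python
def resets_in_window(reset_times_s, now_s, window_s=600):
    """Count resets with now - window < t <= now."""
    return sum(1 for t in reset_times_s if now_s - window_s < t <= now_s)


def must_degrade(reset_times_s, now_s, threshold=4, window_s=600):
    """True when the reset count inside the rolling window hits the threshold."""
    return resets_in_window(reset_times_s, now_s, window_s) >= threshold


# Reset log from Exercise 3, evaluated at the fourth reset.
log = [0, 90, 210, 390]
print(must_degrade(log, now_s=390))  # True
```

Evaluating the rule at each reset, rather than on a timer, keeps the decision tied to the event that updates the retained counter.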

Exercise 4: Firmware Rollback Timing

An update has stages: image validation 4.0\ \text{s}, swap preparation 2.5\ \text{s}, bank copy 12.0\ \text{s} and rollback marker write 0.8\ \text{s}. The maintenance window allows 25\ \text{s}. Compute timing margin.

Solution

T_{update}=4.0+2.5+12.0+0.8=19.3\ \text{s}
M=25.0-19.3=5.7\ \text{s}

Engineering Comment

The timing passes, but the release test should interrupt power at each state boundary and prove that the bootloader can select either the new valid image or the previous image.

Plausibility Check

The bank copy dominates the update time; a total under 20\ \text{s} is consistent.

Exercise 5: Brown-Out Hold-Up for Safe Commit

A controller needs 18\ \text{ms} to complete a safe nonvolatile commit after brown-out detection. Supply current during commit is 120\ \text{mA} at 3.3\ \text{V}. Available hold-up energy is 9.0\ \text{mJ}. Check margin.

Solution

Commit power is:

P=VI=3.3(0.120)=0.396\ \text{W}

Energy required is:

E=P t=0.396(0.018)=0.00713\ \text{J}=7.13\ \text{mJ}

Margin is:

M=9.0-7.13=1.87\ \text{mJ}

Engineering Comment

The energy screen passes, but capacitance tolerance, temperature, aging and brown-out threshold variation can consume the margin.

Plausibility Check

Roughly 0.4\ \text{W} for about 0.02\ \text{s} requires about 8\ \text{mJ}, close to the detailed result.
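The energy screen can be repeated with a derating factor to see how quickly the margin disappears. In the sketch below the function names are hypothetical and the 0.75 derating is purely an illustrative assumption, not a value from the exercise or from any capacitor datasheet.

```python
def commit_energy_mj(voltage_v, current_a, commit_time_s):
    """Energy drawn during the commit, in millijoules (E = V * I * t)."""
    return voltage_v * current_a * commit_time_s * 1000.0


def holdup_margin_mj(available_mj, required_mj, derating=1.0):
    """Margin after scaling the available energy by a derating factor."""
    return available_mj * derating - required_mj


required = commit_energy_mj(3.3, 0.120, 0.018)
print(round(required, 2))                                        # 7.13
print(round(holdup_margin_mj(9.0, required), 2))                 # 1.87
# Hypothetical 25% loss to tolerance, temperature and aging:
print(round(holdup_margin_mj(9.0, required, derating=0.75), 2))  # -0.38
```

Under that assumed derating the nominal 1.87 mJ margin turns negative, which is why the hold-up test needs stated tolerance, temperature and aging assumptions.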

Exercise 6: Flash Wear-Leveling Endurance

A flash sector is rated for 100000 erase cycles. Wear leveling spreads writes over 16 sectors. The firmware records one persistent event every 30\ \text{s}. Estimate years until the erase-cycle limit.

Solution

Assuming each record costs one sector erase, total supported records are:

N=100000(16)=1600000

Time is:

t=N(30)=48000000\ \text{s}

Convert to years:

t_y=\dfrac{48000000}{365(24)(3600)}=1.52\ \text{years}

Engineering Comment

This endurance is weak for most products. The design needs throttling, batching, event compression or memory with higher endurance.

Plausibility Check

One write every half minute is about one million writes per year, so a 1.6 million write capacity gives about one and a half years.
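The endurance estimate, and the effect of batching mentioned in the comment, can be sketched as below. The parameter `records_per_erase` is an assumption added for illustration: the exercise implicitly uses one record per erase, and the value 8 is a hypothetical batching factor.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def flash_life_years(erase_cycles, sectors, write_interval_s, records_per_erase=1):
    """Years until the erase-cycle limit, assuming ideal wear leveling."""
    total_records = erase_cycles * sectors * records_per_erase
    return total_records * write_interval_s / SECONDS_PER_YEAR


# Exercise 6 as stated: one erase per record.
print(round(flash_life_years(100_000, 16, 30), 2))                       # 1.52
# Hypothetical batching of 8 records per erase:
print(round(flash_life_years(100_000, 16, 30, records_per_erase=8), 1))  # 12.2
```

The model assumes perfectly even wear across sectors; real wear leveling is usually less uniform, so measured erase counters should back any claimed lifetime.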

Exercise 7: Memory-Leak Endurance

A device has 240\ \text{kB} free heap after startup. A guarded minimum of 80\ \text{kB} is required. A soak test estimates a leak of 0.35\ \text{kB/h}. How long until the guard is reached?

Solution

Usable leak budget is:

M=240-80=160\ \text{kB}

Time is:

t=\dfrac{160}{0.35}=457.1\ \text{h}

Convert to days:

t_d=\dfrac{457.1}{24}=19.0\ \text{days}

Engineering Comment

A nineteen-day leak endurance is usually not acceptable for unattended equipment. The leak should be fixed or bounded by a controlled restart strategy.

Plausibility Check

At about one third of a kilobyte per hour, losing 160\ \text{kB} takes several hundred hours.
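A linear leak model is enough for this screen. The helper below is a sketch with a hypothetical name; it assumes the leak rate from the soak test stays constant, which real firmware may not honor.

```python
def hours_to_guard(free_kb, guard_kb, leak_kb_per_h):
    """Hours until free heap drops to the guard level under a constant leak."""
    if leak_kb_per_h <= 0:
        return float("inf")  # no leak: the guard is never reached
    return (free_kb - guard_kb) / leak_kb_per_h


hours = hours_to_guard(240, 80, 0.35)
print(round(hours, 1), round(hours / 24, 1))  # 457.1 19.0
```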

Exercise 8: Stack Margin with Nested Interrupts

A task stack is 4096\ \text{bytes}. Measured high-water use is 2650\ \text{bytes}. Worst nested interrupt use is 420\ \text{bytes} and a release guard of 512\ \text{bytes} is required. Check margin.

Solution

Worst guarded use is:

S=2650+420+512=3582\ \text{bytes}

Remaining margin is:

M=4096-3582=514\ \text{bytes}

Engineering Comment

The screen passes with 514 bytes beyond the required guard. The result is valid only if the released build uses the same compiler options, debug settings and interrupt nesting as the measured build.

Plausibility Check

The guard alone is about 0.5\ \text{kB}, so only about another 0.5\ \text{kB} remains after all allowances.

Exercise 9: Diagnostic Coverage Risk Reduction

A failure mode has severity S=8, occurrence O=4 and detection D=7. A diagnostic reduces detection rating to 3. Compute RPN before and after.

Solution

Initial RPN:

RPN_0=SOD=8(4)(7)=224

After diagnostic:

RPN_1=8(4)(3)=96

Reduction is:

\Delta RPN=224-96=128

Engineering Comment

The numerical RPN improves, but release evidence must prove diagnostic coverage through fault injection or justified analysis.

Plausibility Check

Only the detection rating changes, so RPN scales by 3/7. That is a reduction of about 57\%, which matches the drop from 224 to 96.
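The RPN arithmetic is a plain product, sketched below. The 1 to 10 range check reflects the common FMEA rating convention and is an assumption here, since the exercise does not state its scale; the function name is likewise illustrative.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number as the product S * O * D.

    Assumes the common 1..10 rating scale (not stated in the exercise).
    """
    ratings = (("severity", severity), ("occurrence", occurrence), ("detection", detection))
    for name, rating in ratings:
        if not 1 <= rating <= 10:
            raise ValueError(f"{name} rating {rating} is outside 1..10")
    return severity * occurrence * detection


before = rpn(8, 4, 7)
after = rpn(8, 4, 3)
print(before, after, before - after)  # 224 96 128
```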

Exercise 10: Fault-Input Debounce to Safe Output

A fault input requires 20\ \text{ms} debounce. Diagnostic task period is 10\ \text{ms}, worst task response is 6\ \text{ms} and output driver turn-off is 4\ \text{ms}. Requirement is safe output within 50\ \text{ms}. Compute margin.

Solution

Worst detection and reaction time:

T=20+10+6+4=40\ \text{ms}

Margin is:

M=50-40=10\ \text{ms}

Engineering Comment

The path passes, but the debounce rule must not mask real faults that need faster hardware protection.

Plausibility Check

The debounce interval is half the total; the remaining terms add another 20\ \text{ms}.

Exercise 11: Boot Image CRC Coverage Time

A bootloader checks a 768\ \text{kB} image. CRC throughput is 12\ \text{MB/s}. The boot budget allows 90\ \text{ms} for image verification. Check margin using 1\ \text{MB}=1024\ \text{kB}.

Solution

Image size in MB:

S=\dfrac{768}{1024}=0.75\ \text{MB}

CRC time:

t=\dfrac{0.75}{12}=0.0625\ \text{s}=62.5\ \text{ms}

Margin:

M=90-62.5=27.5\ \text{ms}

Engineering Comment

The CRC fits the boot budget. Release evidence should also show what happens on CRC failure and how the fallback image is selected.

Plausibility Check

At 12\ \text{MB/s}, checking less than one megabyte should take less than one tenth of a second.
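The verification-time screen can be sketched as a one-line conversion. The helper name is hypothetical, and the model assumes the stated throughput is sustained over the whole image, ignoring flash wait states or cache effects.

```python
def crc_time_ms(image_kb, throughput_mb_s, kb_per_mb=1024):
    """Time to CRC an image at a sustained throughput, in milliseconds."""
    return image_kb / kb_per_mb / throughput_mb_s * 1000.0


t_ms = crc_time_ms(768, 12)
print(t_ms, 90 - t_ms)  # 62.5 27.5
```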

Exercise 12: Persistent Fault Log Retention

A retained fault log has 256 slots. A reset-loop fault can write at most 4 entries per hour after throttling. How many days of reset-loop evidence can be retained?

Solution

Retention time:

t=\dfrac{256}{4}=64\ \text{h}

Convert to days:

t_d=\dfrac{64}{24}=2.67\ \text{days}

Engineering Comment

The log keeps under three days of evidence, which may be enough for connected equipment but is weak for devices inspected monthly.

Plausibility Check

Four entries per hour means about 96 per day; 256 slots last a bit less than three days.

Exercise 13: Queue Backpressure to Degraded Mode

A safety monitor receives diagnostic events at 50\ \text{events/s} during a fault burst and can process 35\ \text{events/s}. Degraded mode should trigger before a 200 slot queue overflows. Starting from empty, how long until the queue is full?

Solution

Net growth rate is:

r=50-35=15\ \text{events/s}

Time to fill:

t=\dfrac{200}{15}=13.3\ \text{s}

Engineering Comment

The degraded-mode trigger must occur well before 13.3\ \text{s}, or the system loses diagnostic evidence during the burst.

Plausibility Check

A net accumulation of 15 events per second fills 150 slots in ten seconds, so 200 slots in about thirteen seconds is plausible.
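The fill-time estimate also shows how much time a trigger threshold buys. In the sketch below the function name is hypothetical, and the 150-slot trigger depth (75% of the queue) is an illustrative assumption, not a value from the exercise.

```python
def seconds_to_depth(depth, arrival_rate, service_rate, start_depth=0):
    """Seconds until the queue reaches a given depth at constant rates."""
    net = arrival_rate - service_rate
    if net <= 0:
        return float("inf")  # queue does not grow
    return (depth - start_depth) / net


# Exercise 13: time until the 200-slot queue is full.
print(round(seconds_to_depth(200, 50, 35), 1))  # 13.3
# Hypothetical degraded-mode trigger at 150 slots:
print(seconds_to_depth(150, 50, 35))            # 10.0
```

With that assumed trigger, degraded mode would start about 3.3 s before overflow, leaving 50 slots of headroom at the same burst rate.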

Exercise 14: Safe-State Interlock Proof Coverage

A test matrix has 12 required safe-state interlock cases. Ten passed, one was blocked by missing equipment and one failed. Compute completed-pass coverage and release decision if the requirement is 100\% pass.

Solution

Completed-pass coverage is:

C=\dfrac{10}{12}=0.833=83.3\%

Because the requirement is 100\% pass, and one case failed while another was never executed, release is blocked.

Engineering Comment

Blocked tests are not passes. A release exception would need formal risk acceptance and compensating evidence.

Plausibility Check

Two of twelve cases are not successful, so coverage should be clearly below 100\%.

Exercise 15: Watchdog Service Jitter Gate

A watchdog must be serviced between 60\ \text{ms} and 140\ \text{ms}. Trace data show minimum service interval 58\ \text{ms} and maximum 121\ \text{ms}. Decide status.

Solution

The maximum interval passes because:

121<140

The minimum interval fails because:

58<60

Therefore the service pattern fails the watchdog window gate.

Engineering Comment

Early service can hide a runaway loop. The watchdog should be refreshed only after key health checks have completed.

Plausibility Check

The violation is small, but a window watchdog intentionally treats early refresh as evidence of abnormal control flow.

Exercise 16: Degraded-Mode Capacity

In degraded mode, the device must keep essential sensing, alarm output and communication alive. Their CPU loads are 12\%, 8\% and 15\%. A 20\% guard is required. Check whether a 60\% degraded CPU budget is enough.

Solution

Essential load is:

U=0.12+0.08+0.15=0.35

With guard:

U_g=0.35+0.20=0.55

Budget margin:

M=0.60-0.55=0.05

So the budget passes with 5\% CPU margin.

Engineering Comment

The degraded feature set fits, but release evidence should verify that nonessential services are actually disabled and cannot starve the essential path.

Plausibility Check

The essential load is about one third of the processor; adding a large guard brings it just under the 60\% budget.

Exercise 17: Fault-Injection Sample Coverage

A release plan requires at least 30 injections for each of 6 critical firmware fault modes. The team completed 168 injections total, evenly distributed. Does it meet the plan?

Solution

Required injections:

N_{req}=30(6)=180

Completed per mode:

N_{mode}=\dfrac{168}{6}=28

Because:

28<30

the plan is not complete.

Engineering Comment

Even distribution helps, but every critical mode is under target. Release should wait for the missing injections or a justified plan revision.

Plausibility Check

The total is only 12 injections short, which corresponds to two missing tests per mode.
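Checking the requirement per mode, rather than on the total, avoids a well-covered mode hiding a thin one. The sketch below uses hypothetical mode names (`mode_1` to `mode_6`), since the exercise does not name the six fault modes.

```python
def injection_shortfall(completed_by_mode, required_per_mode=30):
    """Return {mode: missing_injections} for every mode under target."""
    return {
        mode: required_per_mode - count
        for mode, count in completed_by_mode.items()
        if count < required_per_mode
    }


# Exercise 17: 168 injections spread evenly over 6 modes (names are placeholders).
completed = {f"mode_{i}": 28 for i in range(1, 7)}
short = injection_shortfall(completed)
print(len(short), sum(short.values()))  # 6 12
```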

Exercise 18: Firmware Reliability Release Gate

A release gate requires all four checks to pass: watchdog recovery margin, rollback power-loss test, flash endurance margin and safe-state proof. Results are pass, pass, conditional pass and pass. Determine release status.

Solution

The gate is all-of:

G=R_w \land R_u \land R_f \land R_s

A conditional pass is not a full pass. Therefore:

G=\text{blocked}

Engineering Comment

Reliability release gates should not average evidence. A weak flash-endurance condition can invalidate an otherwise strong recovery package.

Plausibility Check

Because the rule requires every check to pass, one conditional result is enough to block release.
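The all-of rule can be sketched so that anything other than a full pass blocks release. The `Result` enum, the check names and the `release_gate` function are illustrative assumptions, not part of any particular release tool.

```python
from enum import Enum


class Result(Enum):
    PASS = "pass"
    CONDITIONAL = "conditional pass"
    FAIL = "fail"
    BLOCKED = "blocked"


def release_gate(results):
    """All-of gate: every check must be a full PASS, with no averaging."""
    blockers = [name for name, result in results.items() if result is not Result.PASS]
    status = "released" if not blockers else "blocked"
    return status, blockers


# Results from Exercise 18 (check names are illustrative).
results = {
    "watchdog_recovery_margin": Result.PASS,
    "rollback_power_loss_test": Result.PASS,
    "flash_endurance_margin": Result.CONDITIONAL,
    "safe_state_proof": Result.PASS,
}
status, blockers = release_gate(results)
print(status, blockers)  # blocked ['flash_endurance_margin']
```

Returning the list of blocking checks alongside the status keeps the reason for a blocked release visible in the evidence package.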

Common Release Mistakes

  • Refreshing the watchdog from a timer interrupt instead of after meaningful health checks.
  • Treating any reboot as recovery without proving safe outputs and retained diagnostics.
  • Testing firmware updates only under clean power and never at state transitions.
  • Ignoring flash wear, retained-log overflow, stack margin and heap drift.
  • Counting blocked or conditional safe-state tests as passed evidence.
  • Letting reset-loop counters clear before field diagnostics can be retrieved.

Validation Package Checklist

  • Watchdog configuration, service trace and fault-to-safe timing evidence.
  • Reset-loop escape policy, retained counters and degraded-mode entry proof.
  • Update rollback tests with power interruption at every critical state.
  • Brown-out hold-up test with tolerance, temperature and aging assumptions.
  • Flash erase-count, wear-leveling and write-throttling evidence.
  • Stack, heap, queue and retained-log high-water evidence from stress and soak tests.
  • Safe-state interlock tests, fault injection records and requirement traceability.