Firmware Update Rollback and Power-Loss Recovery Project

Computer engineering project for designing and validating a safe firmware-update and rollback mechanism with image slots, bootloader decisions, power-loss recovery, metadata integrity, fault tests, and release criteria.

This project designs and validates a firmware-update mechanism for an embedded controller that must survive interrupted transfers, power loss, corrupted images, incompatible configuration data, and failed first boot. The goal is not only to install new code. The goal is to prove that a failed update does not leave the product unusable or unsafe.

Firmware update reliability is a systems problem. It involves flash memory layout, bootloader behavior, image validation, metadata transactions, brown-out response, safe outputs, communication errors, service tools, version reporting, and field diagnostics. A robust design makes the update path recoverable before deployment, not after failed units appear in service.

Project Objective

Produce an engineering update-and-rollback package for a microcontroller-based product. The final deliverable should answer:

  1. Which image is active, which image is staged, and which image is trusted?
  2. What happens if power is lost during transfer, validation, metadata commit, first boot, or rollback?
  3. How is image integrity checked before any jump to application code?
  4. How are calibration and configuration data preserved across update and rollback?
  5. Which outputs remain safe during bootloader, update, reset, and recovery states?
  6. Which tests prove that the product can recover from every credible interrupted-update state?
  7. What evidence is required before the firmware package is released to field service?

The deliverable is a design-review package. It should include memory layout, bootloader state logic, power-loss test matrix, acceptance criteria, and release records.

Baseline Scenario

An industrial measurement controller uses a microcontroller with internal flash, a serial service port, analog inputs, digital outputs, and a relay interlock. The controller is installed in equipment that may lose power without warning. Service technicians need a field-update path because calibration features and communication behavior are updated after deployment.

The first release used a single application image. If an update was interrupted, the bootloader could not always distinguish an incomplete image from a valid one. The redesign introduces dual application slots, transactional metadata, integrity checks, and a probationary first-boot period.

Design Requirements

Requirement | Acceptance criterion
Existing firmware remains bootable during transfer | active image is not erased until staged image is valid
Corrupted candidate image is rejected | integrity check fails before activation
Power loss is recoverable | every tested interruption state boots active image, staged image, or service mode intentionally
Calibration data are preserved | calibration checksum and version remain valid after update and rollback
Outputs remain safe | relay interlock and actuator outputs remain disabled in bootloader and update mode
Failed new firmware rolls back | watchdog or application confirmation failure returns to prior valid image
Field diagnosis is possible | device reports active version, previous version, reset cause, and update result

The requirements separate update success from recovery success. A failed update is acceptable only if the device reaches a known safe state with diagnosable evidence.

Step 1: Define Flash Memory Layout

Use a dual-slot layout so the active image is preserved while the candidate image is received.

Region | Size
bootloader and service monitor | 96\ \text{kB}
application slot A | 384\ \text{kB}
application slot B | 384\ \text{kB}
configuration and calibration, dual copy | 32\ \text{kB}
event log | 96\ \text{kB}
diagnostic scratch and reserved area | 32\ \text{kB}
total flash | 1024\ \text{kB}

Check that the layout fits:

96+384+384+32+96+32=1024\ \text{kB}

The candidate firmware image is:

S_{image}=332\ \text{kB}

Slot size is:

S_{slot}=384\ \text{kB}

Slot margin is:

\displaystyle M_{slot}=\frac{S_{slot}-S_{image}}{S_{slot}}
\displaystyle M_{slot}=\frac{384-332}{384}=0.135

Therefore:

M_{slot}=13.5\%

Engineering Comment

The image fits, but the margin is not large. The project should record a growth limit or require a memory review when the image exceeds a threshold, for example 360\ \text{kB}. Without a threshold, diagnostic code, security checks, or communication libraries can silently consume the reserve needed for safe future updates.
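
The layout and margin checks above can be automated in a build or review script. The following is a minimal sketch, assuming the region sizes from the table and the example 360 kB threshold; the names `REGIONS_KB`, `slot_margin`, and `check_image` are illustrative and not part of the project.

```python
# Flash regions in kB, copied from the layout table in Step 1.
REGIONS_KB = {
    "bootloader": 96,
    "slot_a": 384,
    "slot_b": 384,
    "config_cal": 32,
    "event_log": 96,
    "scratch": 32,
}
TOTAL_FLASH_KB = 1024
GROWTH_THRESHOLD_KB = 360  # example review threshold from the text


def slot_margin(image_kb: float, slot_kb: float) -> float:
    """Fraction of the slot left unused by the image."""
    return (slot_kb - image_kb) / slot_kb


def check_image(image_kb: float) -> str:
    """Classify an image size against the slot size and the growth threshold."""
    slot_kb = min(REGIONS_KB["slot_a"], REGIONS_KB["slot_b"])
    if image_kb > slot_kb:
        return "reject: image does not fit"
    if image_kb > GROWTH_THRESHOLD_KB:
        return "review: growth threshold exceeded"
    return "ok"


layout_fits = sum(REGIONS_KB.values()) == TOTAL_FLASH_KB
margin = slot_margin(332, 384)
print(layout_fits, f"{margin:.1%}", check_image(332))
```

Running the check on every release candidate makes the growth limit an enforced rule rather than a remembered one.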

Step 2: Estimate Transfer Time and Exposure

The service link has an effective payload bandwidth of:

B_{eff}=24\ \text{kB/s}

Transfer time is:

\displaystyle t_{transfer}=\frac{S_{image}}{B_{eff}}
\displaystyle t_{transfer}=\frac{332}{24}=13.8\ \text{s}

If verification and metadata checks require another 2.5\ \text{s}, the update exposure time is approximately:

t_{exposure}=13.8+2.5=16.3\ \text{s}

Engineering Comment

The product must tolerate power loss during this whole interval. The design should assume interruption is normal, not exceptional. A field update that only works when power is stable is not a reliable update mechanism.
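
Exposure time grows when retries reduce the effective bandwidth. The sketch below repeats the calculation for the stated 24 kB/s and for two degraded rates; the degraded rates are assumptions chosen for illustration, not measured values.

```python
def exposure_time_s(image_kb: float, bandwidth_kb_s: float, verify_s: float) -> float:
    """Transfer time plus verification time, in seconds."""
    return image_kb / bandwidth_kb_s + verify_s


IMAGE_KB = 332
VERIFY_S = 2.5

nominal = exposure_time_s(IMAGE_KB, 24, VERIFY_S)

# Hypothetical degraded links (for example, heavy retry traffic).
for bandwidth in (24, 12, 6):
    t = exposure_time_s(IMAGE_KB, bandwidth, VERIFY_S)
    print(f"{bandwidth:>2} kB/s -> {t:.1f} s exposed to power loss")
```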

Step 3: Define Update Metadata as a Transaction

The bootloader should not rely on a single writable flag such as new_image_ready. Metadata should make partial writes detectable.

Use two metadata copies with:

Field | Purpose
slot identifier | states whether slot A or slot B is active
image size | prevents executing unbounded or incomplete content
image integrity value | rejects corrupted image data
image version | supports service diagnosis and compatibility checks
configuration format version | prevents incompatible data interpretation
boot attempt counter | supports probation and rollback
commit marker | distinguishes complete metadata from interrupted write
metadata integrity value | detects corrupted metadata

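The bootloader accepts a metadata copy only if the record is complete and internally consistent. When both copies are valid, it uses the newer one, which implies that each record also carries a sequence counter or an equivalent ordering field.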

Engineering Comment

Transactional metadata is more important than the exact field names. The engineering rule is that any interrupted write must look either like the old valid state or a rejected invalid state. It must not look like a valid command to boot unknown code.

Step 4: Check Brown-Out Hold-Up for Metadata Writes

Assume the controller has a local capacitor:

C=470\ \mu\text{F}

The brown-out supervisor detects falling supply at:

V_{high}=3.3\ \text{V}

The minimum voltage for reliable flash programming is:

V_{low}=2.7\ \text{V}

Available capacitor energy is:

\displaystyle E=\frac{1}{2}C(V_{high}^2-V_{low}^2)
\displaystyle E=\frac{1}{2}(470\times10^{-6})(3.3^2-2.7^2)
E=0.000846\ \text{J}

During a flash metadata write, assume:

I=80\ \text{mA}

and average voltage:

V_{avg}=3.0\ \text{V}

Load power is:

P=IV_{avg}=0.080(3.0)=0.240\ \text{W}

Estimated hold-up time is:

\displaystyle t_{hold}=\frac{E}{P}=\frac{0.000846}{0.240}=0.00353\ \text{s}

Therefore:

t_{hold}=3.5\ \text{ms}

If the metadata write time is:

t_{write}=8\ \text{ms}

then:

t_{hold}<t_{write}

Engineering Comment

The hardware cannot guarantee that an in-progress metadata write finishes after power loss, because the 3.5 ms hold-up is less than half of the 8 ms write time. The design must therefore use dual-copy metadata and commit markers, and it must block new critical writes once the brown-out supervisor has tripped. The firmware should also refuse to start a critical write if supply voltage is already near the brown-out threshold.
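
The hold-up estimate is easy to recompute when the capacitor, load current, or write time changes. The sketch below reproduces the numbers above; the `may_start_critical_write` guard and its 0.15 V guard band are hypothetical values added for illustration.

```python
def holdup_time_s(c_farad: float, v_high: float, v_low: float, i_load: float) -> float:
    """Usable capacitor energy divided by average load power.

    With V_avg = (v_high + v_low) / 2 this equals C * (v_high - v_low) / i_load,
    the constant-current discharge time.
    """
    energy = 0.5 * c_farad * (v_high**2 - v_low**2)
    v_avg = (v_high + v_low) / 2
    return energy / (i_load * v_avg)


def may_start_critical_write(v_supply: float, v_brownout: float = 3.3,
                             guard_v: float = 0.15) -> bool:
    """Refuse to begin a flash write when the supply is close to brown-out.

    The guard band is an assumed value, not a figure from the project.
    """
    return v_supply >= v_brownout + guard_v


t_hold = holdup_time_s(470e-6, 3.3, 2.7, 0.080)
t_write = 8e-3

print(f"hold-up {t_hold * 1e3:.2f} ms, write {t_write * 1e3:.0f} ms")
print("write can finish on stored energy:", t_hold >= t_write)
```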

Step 5: Define Bootloader Decision Logic

The bootloader should make a deterministic choice before any application code runs.

Condition at boot | Bootloader action
active slot valid and no candidate pending | boot active slot
candidate fully received and integrity valid | mark candidate as pending confirmation, then boot candidate
candidate incomplete or integrity invalid | reject candidate and boot previous active slot
candidate booted but did not confirm before timeout | roll back to previous active slot
both application slots invalid | remain in service mode with outputs disabled
metadata copies disagree | use newest valid copy, otherwise enter service mode
configuration version incompatible with candidate | reject candidate unless migration test is approved

The application must explicitly confirm that it has reached a safe operational state. Confirmation should occur only after initialization, configuration loading, basic self-test, and safe-output checks are complete.

Engineering Comment

Immediate activation is risky. A candidate image that starts executing is not necessarily safe. Probationary boot makes the new firmware prove that it can initialize, read configuration, supervise outputs, and communicate before it becomes permanent.
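
The decision table can be expressed as a pure function and exercised on a host before it is ported to the bootloader. This is a sketch, not the project's implementation: the `BootInputs` fields, the `Action` names, and the `MAX_PROBATION_BOOTS` limit are assumptions, and the model assumes a rejected or rolled-back candidate is marked so that it no longer counts as complete.

```python
from dataclasses import dataclass
from enum import Enum

MAX_PROBATION_BOOTS = 2  # assumed limit; the boot counter rule is project-specific


class Action(Enum):
    BOOT_ACTIVE = "boot active slot"
    BOOT_CANDIDATE = "mark pending, boot candidate"
    ROLL_BACK = "roll back to previous active slot"
    SERVICE_MODE = "service mode, outputs disabled"


@dataclass
class BootInputs:
    metadata_valid: bool        # at least one metadata copy passed its checks
    active_valid: bool          # previous active image passes integrity check
    candidate_complete: bool    # candidate fully received
    candidate_valid: bool       # candidate passes integrity check
    config_compatible: bool     # candidate can read the stored config format
    pending_confirmation: bool  # candidate booted earlier without confirming
    boot_attempts: int          # probation boots already used


def decide(s: BootInputs) -> Action:
    """Deterministic boot choice made before any application code runs."""
    if not s.metadata_valid:
        return Action.SERVICE_MODE
    fallback = Action.BOOT_ACTIVE if s.active_valid else Action.SERVICE_MODE
    candidate_ok = (s.candidate_complete and s.candidate_valid
                    and s.config_compatible)
    if s.pending_confirmation:
        if candidate_ok and s.boot_attempts < MAX_PROBATION_BOOTS:
            return Action.BOOT_CANDIDATE
        return Action.ROLL_BACK if s.active_valid else Action.SERVICE_MODE
    return Action.BOOT_CANDIDATE if candidate_ok else fallback


staged = BootInputs(True, True, True, True, True, False, 0)
corrupt = BootInputs(True, True, True, False, True, False, 0)
timed_out = BootInputs(True, True, True, True, True, True, 2)
bricked = BootInputs(True, False, False, False, True, False, 0)

for case in (staged, corrupt, timed_out, bricked):
    print(decide(case).value)
```

Keeping the logic free of hardware access means every row of the table can be unit-tested, and the same cases can later be replayed on the target.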

Step 6: Preserve Calibration and Configuration

Calibration data should not be rewritten casually during firmware update. Treat it as controlled engineering data.

Minimum checks:

  1. calibration copy A and copy B both include version, length, and integrity value;
  2. the new firmware declares which calibration format versions it can read;
  3. any migration has a reversible or backed-up path;
  4. rollback can still interpret the stored configuration;
  5. service tools can report calibration version and update result.

If the new firmware requires a configuration migration that the old firmware cannot read, rollback may be blocked. In that case the release package must either make the migration backward compatible or treat rollback as a controlled service action, not an automatic promise.
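
The compatibility rule can be checked mechanically from the format versions each firmware declares. The sketch below is illustrative only: the function name `plan_update`, the returned strings, and the version numbers in the examples are hypothetical.

```python
from typing import Optional, Set


def plan_update(stored_format: int, new_reads: Set[int], old_reads: Set[int],
                migrate_to: Optional[int] = None) -> str:
    """Decide how an update interacts with stored calibration data.

    stored_format: format version currently in flash
    new_reads / old_reads: formats the candidate and the current firmware can read
    migrate_to: format written by a migration step, if the update performs one
    """
    final_format = stored_format if migrate_to is None else migrate_to
    if final_format not in new_reads:
        return "reject candidate"
    if final_format in old_reads:
        return "accept, automatic rollback allowed"
    return "accept only with backup, rollback is a controlled service action"


print(plan_update(3, new_reads={3, 4}, old_reads={2, 3}))
print(plan_update(3, new_reads={4}, old_reads={2, 3}))
print(plan_update(3, new_reads={4}, old_reads={2, 3}, migrate_to=4))
```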

Step 7: Safe Outputs During Update

The update mode must not energize outputs by accident.

Use a conservative output contract:

State | Output rule
reset | hardware defaults force relay and actuator enables off
bootloader | outputs remain disabled and are not configured as active drivers
receiving update | outputs remain disabled; measurement functions may be limited
first candidate boot | outputs remain disabled until self-test and interlock checks pass
rollback | outputs remain disabled until old firmware enters normal operational mode
service mode | outputs disabled; diagnostic communication allowed

Engineering Comment

Safe update design should not rely only on application firmware. Reset defaults, pull-downs, interlocks, and power-stage enables must be reviewed because the bootloader and partially configured firmware run before the normal safety logic is active.

Step 8: Validation Matrix

Run interruption tests on representative hardware, not only in simulation.

Test | Acceptance criterion
interrupt transfer at 0\%, 25\%, 50\%, 90\%, and 99\% | previous active image boots and reports rejected candidate
corrupt one image block | bootloader rejects candidate before activation
power loss during metadata commit | bootloader selects old valid metadata or service mode, never unknown image
power loss during first candidate boot | candidate remains pending or rolls back according to boot counter rule
candidate watchdog reset during probation | old image boots and reset cause is recorded
incompatible configuration format | candidate is rejected or migration evidence is attached
calibration copy A corrupted | copy B is used and fault is logged
both image slots invalid | service mode starts with outputs disabled
update with high bus traffic | transfer retry logic does not starve critical diagnostics
brown-out near flash write | critical writes are blocked or recoverable

The test log should record firmware versions, hardware revision, power supply setup, interruption timing, reset cause, selected boot path, active slot, and output state.
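
Before hardware testing, the metadata-commit case can be swept exhaustively in a host model by cutting power after every byte of the write. This is a simplified sketch: the two-field record, the CRC-32 integrity value, and byte-granular programming are modelling assumptions, and real flash may program in larger words or leave partially programmed cells, which only hardware tests can cover.

```python
import struct
import zlib

ERASED = 0xFF
REC = struct.Struct("<IB")  # simplified record: sequence, active slot


def pack(seq: int, slot: int) -> bytes:
    body = REC.pack(seq, slot)
    return body + struct.pack("<I", zlib.crc32(body))


def unpack(raw: bytes):
    """Return (sequence, slot), or None if the copy is not a valid record."""
    body = raw[:REC.size]
    (crc,) = struct.unpack("<I", raw[REC.size:])
    seq, slot = REC.unpack(body)
    if slot not in (0, 1) or zlib.crc32(body) != crc:
        return None
    return seq, slot


def select(copies):
    """Newest valid copy wins; tuples compare by sequence first."""
    valid = [r for r in map(unpack, copies) if r is not None]
    return max(valid, default=None)


def commit_with_power_loss(old: bytes, new: bytes, cut_after: int):
    """Copy 0 keeps the old record; copy 1 is erased, then programmed
    byte by byte until power is cut after `cut_after` bytes."""
    target = bytearray([ERASED] * len(new))
    target[:cut_after] = new[:cut_after]
    return [old, bytes(target)]


old, new = pack(7, 0), pack(8, 1)
results = [select(commit_with_power_loss(old, new, k))
           for k in range(len(new) + 1)]

for k, r in enumerate(results):
    print(f"cut after {k} bytes -> boots slot {r[1]}")
```

Every interruption point should resolve to either the old record or the fully written new one. The same sweep on hardware replaces the byte index with a timed power cut.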

Step 9: Risk Review

Initial failure mode:

Failure mode | Cause | Effect | Initial rating
incomplete firmware image becomes active | interrupted update plus weak metadata | device does not boot, service visit required, outputs may be ambiguous | S=9,\ O=3,\ D=6

Initial risk priority number, the product of severity, occurrence, and detection ratings:

RPN_{initial}=S\cdot O\cdot D=9(3)(6)=162

After redesign:

Control | Effect
dual application slots | old image remains available during transfer
image integrity check | corrupted candidate is rejected
transactional metadata | interrupted state is detectable
probationary first boot | failed candidate rolls back
safe hardware output defaults | update and boot states do not energize actuators
interruption validation matrix | recovery behavior is demonstrated

Residual rating:

S=9,\ O=1,\ D=2

Residual risk priority number:

RPN_{residual}=9(1)(2)=18

Engineering Comment

Severity remains high because a bricked or unsafe controller is still a serious failure. The design reduces occurrence and improves detection. That is the right interpretation of the risk reduction.
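
The same arithmetic can be kept alongside the risk table so that ratings and RPN values cannot drift apart. A minimal sketch, using the ratings stated in this step:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: product of the three FMEA ratings."""
    return severity * occurrence * detection


rpn_initial = rpn(9, 3, 6)
rpn_residual = rpn(9, 1, 2)

# Severity is unchanged; the reduction comes from occurrence and detection.
print(rpn_initial, rpn_residual, rpn_initial / rpn_residual)
```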

Final Release Package

The project is complete when the release package contains:

  1. memory map with slot sizes and growth threshold;
  2. bootloader state diagram or decision table;
  3. metadata format and dual-copy recovery rule;
  4. image integrity and compatibility checks;
  5. calibration and configuration migration policy;
  6. safe-output evidence for reset, bootloader, update, rollback, and service modes;
  7. power-loss and interrupted-transfer test report;
  8. rollback test report after candidate failure;
  9. field diagnostic fields and service-tool screenshots or logs;
  10. release decision with known limits and update instructions.

The release decision should not say only that the new firmware works. It should say that the product remains recoverable when the new firmware, update transfer, power supply, or metadata write fails.

Common Engineering Mistakes

Common mistakes include testing only successful updates, erasing the old image too early, using a single update-complete flag, ignoring configuration rollback, and assuming a watchdog reset is a complete recovery plan.

Other mistakes include validating the update path on a development board instead of production hardware, leaving outputs undefined in the bootloader, forgetting brown-out behavior during flash writes, omitting update status from field logs, and releasing a new firmware image without proving how to return to the old one.

The practical lesson is that firmware update is part of product reliability. A maintainable embedded product can accept new code, reject bad code, recover from interruption, preserve controlled data, and explain its own update state to the engineers who must support it.
