Firmware Update Rollback and Power-Loss Recovery Project

Computer engineering project for designing and validating a safe firmware-update and rollback mechanism with image slots, bootloader decisions, power-loss recovery, metadata integrity, fault tests, and release criteria.

Branch: Computer Engineering
Content: Project
Updated: Jun 22, 2026
Revision: v1.0.0 · reviewed

This project designs and validates a firmware-update mechanism for an embedded controller that must survive interrupted transfers, power loss, corrupted images, incompatible configuration data, and failed first boot. The goal is not only to install new code but to prove that a failed update does not leave the product unusable or unsafe.

Firmware update reliability is a systems problem. It involves flash memory layout, bootloader behavior, image validation, metadata transactions, brown-out response, safe outputs, communication errors, service tools, version reporting, and field diagnostics. A robust design makes the update path recoverable before deployment, not after failed units appear in service.

Project Objective

Produce an engineering update-and-rollback package for a microcontroller-based product. The final deliverable should answer:

Which image is active, which image is staged, and which image is trusted?
What happens if power is lost during transfer, validation, metadata commit, first boot, or rollback?
How is image integrity checked before any jump to application code?
How are calibration and configuration data preserved across update and rollback?
Which outputs remain safe during bootloader, update, reset, and recovery states?
Which tests prove that the product can recover from every credible interrupted-update state?
What evidence is required before the firmware package is released to field service?

The deliverable is a design-review package. It should include memory layout, bootloader state logic, power-loss test matrix, acceptance criteria, and release records.

Baseline Scenario

An industrial measurement controller uses a microcontroller with internal flash, a serial service port, analog inputs, digital outputs, and a relay interlock. The controller is installed in equipment that may lose power without warning. Service technicians need a field-update path because calibration features and communication behavior are updated after deployment.

The first release used a single application image. If an update was interrupted, the bootloader could not always distinguish an incomplete image from a valid one. The redesign introduces dual application slots, transactional metadata, integrity checks, and a probationary first-boot period.

Design Requirements

Requirement	Acceptance criterion
Existing firmware remains bootable during transfer	active image is not erased until staged image is valid
Corrupted candidate image is rejected	integrity check fails before activation
Power loss is recoverable	every tested interruption state ends in an intentional boot of the active image, the staged image, or service mode
Calibration data are preserved	calibration checksum and version remain valid after update and rollback
Outputs remain safe	relay interlock and actuator outputs remain disabled in bootloader and update mode
Failed new firmware rolls back	watchdog or application confirmation failure returns to prior valid image
Field diagnosis is possible	device reports active version, previous version, reset cause, and update result

The requirements separate update success from recovery success. A failed update is acceptable only if the device reaches a known safe state with diagnosable evidence.

Step 1: Define Flash Memory Layout

Use a dual-slot layout so the active image is preserved while the candidate image is received.

Region	Size
bootloader and service monitor	$96\ \text{kB}$
application slot A	$384\ \text{kB}$
application slot B	$384\ \text{kB}$
configuration and calibration, dual copy	$32\ \text{kB}$
event log	$96\ \text{kB}$
diagnostic scratch and reserved area	$32\ \text{kB}$
total flash	$1024\ \text{kB}$

Check that the layout fits:

96+384+384+32+96+32=1024\ \text{kB}

The candidate firmware image is:

S_{image}=332\ \text{kB}

Slot size is:

S_{slot}=384\ \text{kB}

Slot margin is:

\displaystyle M_{slot}=\frac{S_{slot}-S_{image}}{S_{slot}}

\displaystyle M_{slot}=\frac{384-332}{384}=0.135

Therefore:

M_{slot}=13.5\%

Engineering Comment

The image fits, but the margin is not large. The project should record a growth limit or require a memory review when the image exceeds a threshold, for example $360\ \text{kB}$, at which point the slot margin would be about $6\%$. Without a threshold, diagnostic code, security checks, or communication libraries can silently consume the reserve needed for safe future updates.
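The layout check and the growth threshold can be kept as an executable record in the review package. The following is a minimal host-side sketch, not part of the firmware; the region names, `slot_margin`, and `needs_memory_review` are hypothetical names chosen for illustration, and the 360 kB threshold is only the example value mentioned above.

```python
# Flash regions in kB, copied from the layout table.
FLASH_TOTAL_KB = 1024
REGIONS_KB = {
    "bootloader_and_service_monitor": 96,
    "application_slot_a": 384,
    "application_slot_b": 384,
    "configuration_and_calibration": 32,
    "event_log": 96,
    "diagnostic_scratch_and_reserved": 32,
}
REVIEW_THRESHOLD_KB = 360  # example threshold, an assumption to be set by the project


def slot_margin(image_kb: float, slot_kb: float) -> float:
    """Return the unused fraction of a slot for a given image size."""
    if image_kb > slot_kb:
        raise ValueError("image does not fit in slot")
    return (slot_kb - image_kb) / slot_kb


def needs_memory_review(image_kb: float) -> bool:
    """True when the image has grown past the review threshold."""
    return image_kb > REVIEW_THRESHOLD_KB


# The layout must account for every kB of flash, no more and no less.
assert sum(REGIONS_KB.values()) == FLASH_TOTAL_KB

print(f"slot margin at 332 kB: {slot_margin(332, 384):.3f}")
print(f"review required at 332 kB: {needs_memory_review(332)}")
```

Running a check like this in the build pipeline would turn the growth threshold into a gate rather than a note in a document.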

Step 2: Estimate Transfer Time and Exposure

The service link has an effective payload bandwidth of:

B_{eff}=24\ \text{kB/s}

Transfer time is:

\displaystyle t_{transfer}=\frac{S_{image}}{B_{eff}}

\displaystyle t_{transfer}=\frac{332}{24}=13.8\ \text{s}

If verification and metadata checks require another $2.5\ \text{s}$, the update exposure time is approximately:

t_{exposure}=13.8+2.5=16.3\ \text{s}

Engineering Comment

The product must tolerate power loss during this whole interval. The design should assume interruption is normal, not exceptional. A field update that only works when power is stable is not a reliable update mechanism.
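Because the exposure window grows with the image, it can help to recompute it whenever the image size or link bandwidth changes. The sketch below assumes the verification time stays fixed at 2.5 s regardless of image size, which is a simplification; `exposure_seconds` is a hypothetical helper name.

```python
def exposure_seconds(image_kb: float, bandwidth_kb_s: float, verify_s: float) -> float:
    """Transfer time plus verification time, in seconds."""
    if bandwidth_kb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return image_kb / bandwidth_kb_s + verify_s


# Current image, and the worst case of an image that fills the whole slot.
for size_kb in (332, 384):
    print(f"{size_kb} kB image: {exposure_seconds(size_kb, 24.0, 2.5):.1f} s exposure")
```

The worst-case figure for a full slot is the more useful number for planning interruption tests, since it bounds the window for any future image that still fits.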

Step 3: Define Update Metadata as a Transaction

The bootloader should not rely on a single writable flag such as new_image_ready. Metadata should make partial writes detectable.

Use two metadata copies with:

Field	Purpose
slot identifier	states whether slot A or slot B is active
image size	prevents executing unbounded or incomplete content
image integrity value	rejects corrupted image data
image version	supports service diagnosis and compatibility checks
configuration format version	prevents incompatible data interpretation
boot attempt counter	supports probation and rollback
commit marker	distinguishes complete metadata from interrupted write
metadata integrity value	detects corrupted metadata

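The bootloader accepts metadata only if the record is complete, internally consistent, and newer than the alternate copy. Deciding which copy is newer requires an ordering field, such as a monotonically increasing sequence number, in addition to the fields listed above.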

Engineering Comment

Transactional metadata is more important than the exact field names. The engineering rule is that any interrupted write must look either like the old valid state or a rejected invalid state. It must not look like a valid command to boot unknown code.

Step 4: Check Brown-Out Hold-Up for Metadata Writes

Assume the controller has a local capacitor:

C=470\ \mu\text{F}

The brown-out supervisor detects falling supply at:

V_{high}=3.3\ \text{V}

The minimum voltage for reliable flash programming is:

V_{low}=2.7\ \text{V}

Available capacitor energy is:

\displaystyle E=\frac{1}{2}C(V_{high}^2-V_{low}^2)

\displaystyle E=\frac{1}{2}(470\times10^{-6})(3.3^2-2.7^2)

E=0.000846\ \text{J}

During a flash metadata write, assume:

I=80\ \text{mA}

and average voltage:

V_{avg}=3.0\ \text{V}

Load power is:

P=IV=0.080(3.0)=0.240\ \text{W}

Estimated hold-up time is:

\displaystyle t_{hold}=\frac{E}{P}=\frac{0.000846}{0.240}=0.00353\ \text{s}

Therefore:

t_{hold}=3.5\ \text{ms}

If the metadata write time is:

t_{write}=8\ \text{ms}

then:

t_{hold}<t_{write}

Engineering Comment

The hardware cannot guarantee that an in-progress metadata write finishes after power loss. The design must therefore use dual-copy metadata, commit markers, and a brown-out interlock that blocks new critical writes once the supervisor signals a falling supply. The firmware should also refuse to start a critical write if supply voltage is already near the brown-out threshold.
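The hold-up estimate is easy to rerun when the capacitor, load, or write time changes. The sketch below repeats the calculation under the same constant-power simplification and also inverts it to show how much capacitance would be needed to cover the full write. It ignores capacitor tolerance, ESR, and regulator dropout, and the function names and the guard-band check are hypothetical illustrations rather than a defined design.

```python
def holdup_seconds(c_farads: float, v_high: float, v_low: float, load_watts: float) -> float:
    """Time a capacitor can carry a constant-power load from v_high down to v_low."""
    energy_j = 0.5 * c_farads * (v_high**2 - v_low**2)
    return energy_j / load_watts


def capacitance_for(t_seconds: float, v_high: float, v_low: float, load_watts: float) -> float:
    """Capacitance needed to carry a constant-power load for t_seconds."""
    return 2.0 * load_watts * t_seconds / (v_high**2 - v_low**2)


def may_start_critical_write(v_supply: float, v_brownout: float, guard_v: float) -> bool:
    """Refuse a new critical write when supply is within guard_v of the threshold."""
    return v_supply >= v_brownout + guard_v


LOAD_W = 0.080 * 3.0  # 80 mA at an average of 3.0 V
t_hold = holdup_seconds(470e-6, 3.3, 2.7, LOAD_W)
c_needed = capacitance_for(8e-3, 3.3, 2.7, LOAD_W)
print(f"hold-up: {t_hold * 1e3:.2f} ms")
print(f"capacitance for an 8 ms write: {c_needed * 1e6:.0f} uF")
```

With these inputs the required capacitance comes out a little above 1000 µF, more than twice the fitted value, which supports the conclusion that the write cannot be protected by stored energy alone.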

Step 5: Define Bootloader Decision Logic

The bootloader should make a deterministic choice before any application code runs.

Condition at boot	Bootloader action
active slot valid and no candidate pending	boot active slot
candidate fully received and integrity valid	mark candidate as pending confirmation, then boot candidate
candidate incomplete or integrity invalid	reject candidate and boot previous active slot
candidate booted but did not confirm before timeout	roll back to previous active slot
both application slots invalid	remain in service mode with outputs disabled
metadata copies disagree	use newest valid copy, otherwise enter service mode
configuration version incompatible with candidate	reject candidate unless migration test is approved

The application must explicitly confirm that it has reached a safe operational state. Confirmation should occur only after initialization, configuration loading, basic self-test, and safe-output checks are complete.

Engineering Comment

Immediate activation is risky. A candidate image that starts executing is not necessarily safe. Probationary boot makes the new firmware prove that it can initialize, read configuration, supervise outputs, and communicate before it becomes permanent.
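The decision table can be expressed as a pure function and tested exhaustively on a host. The following is one possible reading of the table, not the project's bootloader: the input flags, the `Action` names, and the limit of three probation boots are assumptions made for illustration, and the step that increments the boot counter before jumping to the candidate is left out.

```python
from dataclasses import dataclass
from enum import Enum, auto

MAX_PROBATION_BOOTS = 3  # hypothetical limit; the project must define its own


class Action(Enum):
    BOOT_ACTIVE = auto()
    BOOT_CANDIDATE = auto()
    ROLL_BACK = auto()
    SERVICE_MODE = auto()


@dataclass(frozen=True)
class BootInputs:
    metadata_valid: bool      # at least one metadata copy passed its checks
    active_valid: bool        # last confirmed image passes integrity check
    candidate_received: bool  # a candidate transfer was completed
    candidate_valid: bool     # candidate size and integrity value match
    config_compatible: bool   # candidate can read the stored configuration
    on_probation: bool        # candidate was booted but has not confirmed
    boot_attempts: int = 0    # probation boots already used


def decide(s: BootInputs) -> Action:
    """Choose a boot path before any application code runs."""
    if not s.metadata_valid:
        return Action.SERVICE_MODE
    if s.on_probation:
        if s.candidate_valid and s.boot_attempts < MAX_PROBATION_BOOTS:
            return Action.BOOT_CANDIDATE
        return Action.ROLL_BACK if s.active_valid else Action.SERVICE_MODE
    if s.candidate_received and s.candidate_valid and s.config_compatible:
        return Action.BOOT_CANDIDATE
    return Action.BOOT_ACTIVE if s.active_valid else Action.SERVICE_MODE


# A candidate that used all probation boots without confirming is rolled back.
print(decide(BootInputs(True, True, True, True, True, True, boot_attempts=3)))
```

Keeping the decision free of hardware access means every combination of inputs can be enumerated in a unit test, which gives direct evidence that no state leads to an unintended boot path.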

Step 6: Preserve Calibration and Configuration

Calibration data should not be rewritten casually during firmware update. Treat it as controlled engineering data.

Minimum checks:

calibration copy A and copy B both include version, length, and integrity value;
the new firmware declares which calibration format versions it can read;
any migration has a reversible or backed-up path;
rollback can still interpret the stored configuration;
service tools can report calibration version and update result.

If the new firmware requires a configuration migration that the old firmware cannot read, rollback may be blocked. In that case the release package must either make the migration backward compatible or treat rollback as a controlled service action, not an automatic promise.
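The compatibility reasoning above can be written down as a small pre-release check. This is a sketch of one interpretation: the format numbers, the `Plan` outcomes, and the idea that each firmware declares a set of readable formats plus the single format it writes are assumptions for illustration.

```python
from enum import Enum
from typing import FrozenSet


class Plan(Enum):
    ACCEPT = "accept; automatic rollback is preserved"
    ACCEPT_SERVICE_ROLLBACK = "accept; rollback becomes a controlled service action"
    REJECT = "reject candidate"


def plan_update(
    stored_format: int,
    new_reads: FrozenSet[int],
    new_writes: int,
    old_reads: FrozenSet[int],
    migration_backed_up: bool,
) -> Plan:
    """Classify an update by what it does to stored calibration data."""
    if stored_format not in new_reads:
        return Plan.REJECT  # new firmware cannot interpret the stored data
    if new_writes in old_reads:
        return Plan.ACCEPT  # old firmware can still read the data after rollback
    if migration_backed_up:
        return Plan.ACCEPT_SERVICE_ROLLBACK  # restore needs a deliberate service step
    return Plan.REJECT  # one-way migration with no recovery path


# New firmware migrates format 2 to format 3, which the old firmware cannot read.
print(plan_update(2, frozenset({2, 3}), 3, frozenset({1, 2}), migration_backed_up=True).value)
```

The value of a check like this is that the rollback promise is decided at release time, from declared format sets, rather than discovered in the field.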

Step 7: Safe Outputs During Update

The update mode must not energize outputs by accident.

Use a conservative output contract:

State	Output rule
reset	hardware defaults force relay and actuator enables off
bootloader	outputs remain disabled and are not configured as active drivers
receiving update	outputs remain disabled; measurement functions may be limited
first candidate boot	outputs remain disabled until self-test and interlock checks pass
rollback	outputs remain disabled until old firmware enters normal operational mode
service mode	outputs disabled; diagnostic communication allowed

Engineering Comment

Safe update design should not rely only on application firmware. Reset defaults, pull-downs, interlocks, and power-stage enables must be reviewed because the bootloader and partially configured firmware run before the normal safety logic is active.
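The output contract can also be captured as a lookup that test scripts compare against measured output states. The sketch below is a host-side model only and does not replace the hardware defaults described above; the `State` names mirror the table, and `NORMAL` is an assumed operational state that the table does not list.

```python
from enum import Enum, auto


class State(Enum):
    RESET = auto()
    BOOTLOADER = auto()
    RECEIVING_UPDATE = auto()
    FIRST_CANDIDATE_BOOT = auto()
    ROLLBACK = auto()
    SERVICE_MODE = auto()
    NORMAL = auto()  # assumed operational state, not a row in the table


# States in which outputs stay disabled regardless of any check result.
ALWAYS_DISABLED = frozenset({
    State.RESET,
    State.BOOTLOADER,
    State.RECEIVING_UPDATE,
    State.ROLLBACK,
    State.SERVICE_MODE,
})


def outputs_permitted(state: State, self_test_ok: bool, interlock_ok: bool) -> bool:
    """True only when the contract allows relay and actuator enables."""
    if state in ALWAYS_DISABLED:
        return False
    return self_test_ok and interlock_ok


for state in State:
    print(f"{state.name}: {outputs_permitted(state, True, True)}")
```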

Step 8: Validation Matrix

Run interruption tests on representative hardware, not only in simulation.

Test	Acceptance criterion
interrupt transfer at $0\%$, $25\%$, $50\%$, $90\%$, and $99\%$	previous active image boots and reports rejected candidate
corrupt one image block	bootloader rejects candidate before activation
power loss during metadata commit	bootloader selects old valid metadata or service mode, never unknown image
power loss during first candidate boot	candidate remains pending or rolls back according to boot counter rule
candidate watchdog reset during probation	old image boots and reset cause is recorded
incompatible configuration format	candidate is rejected or migration evidence is attached
calibration copy A corrupted	copy B is used and fault is logged
both image slots invalid	service mode starts with outputs disabled
update with high bus traffic	transfer retry logic does not starve critical diagnostics
brown-out near flash write	critical writes are blocked or recoverable

The test log should record firmware versions, hardware revision, power supply setup, interruption timing, reset cause, selected boot path, active slot, and output state.
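A test script needs concrete interruption points and a uniform log record. The sketch below derives byte offsets for the transfer-interruption row and defines a record with the fields listed above. It assumes 1 kB means 1024 bytes; the record type, field names, and example values are hypothetical, and the pass check covers only the interrupted-transfer criterion.

```python
from dataclasses import dataclass
from typing import List, Sequence

IMAGE_BYTES = 332 * 1024  # assumes 1 kB = 1024 bytes
CUT_PERCENTS = (0, 25, 50, 90, 99)


def cut_offsets(image_bytes: int, percents: Sequence[int] = CUT_PERCENTS) -> List[int]:
    """Byte offsets at which the transfer is interrupted."""
    return [image_bytes * p // 100 for p in percents]


@dataclass(frozen=True)
class InterruptionRecord:
    active_version: str
    candidate_version: str
    hardware_revision: str
    supply_setup: str
    cut_offset_bytes: int
    reset_cause: str
    boot_path: str        # "active", "candidate", or "service"
    active_slot: str
    outputs_disabled: bool


def transfer_cut_passes(rec: InterruptionRecord, expected_slot: str) -> bool:
    """Interrupted transfer must boot the previous image with outputs off."""
    return (
        rec.boot_path == "active"
        and rec.active_slot == expected_slot
        and rec.outputs_disabled
    )


# Example record with made-up values, for illustration only.
example = InterruptionRecord(
    "1.0.0", "1.1.0", "rev-B", "bench supply, relay cut",
    cut_offsets(IMAGE_BYTES)[2], "power-on", "active", "A", True,
)
print(cut_offsets(IMAGE_BYTES), transfer_cut_passes(example, "A"))
```

Fixed percentages are a starting point; sweeping the cut offset across flash page and sector boundaries would usually expose more recovery states than five points can.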

Step 9: Risk Review

Initial failure mode:

Failure mode	Cause	Effect	Initial rating
incomplete firmware image becomes active	interrupted update plus weak metadata	device does not boot, service visit required, outputs may be ambiguous	$S=9,\ O=3,\ D=6$

Initial risk priority number:

RPN_{initial}=9(3)(6)=162

After redesign:

Control	Effect
dual application slots	old image remains available during transfer
image integrity check	corrupted candidate is rejected
transactional metadata	interrupted state is detectable
probationary first boot	failed candidate rolls back
safe hardware output defaults	update and boot states do not energize actuators
interruption validation matrix	recovery behavior is demonstrated

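Residual rating, with severity unchanged at $S=9$ and the controls taken to reduce the other factors to $O=1$ and $D=2$: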

RPN_{residual}=9(1)(2)=18

Engineering Comment

Severity remains high because a bricked or unsafe controller is still a serious failure. The design reduces occurrence and improves detection. That is the right interpretation of the risk reduction.
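The two ratings can be compared with a short calculation. The sketch assumes the common 1 to 10 scale for each factor, which the text does not state explicitly; `rpn` is a hypothetical helper name.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number as the product of the three ratings."""
    for name, value in (("severity", severity), ("occurrence", occurrence), ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} rating must be between 1 and 10")
    return severity * occurrence * detection


initial = rpn(9, 3, 6)
residual = rpn(9, 1, 2)
print(f"initial {initial}, residual {residual}, reduction {1 - residual / initial:.0%}")
```

The reduction comes entirely from the occurrence and detection factors, which matches the interpretation above: the controls make the failure rarer and easier to catch, not less severe.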

Final Release Package

The project is complete when the release package contains:

memory map with slot sizes and growth threshold;
bootloader state diagram or decision table;
metadata format and dual-copy recovery rule;
image integrity and compatibility checks;
calibration and configuration migration policy;
safe-output evidence for reset, bootloader, update, rollback, and service modes;
power-loss and interrupted-transfer test report;
rollback test report after candidate failure;
field diagnostic fields and service-tool screenshots or logs;
release decision with known limits and update instructions.

The release decision should not say only that the new firmware works. It should say that the product remains recoverable when the new firmware, update transfer, power supply, or metadata write fails.

Common Engineering Mistakes

Common mistakes include testing only successful updates, erasing the old image too early, using a single update-complete flag, ignoring configuration rollback, and assuming a watchdog reset is a complete recovery plan.

Other mistakes include validating the update path on a development board instead of production hardware, leaving outputs undefined in the bootloader, forgetting brown-out behavior during flash writes, omitting update status from field logs, and releasing a new firmware image without proving how to return to the old one.

The practical lesson is that firmware update is part of product reliability. A maintainable embedded product can accept new code, reject bad code, recover from interruption, preserve controlled data, and explain its own update state to the engineers who must support it.
