Project
Firmware Update Rollback and Power-Loss Recovery Project
Computer engineering project for designing and validating a safe firmware-update and rollback mechanism with image slots, bootloader decisions, power-loss recovery, metadata integrity, fault tests, and release criteria.
This project designs and validates a firmware-update mechanism for an embedded controller that must survive interrupted transfers, power loss, corrupted images, incompatible configuration data, and failed first boot. The goal is not only to install new code. The goal is to prove that a failed update does not leave the product unusable or unsafe.
Firmware update reliability is a systems problem. It involves flash memory layout, bootloader behavior, image validation, metadata transactions, brown-out response, safe outputs, communication errors, service tools, version reporting, and field diagnostics. A robust design makes the update path recoverable before deployment, not after failed units appear in service.
Project Objective
Produce an engineering update-and-rollback package for a microcontroller-based product. The final deliverable should answer:
- Which image is active, which image is staged, and which image is trusted?
- What happens if power is lost during transfer, validation, metadata commit, first boot, or rollback?
- How is image integrity checked before any jump to application code?
- How are calibration and configuration data preserved across update and rollback?
- Which outputs remain safe during bootloader, update, reset, and recovery states?
- Which tests prove that the product can recover from every credible interrupted-update state?
- What evidence is required before the firmware package is released to field service?
The deliverable is a design-review package. It should include memory layout, bootloader state logic, power-loss test matrix, acceptance criteria, and release records.
Baseline Scenario
An industrial measurement controller uses a microcontroller with internal flash, a serial service port, analog inputs, digital outputs, and a relay interlock. The controller is installed in equipment that may lose power without warning. Service technicians need a field-update path because calibration features and communication behavior are updated after deployment.
The first release used a single application image. If an update was interrupted, the bootloader could not always distinguish an incomplete image from a valid one. The redesign introduces dual application slots, transactional metadata, integrity checks, and a probationary first-boot period.
Design Requirements
| Requirement | Acceptance criterion |
|---|---|
| Existing firmware remains bootable during transfer | active image is not erased until staged image is valid |
| Corrupted candidate image is rejected | integrity check fails before activation |
| Power loss is recoverable | every tested interruption state ends in a deliberate boot of the active image, the staged image, or service mode |
| Calibration data are preserved | calibration checksum and version remain valid after update and rollback |
| Outputs remain safe | relay interlock and actuator outputs remain disabled in bootloader and update mode |
| Failed new firmware rolls back | watchdog or application confirmation failure returns to prior valid image |
| Field diagnosis is possible | device reports active version, previous version, reset cause, and update result |
The requirements separate update success from recovery success. A failed update is acceptable only if the device reaches a known safe state with diagnosable evidence.
Step 1: Define Flash Memory Layout
Use a dual-slot layout so the active image is preserved while the candidate image is received.
| Region | Size |
|---|---|
| bootloader and service monitor | 96 kB |
| application slot A | 384 kB |
| application slot B | 384 kB |
| configuration and calibration, dual copy | 32 kB |
| event log | 96 kB |
| diagnostic scratch and reserved area | 32 kB |
| total flash | 1024 kB |
Check that the layout fits: 96 + 384 + 384 + 32 + 96 + 32 = 1024 kB, which equals the total flash size.
The candidate firmware image is:
Slot size is 384 kB, the size of each application slot in the layout table.
Slot margin is:
Therefore:
Engineering Comment
The image fits, but the margin is not large. The project should record a growth limit or require a memory review when the image exceeds a threshold, for example 360 kB. Without a threshold, diagnostic code, security checks, or communication libraries can silently consume the reserve needed for safe future updates.
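The layout and growth-limit checks can be kept as a small script that runs whenever the memory map or image size changes. The sketch below uses the region sizes from the layout table and the example 360 kB review threshold; the function names and the image sizes in the usage note are hypothetical, chosen only for illustration.

```python
# Flash regions in kB, copied from the layout table.
REGIONS_KB = {
    "bootloader and service monitor": 96,
    "application slot A": 384,
    "application slot B": 384,
    "configuration and calibration, dual copy": 32,
    "event log": 96,
    "diagnostic scratch and reserved area": 32,
}
TOTAL_FLASH_KB = 1024
SLOT_KB = 384
REVIEW_THRESHOLD_KB = 360  # example growth threshold, not a fixed project value


def layout_fits(regions_kb, total_kb):
    """Return True when the regions fit within the total flash size."""
    return sum(regions_kb.values()) <= total_kb


def check_image_size(image_kb):
    """Classify an image size against the slot size and the review threshold."""
    if image_kb > SLOT_KB:
        return "reject: image does not fit the slot"
    if image_kb > REVIEW_THRESHOLD_KB:
        return "memory review required"
    return "ok"
```

With these definitions, `layout_fits(REGIONS_KB, TOTAL_FLASH_KB)` is true because the regions sum to exactly 1024 kB, and a hypothetical 370 kB image would return "memory review required" even though it still fits the slot.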
Step 2: Estimate Transfer Time and Exposure
The service link has an effective payload bandwidth of:
Transfer time is:
If verification and metadata checks require another 2.5 s, the update exposure time is approximately:
Engineering Comment
The product must tolerate power loss during this whole interval. The design should assume interruption is normal, not exceptional. A field update that only works when power is stable is not a reliable update mechanism.
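The exposure estimate is a division followed by an addition, so it is easy to recompute when the image grows or the link changes. The sketch below assumes a constant effective payload rate and uses the 2.5 s verification allowance as the default; the function name and the numbers in the usage note are hypothetical and are not the project's measured values.

```python
def update_exposure_s(image_bytes, payload_bytes_per_s, verify_s=2.5):
    """Return (transfer time, total exposure time) in seconds.

    Assumes a constant effective payload rate; retries and protocol stalls
    would lengthen the real exposure window.
    """
    if image_bytes < 0 or payload_bytes_per_s <= 0:
        raise ValueError("image size must be >= 0 and payload rate must be > 0")
    transfer_s = image_bytes / payload_bytes_per_s
    return transfer_s, transfer_s + verify_s
```

For a hypothetical 300 kB image (taking 1 kB as 1024 bytes) over a hypothetical 10 kB/s effective link, `update_exposure_s(300 * 1024, 10 * 1024)` returns `(30.0, 32.5)`, so the product would be exposed to power loss for roughly half a minute per update.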
Step 3: Define Update Metadata as a Transaction
The bootloader should not rely on a single writable flag such as new_image_ready. Metadata should make partial writes detectable.
Use two metadata copies with:
| Field | Purpose |
|---|---|
| slot identifier | states whether slot A or slot B is active |
| image size | prevents executing unbounded or incomplete content |
| image integrity value | rejects corrupted image data |
| image version | supports service diagnosis and compatibility checks |
| configuration format version | prevents incompatible data interpretation |
| boot attempt counter | supports probation and rollback |
| commit marker | distinguishes complete metadata from interrupted write |
| metadata integrity value | detects corrupted metadata |
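The bootloader accepts a metadata copy only if the record is complete and internally consistent. When both copies are valid, it uses the newer one. Deciding which copy is newer requires an ordering field, such as a monotonically increasing sequence counter, in addition to the fields listed above.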
Engineering Comment
Transactional metadata is more important than the exact field names. The engineering rule is that any interrupted write must look either like the old valid state or a rejected invalid state. It must not look like a valid command to boot unknown code.
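One way to make an interrupted write look like either the old state or a rejected state is to seal each record with a commit marker and an integrity value, then pick the newest copy that passes both checks. The sketch below is a host-side model under stated assumptions: the byte layout, the `COMMIT_MARKER` constant, the sequence counter, and CRC-32 as the integrity value are all illustrative choices, not the project's defined format.

```python
import struct
import zlib

# Hypothetical little-endian record body: sequence counter, active slot,
# image size, image CRC-32, image version, configuration format version,
# boot attempt counter, commit marker.
_BODY = struct.Struct("<IBIIIHBI")
COMMIT_MARKER = 0x434F4D54  # arbitrary constant chosen for this sketch


def encode(seq, slot, size, image_crc, version, cfg_format, attempts):
    """Pack one metadata record and append a CRC-32 over the body."""
    body = _BODY.pack(seq, slot, size, image_crc, version, cfg_format,
                      attempts, COMMIT_MARKER)
    return body + struct.pack("<I", zlib.crc32(body))


def decode(raw):
    """Return the record as a dict, or None if the copy cannot be trusted."""
    if len(raw) != _BODY.size + 4:
        return None
    body = raw[:_BODY.size]
    (stored_crc,) = struct.unpack("<I", raw[_BODY.size:])
    if zlib.crc32(body) != stored_crc:
        return None  # corrupted or partially written record
    seq, slot, size, image_crc, version, cfg_format, attempts, marker = \
        _BODY.unpack(body)
    if marker != COMMIT_MARKER or slot not in (0, 1):
        return None  # write never reached the commit step, or slot is bogus
    return {"seq": seq, "slot": slot, "size": size, "image_crc": image_crc,
            "version": version, "cfg_format": cfg_format,
            "attempts": attempts}


def select_metadata(copy_a, copy_b):
    """Pick the newest valid copy; None means enter service mode."""
    valid = [r for r in (decode(copy_a), decode(copy_b)) if r is not None]
    if not valid:
        return None
    return max(valid, key=lambda r: r["seq"])
```

If a write of the newer copy is cut off so that its tail still reads as erased flash (0xFF), `decode` rejects it and `select_metadata` falls back to the older copy. The sketch does not handle sequence-counter wrap-around, which a real design would need to address.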
Step 4: Check Brown-Out Hold-Up for Metadata Writes
Assume the controller has a local capacitor:
The brown-out supervisor detects falling supply at:
The minimum voltage for reliable flash programming is:
Available capacitor energy is:
During a flash metadata write, assume:
and average voltage:
Load power is:
Estimated hold-up time is:
Therefore:
If the metadata write time is:
then:
Engineering Comment
The hardware cannot guarantee that an in-progress metadata write finishes after power loss. The design must therefore use dual-copy metadata, commit markers, and brown-out blocking. The firmware should also refuse to start a critical write if supply voltage is already near the brown-out threshold.
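The hold-up check follows the usual capacitor energy relation: usable energy is 0.5 · C · (V_detect² − V_min²), load power is average voltage times load current, and hold-up time is energy divided by power. The sketch below assumes the average voltage is the midpoint between the detection threshold and the minimum programming voltage. All names and all numbers in the usage note are hypothetical placeholders, not the project's component values.

```python
def holdup_time_s(capacitance_f, v_detect, v_min, load_current_a):
    """Estimate hold-up time between brown-out detection and minimum voltage.

    Assumes a roughly constant load current and takes the average voltage
    as the midpoint of v_detect and v_min.
    """
    if not (v_detect > v_min > 0) or capacitance_f <= 0 or load_current_a <= 0:
        raise ValueError("expected v_detect > v_min > 0 and positive C and I")
    energy_j = 0.5 * capacitance_f * (v_detect ** 2 - v_min ** 2)
    v_avg = (v_detect + v_min) / 2
    power_w = v_avg * load_current_a
    return energy_j / power_w


def holdup_covers_write(holdup_s, write_s, margin=2.0):
    """True only if hold-up exceeds the write time by the given margin factor."""
    return holdup_s >= margin * write_s


def write_may_start(v_supply, v_brownout, guard_band_v):
    """Refuse to start a critical write when supply is near the threshold."""
    return v_supply >= v_brownout + guard_band_v
```

As an illustration with made-up values, a 470 µF capacitor, detection at 3.0 V, a 2.7 V programming minimum, and a 50 mA load give about 2.82 ms of hold-up. Against a hypothetical 5 ms write, `holdup_covers_write` returns false, which is the kind of result that forces reliance on dual-copy metadata rather than on hardware hold-up alone.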
Step 5: Define Bootloader Decision Logic
The bootloader should make a deterministic choice before any application code runs.
| Condition at boot | Bootloader action |
|---|---|
| active slot valid and no candidate pending | boot active slot |
| candidate fully received and integrity valid | mark candidate as pending confirmation, then boot candidate |
| candidate incomplete or integrity invalid | reject candidate and boot previous active slot |
| candidate booted but did not confirm before timeout | roll back to previous active slot |
| both application slots invalid | remain in service mode with outputs disabled |
| metadata copies disagree | use newest valid copy, otherwise enter service mode |
| configuration version incompatible with candidate | reject candidate unless migration test is approved |
The application must explicitly confirm that it has reached a safe operational state. Confirmation should occur only after initialization, configuration loading, basic self-test, and safe-output checks are complete.
Engineering Comment
Immediate activation is risky. A candidate image that starts executing is not necessarily safe. Probationary boot makes the new firmware prove that it can initialize, read configuration, supervise outputs, and communicate before it becomes permanent.
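The decision table can be expressed as a pure function, which makes it straightforward to review and to exercise against every row before it is ported to the bootloader. The sketch below is a host-side reference model, not bootloader code. The `BootState` fields, the `Action` names, and the `MAX_PROBATION_BOOTS` limit of 3 are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    BOOT_ACTIVE = "boot active slot"
    BOOT_CANDIDATE_PENDING = "mark candidate pending, boot candidate"
    ROLL_BACK = "roll back to previous active slot"
    SERVICE_MODE = "service mode, outputs disabled"


MAX_PROBATION_BOOTS = 3  # hypothetical boot-counter limit


@dataclass
class BootState:
    metadata_valid: bool                   # at least one metadata copy is valid
    active_valid: bool                     # previous active image passes integrity
    candidate_valid: bool = False          # fully received and integrity valid
    candidate_config_compatible: bool = True
    candidate_pending: bool = False        # booted before, not yet confirmed
    boot_attempts: int = 0


def decide(state):
    """Return the bootloader action for one boot, before any application runs."""
    if not state.metadata_valid:
        return Action.SERVICE_MODE
    if state.candidate_pending:
        # Candidate has booted but never confirmed; apply the boot-counter rule.
        if state.candidate_valid and state.boot_attempts < MAX_PROBATION_BOOTS:
            return Action.BOOT_CANDIDATE_PENDING
        return Action.ROLL_BACK if state.active_valid else Action.SERVICE_MODE
    if state.candidate_valid and state.candidate_config_compatible:
        return Action.BOOT_CANDIDATE_PENDING
    if state.active_valid:
        # No candidate, or candidate rejected: keep the known-good image.
        return Action.BOOT_ACTIVE
    return Action.SERVICE_MODE
```

For example, `decide(BootState(True, True, candidate_valid=True, candidate_pending=True, boot_attempts=3))` returns `Action.ROLL_BACK`, while the same state with both images invalid returns `Action.SERVICE_MODE`. Keeping the function free of side effects means the metadata writes that accompany each action can be tested separately.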
Step 6: Preserve Calibration and Configuration
Calibration data should not be rewritten casually during a firmware update. Treat them as controlled engineering data.
Minimum checks:
- calibration copy A and copy B both include version, length, and integrity value;
- the new firmware declares which calibration format versions it can read;
- any migration has a reversible or backed-up path;
- rollback can still interpret the stored configuration;
- service tools can report calibration version and update result.
If the new firmware requires a configuration migration that the old firmware cannot read, rollback may be blocked. In that case the release package must either make the migration backward compatible or treat rollback as a controlled service action, not an automatic promise.
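The rollback question reduces to a compatibility check between the configuration format left in storage after the update and the formats each image declares it can read. The sketch below models that check; the function name, the outcome strings, and the integer format versions are hypothetical and stand in for whatever the release package actually defines.

```python
REJECT = "reject_candidate"
AUTO = "automatic_rollback_ok"
SERVICE = "rollback_needs_service_action"


def rollback_policy(stored_format, target_format, candidate_reads, previous_reads):
    """Classify an update by its effect on configuration rollback.

    stored_format   -- format version currently on the device
    target_format   -- format version in storage after the update
                       (equal to stored_format when no migration occurs)
    candidate_reads -- set of format versions the new firmware can read
    previous_reads  -- set of format versions the old firmware can read
    """
    if target_format not in candidate_reads:
        return REJECT  # candidate could not read what it would leave behind
    if target_format != stored_format and stored_format not in candidate_reads:
        return REJECT  # candidate cannot read the data it would need to migrate
    if target_format in previous_reads:
        return AUTO  # old firmware still understands the stored data
    return SERVICE  # migration is not backward compatible
```

As an illustration, `rollback_policy(2, 3, {2, 3}, {1, 2})` returns `"rollback_needs_service_action"`: the candidate migrates format 2 to format 3, which the previous firmware cannot read, so automatic rollback cannot be promised.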
Step 7: Safe Outputs During Update
The update mode must not energize outputs by accident.
Use a conservative output contract:
| State | Output rule |
|---|---|
| reset | hardware defaults force relay and actuator enables off |
| bootloader | outputs remain disabled and are not configured as active drivers |
| receiving update | outputs remain disabled; measurement functions may be limited |
| first candidate boot | outputs remain disabled until self-test and interlock checks pass |
| rollback | outputs remain disabled until old firmware enters normal operational mode |
| service mode | outputs disabled; diagnostic communication allowed |
Engineering Comment
Safe update design should not rely only on application firmware. Reset defaults, pull-downs, interlocks, and power-stage enables must be reviewed because the bootloader and partially configured firmware run before the normal safety logic is active.
Step 8: Validation Matrix
Run interruption tests on representative hardware, not only in simulation.
| Test | Acceptance criterion |
|---|---|
| interrupt transfer at 0%, 25%, 50%, 90%, and 99% | previous active image boots and reports rejected candidate |
| corrupt one image block | bootloader rejects candidate before activation |
| power loss during metadata commit | bootloader selects old valid metadata or service mode, never unknown image |
| power loss during first candidate boot | candidate remains pending or rolls back according to boot counter rule |
| candidate watchdog reset during probation | old image boots and reset cause is recorded |
| incompatible configuration format | candidate is rejected or migration evidence is attached |
| calibration copy A corrupted | copy B is used and fault is logged |
| both image slots invalid | service mode starts with outputs disabled |
| update with high bus traffic | transfer retry logic does not starve critical diagnostics |
| brown-out near flash write | critical writes are blocked or recoverable |
The test log should record firmware versions, hardware revision, power supply setup, interruption timing, reset cause, selected boot path, active slot, and output state.
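A fixed record structure keeps the test log complete and makes the pass/fail rule for each row explicit. The sketch below covers only the interrupted-transfer row: it converts the listed percentages into byte offsets and evaluates one logged result. The class name, field names, and the string values such as "previous_active" are hypothetical; the fields mirror the log items listed above.

```python
from dataclasses import dataclass

INTERRUPT_PERCENTS = (0, 25, 50, 90, 99)


def interruption_offsets(image_bytes):
    """Map each interruption percentage to a byte offset in the transfer."""
    return {p: image_bytes * p // 100 for p in INTERRUPT_PERCENTS}


@dataclass(frozen=True)
class InterruptionResult:
    active_version: str
    candidate_version: str
    hardware_revision: str
    supply_setup: str
    interrupt_percent: int
    reset_cause: str
    boot_path: str                    # e.g. "previous_active", "candidate", "service"
    active_slot: str
    outputs_disabled: bool
    candidate_reported_rejected: bool


def transfer_interruption_passes(result, expected_slot):
    """Acceptance rule for the interrupted-transfer row of the matrix."""
    return (result.boot_path == "previous_active"
            and result.active_slot == expected_slot
            and result.outputs_disabled
            and result.candidate_reported_rejected)
```

For a 1000-byte stand-in image, `interruption_offsets(1000)` yields offsets 0, 250, 500, 900, and 990. A result that boots the previous image but leaves outputs enabled fails the rule, which keeps output state from being overlooked in a test that otherwise appears to recover.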
Step 9: Risk Review
Initial failure mode:
| Failure mode | Cause | Effect | Initial rating |
|---|---|---|---|
| incomplete firmware image becomes active | interrupted update plus weak metadata | device does not boot, service visit required, outputs may be ambiguous | S=9, O=3, D=6 |
Initial risk priority number: RPN = S × O × D = 9 × 3 × 6 = 162.
After redesign:
| Control | Effect |
|---|---|
| dual application slots | old image remains available during transfer |
| image integrity check | corrupted candidate is rejected |
| transactional metadata | interrupted state is detectable |
| probationary first boot | failed candidate rolls back |
| safe hardware output defaults | update and boot states do not energize actuators |
| interruption validation matrix | recovery behavior is demonstrated |
Residual rating:
Engineering Comment
Severity remains high because a bricked or unsafe controller is still a serious failure. The design reduces occurrence and improves detection. That is the right interpretation of the risk reduction.
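The risk priority number is the product of the severity, occurrence, and detection ratings, so the before-and-after comparison can be checked mechanically. The sketch below assumes the common 1 to 10 rating scale. The initial rating comes from the failure-mode table; the residual rating in the usage note is a hypothetical example, not the project's assessed value.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number as the product of the three ratings.

    Assumes each rating is an integer on a 1 to 10 scale.
    """
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10")
    return severity * occurrence * detection
```

The initial rating gives `rpn(9, 3, 6) == 162`. If the redesign were judged to lower occurrence to 1 and detection to 2 while severity stays at 9 (hypothetical figures), the residual value would be `rpn(9, 1, 2) == 18`, with the whole reduction coming from occurrence and detection.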
Final Release Package
The project is complete when the release package contains:
- memory map with slot sizes and growth threshold;
- bootloader state diagram or decision table;
- metadata format and dual-copy recovery rule;
- image integrity and compatibility checks;
- calibration and configuration migration policy;
- safe-output evidence for reset, bootloader, update, rollback, and service modes;
- power-loss and interrupted-transfer test report;
- rollback test report after candidate failure;
- field diagnostic fields and service-tool screenshots or logs;
- release decision with known limits and update instructions.
The release decision should not say only that the new firmware works. It should say that the product remains recoverable when the new firmware, update transfer, power supply, or metadata write fails.
Common Engineering Mistakes
Common mistakes include testing only successful updates, erasing the old image too early, using a single update-complete flag, ignoring configuration rollback, and assuming a watchdog reset is a complete recovery plan.
Other mistakes include validating the update path on a development board instead of production hardware, leaving outputs undefined in the bootloader, forgetting brown-out behavior during flash writes, omitting update status from field logs, and releasing a new firmware image without proving how to return to the old one.
The practical lesson is that firmware update is part of product reliability. A maintainable embedded product can accept new code, reject bad code, recover from interruption, preserve controlled data, and explain its own update state to the engineers who must support it.