
CAN Bus Arbitration Latency Deadline Miss Case Study


This case study analyzes an embedded control system that missed a real-time communication deadline after a firmware update added high-priority diagnostic traffic to a shared CAN bus. The control software still executed on time, but the command frame waited too long for bus access because arbitration favored other messages.

The case is useful because shared buses are often reviewed only for average utilization. Real-time systems also need analysis of worst-case response time, arbitration priority, blocking, bursts, and error recovery, backed by validation evidence. A bus with acceptable average load can still miss a hard deadline.

Case Summary

| Item | Engineering relevance |
| --- | --- |
| System | Distributed embedded controller with a shared CAN bus |
| Bus rate | 500\ \text{kbit/s} |
| Critical message | Actuator command frame with 2\ \text{ms} deadline |
| Trigger | Firmware update added frequent high-priority diagnostic frames |
| Symptom | Actuator command occasionally arrived late during diagnostic mode |
| Root cause | Arbitration priority and diagnostic period were not included in worst-case bus timing analysis |
| Corrective action | Lower diagnostic priority, rate-limit diagnostics, define service mode, and validate worst-case response time |

The example uses a simplified CAN timing model. Real analysis should use exact frame format, identifier length, bit stuffing bound, error frames, retransmission behavior, oscillator tolerance, bus physical layer, transceiver delay, gateway behavior, and safety requirements.

Field Data

The bus carries a critical actuator command and several periodic messages.

| Message | Identifier priority | Period | Frame time | Deadline |
| --- | --- | --- | --- | --- |
| safety heartbeat | higher than command | 10\ \text{ms} | 0.27\ \text{ms} | 10\ \text{ms} |
| inverter status | higher than command | 5\ \text{ms} | 0.27\ \text{ms} | 5\ \text{ms} |
| diagnostic stream after update | higher than command | 0.5\ \text{ms} | 0.27\ \text{ms} | service-mode only |
| actuator command | target message | 10\ \text{ms} | 0.27\ \text{ms} | 2\ \text{ms} |
| lower-priority telemetry | lower than command | mixed | 0.27\ \text{ms} | noncritical |

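CAN arbitration is non-destructive: when several nodes start transmitting together, the frame with the highest-priority identifier (the numerically lowest one) wins bus access and continues undamaged, while the others back off and retry. A lower-priority frame already in transmission cannot be pre-empted, so one lower-priority frame can block a higher-priority frame until it finishes.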

Step 1: Estimate Frame Transmission Time

Use a conservative frame length including arbitration, control, data, CRC, acknowledgement, inter-frame space, and bit-stuffing allowance:

L_f=135\ \text{bit}

Bus rate:

R_b=500000\ \text{bit/s}

Frame transmission time:

\displaystyle C=\frac{L_f}{R_b}
\displaystyle C=\frac{135}{500000}=0.000270\ \text{s}=0.27\ \text{ms}

Engineering Comment

The frame time is not only the payload size divided by bit rate. Protocol overhead and bit stuffing matter. For release analysis, use the exact frame type and a justified worst-case bound.
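As a hedged illustration of where a figure such as 135 bits can come from, the sketch below applies a commonly used worst-case bound for a classical CAN data frame: fixed protocol bits plus the maximum number of stuff bits. The source does not state the frame format or payload size, so the 11-bit identifier and 8-byte payload are assumptions, and the function names are illustrative.

```python
def worst_case_frame_bits(data_bytes: int, extended_id: bool = False) -> int:
    """Upper bound on bits on the wire for one classical CAN data frame.

    Assumes the widely used bound: g + 8n + 13 fixed bits (the 13 includes
    the 3-bit inter-frame space) plus floor((g + 8n - 1) / 4) stuff bits,
    with g = 34 for an 11-bit identifier and g = 54 for a 29-bit identifier.
    """
    if not 0 <= data_bytes <= 8:
        raise ValueError("classical CAN carries 0 to 8 data bytes")
    g = 54 if extended_id else 34
    stuffable = g + 8 * data_bytes          # bits exposed to bit stuffing
    return stuffable + 13 + (stuffable - 1) // 4


def frame_time_us(frame_bits: int, bit_rate: int) -> float:
    """Transmission time in microseconds at bit_rate (bit/s)."""
    return frame_bits * 1_000_000 / bit_rate


bits = worst_case_frame_bits(8)             # assumed: 8 data bytes, 11-bit ID
print(bits, frame_time_us(bits, 500_000))   # 135 270.0
```

Under these assumptions the bound reproduces L_f=135 bits and C=0.27 ms. A 29-bit identifier or a different payload length changes the bound, which is why the exact frame type matters for release analysis.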

Step 2: Check Bus Utilization

For a periodic message:

\displaystyle U_i=\frac{C_i}{T_i}

Heartbeat utilization:

\displaystyle U_h=\frac{0.27}{10}=2.7\%

Inverter status utilization:

\displaystyle U_s=\frac{0.27}{5}=5.4\%

Diagnostic utilization:

\displaystyle U_d=\frac{0.27}{0.5}=54\%

Actuator command utilization:

\displaystyle U_c=\frac{0.27}{10}=2.7\%

Subtotal for these messages:

U_{subtotal}=2.7+5.4+54+2.7=64.8\%

Lower-priority telemetry and error recovery add more load on top of this subtotal, so the actual bus load in diagnostic mode is higher than 64.8\%.

Engineering Comment

The utilization is high but not above 100\%. That alone does not prove schedulability. The deadline miss comes from priority and phasing, not only total load.
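The utilization arithmetic above can be reproduced with a short sketch. The periods and the 0.27 ms frame time come from the message table; the function and variable names are illustrative only.

```python
def utilization(frame_time_ms: float, periods_ms: dict) -> dict:
    """Per-message bus utilization U_i = C_i / T_i for periodic frames."""
    return {name: frame_time_ms / t for name, t in periods_ms.items()}


# Periods in milliseconds for the diagnostic-mode message set.
periods_ms = {
    "safety heartbeat": 10.0,
    "inverter status": 5.0,
    "diagnostic stream": 0.5,
    "actuator command": 10.0,
}

per_message = utilization(0.27, periods_ms)
subtotal = sum(per_message.values())

for name, u in per_message.items():
    print(f"{name:18s} {u:7.2%}")
print(f"{'subtotal':18s} {subtotal:7.2%}")
```

The diagnostic stream alone accounts for 54 of the 64.8 percentage points, yet the total stays under 100\%, which is exactly why a utilization check by itself does not reveal the problem.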

Step 3: Calculate Response Time Without Diagnostic Stream

For a fixed-priority, non-preemptive bus, a first-pass screening estimate of the command frame's response time is the smallest R_i that satisfies:

\displaystyle R_i=C_i+B_i+\sum_{j\in hp(i)}\left\lceil\frac{R_i}{T_j}\right\rceil C_j

where:

  • C_i is command frame time;
  • B_i is blocking by one lower-priority frame already on the bus;
  • hp(i) is the set of higher-priority messages;
  • T_j is the period of higher-priority message j.

Use:

C_i=0.27\ \text{ms}

and one lower-priority blocking frame:

B_i=0.27\ \text{ms}

Without the diagnostic stream, higher-priority messages are heartbeat and inverter status.

First iteration:

\displaystyle R_i=0.27+0.27+\left\lceil\frac{0.54}{10}\right\rceil0.27+\left\lceil\frac{0.54}{5}\right\rceil0.27
R_i=0.54+0.27+0.27=1.08\ \text{ms}

Repeating with R_i=1.08\ \text{ms} gives the same interference counts:

R_i=1.08\ \text{ms}

The command deadline is:

D_i=2.0\ \text{ms}

So the original configuration passes:

1.08<2.0

Engineering Comment

Before the firmware update, the bus had enough arbitration margin for the command frame. This is why the issue did not appear in earlier bench tests.
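The fixed-point iteration in this step can be sketched in a few lines. This is a minimal sketch of the simplified recurrence used in this case study, not a full CAN schedulability analysis; times are held as integer microseconds to avoid floating-point rounding in the ceiling terms, and the function names and the `limit_us` guard are assumptions added for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def response_time_us(c_us, blocking_us, higher_priority, limit_us=100_000):
    """Smallest fixed point of R = C + B + sum(ceil(R / T_j) * C_j).

    higher_priority is a list of (period_us, frame_time_us) pairs.
    Returns None if the iteration passes limit_us without converging.
    """
    r = c_us + blocking_us
    while r <= limit_us:
        r_next = c_us + blocking_us + sum(
            ceil_div(r, t_us) * cj_us for t_us, cj_us in higher_priority
        )
        if r_next == r:
            return r
        r = r_next
    return None


C_US = 270                                           # 0.27 ms frame time
hp_before_update = [(10_000, C_US), (5_000, C_US)]   # heartbeat, inverter status
r = response_time_us(C_US, C_US, hp_before_update)
print(r, r <= 2_000)                                 # 1080 True
```

The iteration starts at C_i + B_i and only ever increases, so the first value that repeats is the smallest solution of the recurrence.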

Step 4: Calculate Response Time With Diagnostic Stream

After the update, the diagnostic stream has higher priority than the command, with period:

T_d=0.5\ \text{ms}

Add its interference term:

\displaystyle R_i=0.27+0.27+\left\lceil\frac{R_i}{10}\right\rceil0.27+\left\lceil\frac{R_i}{5}\right\rceil0.27+\left\lceil\frac{R_i}{0.5}\right\rceil0.27

Start with:

R_i=0.54\ \text{ms}

First evaluation:

\displaystyle R_i=0.54+0.27+0.27+\left\lceil\frac{0.54}{0.5}\right\rceil0.27
R_i=1.08+2(0.27)=1.62\ \text{ms}

Second evaluation:

\displaystyle R_i=1.08+\left\lceil\frac{1.62}{0.5}\right\rceil0.27
R_i=1.08+4(0.27)=2.16\ \text{ms}

Third evaluation:

\displaystyle R_i=1.08+\left\lceil\frac{2.16}{0.5}\right\rceil0.27
R_i=1.08+5(0.27)=2.43\ \text{ms}

A fourth evaluation gives the same interference counts, so the iteration has converged at:

R_i=2.43\ \text{ms}

The deadline is:

D_i=2.0\ \text{ms}

Therefore:

R_i>D_i

The command can miss its deadline during diagnostic mode.

Engineering Comment

The result explains the field symptom. The command task did not necessarily run late. The command frame became late after it was ready because higher-priority diagnostic traffic repeatedly won arbitration.
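To make the successive evaluations visible, the sketch below records every value the iteration visits once the diagnostic stream joins the higher-priority set. It applies the same simplified recurrence as above, works in integer microseconds, and uses illustrative names; the `limit_us` guard is an assumption added so the loop terminates even for an overloaded message set.

```python
def iterates_us(c_us, blocking_us, higher_priority, limit_us=100_000):
    """Return every R visited by R = C + B + sum(ceil(R / T_j) * C_j).

    Stops when the value repeats (converged) or exceeds limit_us.
    """
    r = c_us + blocking_us
    trace = [r]
    while r <= limit_us:
        r_next = c_us + blocking_us + sum(
            -(-r // t_us) * cj_us for t_us, cj_us in higher_priority
        )
        if r_next == r:
            break
        r = r_next
        trace.append(r)
    return trace


C_US = 270
DEADLINE_US = 2_000
hp_after_update = [
    (10_000, C_US),   # safety heartbeat
    (5_000, C_US),    # inverter status
    (500, C_US),      # diagnostic stream added by the update
]

trace = iterates_us(C_US, C_US, hp_after_update)
print(trace)                        # [540, 1620, 2160, 2430]
print(trace[-1] > DEADLINE_US)      # True: the command can miss its deadline
```

Each pass lets more diagnostic frames fit inside the growing window, which is why the estimate climbs from 1.62 ms to 2.43 ms before it settles.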

Step 5: Identify the Design Error

The diagnostic firmware update made two unsafe assumptions:

  1. diagnostic traffic was treated as harmless because it was “only messages”;
  2. priority identifiers were assigned for convenience, not deadline consequence.

The update created a priority inversion at the bus level. Noncritical diagnostic traffic had higher arbitration priority than a time-critical actuator command.

This is not the same as CPU priority inversion, but the engineering pattern is similar: a less important activity delayed a more important deadline because the shared resource policy was wrong.

Step 6: Correct the Message Set

The corrected design moved diagnostic frames to lower priority than control frames and lengthened the diagnostic period in normal operation, which limits the diagnostic frame rate.

New diagnostic period:

T_d=20\ \text{ms}

New diagnostic priority: lower than the actuator command, so it does not appear in hp(i) for the command response-time calculation.

Command response time returns to:

R_i=1.08\ \text{ms}

If a service mode requires faster diagnostics, the mode must explicitly relax the actuator command requirement, inhibit active control, or run with a separate validation case.

Engineering Comment

The key correction is not only reducing bus load. It is aligning arbitration priority with real-time consequence. Critical control frames must not wait behind noncritical diagnostics.
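The effect of the priority change can be sketched by deriving the higher-priority set and the blocking term from an explicit arbitration order. The `rank` values are hypothetical (only the position relative to the actuator command comes from the source), the telemetry period is a placeholder because the table lists it as "mixed", and the recurrence is the simplified one used in this case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    rank: int          # hypothetical ordering: smaller rank wins arbitration
    period_us: int
    frame_us: int = 270


def response_us(messages, target_name, limit_us=100_000):
    """Simplified response time of one message, in microseconds."""
    target = next(m for m in messages if m.name == target_name)
    hp = [m for m in messages if m.rank < target.rank]
    lp = [m for m in messages if m.rank > target.rank]
    # At most one lower-priority frame can already be on the bus.
    blocking = max((m.frame_us for m in lp), default=0)
    r = target.frame_us + blocking
    while r <= limit_us:
        r_next = target.frame_us + blocking + sum(
            -(-r // m.period_us) * m.frame_us for m in hp
        )
        if r_next == r:
            return r
        r = r_next
    return None


after_update = [
    Message("safety heartbeat", 0, 10_000),
    Message("inverter status", 1, 5_000),
    Message("diagnostic stream", 2, 500),
    Message("actuator command", 3, 10_000),
    Message("telemetry", 4, 100_000),        # placeholder period
]
corrected = [
    Message("safety heartbeat", 0, 10_000),
    Message("inverter status", 1, 5_000),
    Message("actuator command", 2, 10_000),
    Message("diagnostic stream", 3, 20_000),
    Message("telemetry", 4, 100_000),        # placeholder period
]

print(response_us(after_update, "actuator command"))   # 2430
print(response_us(corrected, "actuator command"))      # 1080
```

Once the diagnostic stream sits below the command, it can contribute only a single blocking frame, no matter how often it is sent. In this simplified model the priority change alone restores the 1.08 ms response time; the longer period matters for bus load and for lower-priority traffic.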

Step 7: Check Remaining Bus Load

In normal operation after correction:

\displaystyle U_d=\frac{0.27}{20}=1.35\%

Subtotal utilization becomes:

U_{subtotal}=2.7+5.4+1.35+2.7=12.15\%

This leaves capacity for lower-priority telemetry, retransmissions, and diagnostic bursts under controlled mode rules.

Engineering Comment

Low average utilization is helpful, but it is still not a full proof. Worst-case response time, error frames, gateways, interrupt service time, receive queue depth, and fault recovery still need validation.

Corrective Actions

The accepted corrective actions were:

  1. reserve highest arbitration priority for safety and control deadlines;
  2. move diagnostic and logging messages below control messages;
  3. rate-limit diagnostics during active control;
  4. create a service mode for high-rate diagnostics with explicit operating constraints;
  5. add bus response-time analysis to firmware release review;
  6. measure actual frame timing with a bus analyzer;
  7. test under maximum periodic load, diagnostic load, and error-recovery cases;
  8. monitor receive-queue high-water marks and dropped-frame counters;
  9. require rollback if bus deadline evidence is absent after a firmware change.

Validation Evidence

The corrected release should include:

  • message database with identifier, period, deadline, frame length, and owner;
  • worst-case response-time calculation for every hard-deadline frame;
  • measured bus load in normal, startup, diagnostic, degraded, and fault-recovery modes;
  • bus analyzer trace proving command frame latency below 2\ \text{ms};
  • receive-queue occupancy and interrupt-load measurements;
  • electromagnetic-interference and error-frame test results where relevant;
  • bus-off recovery test;
  • firmware configuration record matching the tested message set;
  • regression test that fails if diagnostics regain higher priority than control.
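The last item, a regression test on arbitration priority, might look like the sketch below. The message names follow the case study, but the CAN identifiers, the `hard_deadline` flags, and the data layout are hypothetical; a real check would read them from the project's message database. The only protocol fact relied on is that a numerically lower identifier wins arbitration.

```python
def priority_violations(messages):
    """Return (noncritical, critical) name pairs where the noncritical
    frame has the numerically lower identifier and so wins arbitration."""
    critical = [m for m in messages if m["hard_deadline"]]
    noncritical = [m for m in messages if not m["hard_deadline"]]
    return [
        (n["name"], c["name"])
        for n in noncritical
        for c in critical
        if n["can_id"] < c["can_id"]
    ]


# Hypothetical identifiers for illustration only.
corrected = [
    {"name": "safety heartbeat", "can_id": 0x010, "hard_deadline": True},
    {"name": "inverter status", "can_id": 0x020, "hard_deadline": True},
    {"name": "actuator command", "can_id": 0x100, "hard_deadline": True},
    {"name": "diagnostic stream", "can_id": 0x200, "hard_deadline": False},
    {"name": "telemetry", "can_id": 0x300, "hard_deadline": False},
]

# Same set with the diagnostic identifier moved above the command.
regressed = [
    dict(m, can_id=0x030) if m["name"] == "diagnostic stream" else m
    for m in corrected
]

print(priority_violations(corrected))   # []
print(priority_violations(regressed))   # [('diagnostic stream', 'actuator command')]
```

Running a check like this in the release pipeline turns the priority rule into a gate rather than a review comment.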

Final Decision

The defensible engineering decision was:

Do not release the diagnostic firmware update until arbitration priority, diagnostic rate limiting, bus response-time analysis, and measured bus traces prove the actuator command deadline.

The main lesson is that a real-time data bus is a scheduled resource. Bandwidth, arbitration priority, frame length, burst behavior, and error recovery must be treated as part of the timing budget, not as an implementation detail after the control software is complete.
