Glossary term
Livelock
Engineering definition of livelock covering active execution without progress, retry loops, rollback oscillation, backoff, progress metrics and validation evidence.
Definition
A livelock is a condition in which tasks, processes or services keep executing and reacting, but their repeated actions prevent useful progress.
Livelock appears in concurrent software, distributed systems, embedded controllers, retry loops, conflict-resolution protocols and resource arbitration when actors continuously yield, retry, roll back, reschedule or compensate in response to each other. A useful analysis states the progress metric, repeated state transition, retry or rollback rule, collision pattern, backoff policy, timeout behavior, recovery action and validation evidence.
Unlike deadlock, the actors in a livelock are not simply blocked. They are active, consuming CPU, network, bus, queue or control-loop capacity while the desired state is never reached.
Livelock appears in concurrent algorithms, distributed services, retry loops, conflict-resolution protocols, embedded recovery routines and resource arbitration. It is especially easy to miss because dashboards can show activity, log volume and retry attempts while the useful completion rate is near zero.
Activity Without Progress
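Let the number of attempts during an observation window be:
$$N_{\text{attempt}}$$
and the number of completed useful operations be:
$$N_{\text{complete}}$$
A simple progress efficiency is:
$$\eta = \frac{N_{\text{complete}}}{N_{\text{attempt}}}$$
For a healthy system, attempts should eventually create completions. Livelock risk is present when:
$$N_{\text{attempt}} \gg 0$$
but:
$$N_{\text{complete}} \approx 0$$
or:
$$\eta \ll 1$$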
The progress metric must be tied to the engineering objective. A service may count committed requests. A controller may count successful state transitions. A distributed protocol may count elected leaders, accepted writes or completed consensus rounds.
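The ratio of completions to attempts can be computed per observation window. The following is a minimal sketch, assuming the system already exports per-window attempt and completion counters; `WindowCounts`, `livelock_suspected` and both thresholds are hypothetical names and values chosen for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Counters for one observation window (hypothetical structure)."""
    attempts: int
    completions: int


def progress_efficiency(w: WindowCounts) -> float:
    """Completed useful operations divided by attempts; 0.0 for an idle window."""
    if w.attempts == 0:
        return 0.0
    return w.completions / w.attempts


def livelock_suspected(w: WindowCounts, min_attempts: int = 100,
                       min_efficiency: float = 0.05) -> bool:
    """Flag windows with plenty of activity but almost no useful completions.

    Both thresholds are illustrative assumptions and must be derived from the
    engineering objective of the real system.
    """
    return w.attempts >= min_attempts and progress_efficiency(w) < min_efficiency
```

An idle window is deliberately not flagged: the check looks for activity without completion, not for the absence of activity.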
Repeated State Transitions
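Many livelocks are visible as oscillation among a small set of states:
$$S_1 \rightarrow S_2 \rightarrow S_1 \rightarrow S_2 \rightarrow \cdots$$
The system is changing state, but the sequence does not reach the target:
$$S_k \neq S_{\text{target}} \quad \text{for all observed } k$$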
Common causes include symmetric collision handling, both peers backing off in the same way, transactions repeatedly aborting each other, two controllers compensating for each other, or recovery logic that immediately recreates the triggering condition.
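Oscillation of this kind can be checked on a recorded state trace. The sketch below is one possible heuristic, assuming states are logged as hashable labels; the function name, the window length and the limit on distinct states are assumptions for illustration.

```python
from typing import Hashable, Sequence


def oscillating_without_target(trace: Sequence[Hashable], target: Hashable,
                               window: int = 20, max_distinct: int = 3) -> bool:
    """Return True when the last `window` states keep changing, stay within a
    small set of states, and never include the target state."""
    recent = list(trace[-window:])
    if len(recent) < window or target in recent:
        return False
    # Count transitions that actually change state (activity, not blocking).
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    distinct = len(set(recent))
    # Require real activity (at least half the steps change state) within few states.
    return distinct <= max_distinct and changes >= window // 2
```

A trace that sits in a single waiting state is not flagged, because it shows no transitions; that pattern points toward blocking rather than livelock.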
Retry and Rollback Loops
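If attempts are:
$$N_{\text{attempt}}$$
and aborts or rollbacks are:
$$N_{\text{abort}}$$
then the abort ratio is:
$$R_{\text{abort}} = \frac{N_{\text{abort}}}{N_{\text{attempt}}}$$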
A high abort ratio is not always a livelock, but it becomes suspicious when combined with low completion rate and repeated conflict causes. Retry loops can create livelock when each actor sees the same failure, retries at the same time and collides again.
Retry budgets, idempotency, cancellation propagation and circuit breakers are useful only when they change the loop. A retry policy that repeats the same collision at higher speed is an amplifier, not a recovery design.
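A bounded retry loop that also records the abort ratio might look like the sketch below. `ConflictAbort`, `RetryStats` and `run_with_budget` are hypothetical names. Passing the attempt index to the operation is an assumption about how the caller could change behavior between tries.

```python
from dataclasses import dataclass
from typing import Callable


class ConflictAbort(Exception):
    """Hypothetical exception raised when an attempt is rolled back on conflict."""


@dataclass
class RetryStats:
    attempts: int = 0
    aborts: int = 0

    @property
    def abort_ratio(self) -> float:
        """Aborts divided by attempts; 0.0 before the first attempt."""
        return self.aborts / self.attempts if self.attempts else 0.0


def run_with_budget(operation: Callable[[int], object], max_attempts: int,
                    stats: RetryStats):
    """Call operation(attempt_index) until it succeeds or the budget is spent.

    The attempt index lets the operation alter its behavior between tries
    (for example, a different acquisition order), so the loop is not a pure
    repeat of the same collision. Returns (succeeded, result).
    """
    for attempt in range(max_attempts):
        stats.attempts += 1
        try:
            return True, operation(attempt)
        except ConflictAbort:
            stats.aborts += 1
    # Budget exhausted: the caller should escalate instead of retrying forever.
    return False, None
```

The budget only bounds the loop. Whether it prevents livelock still depends on the operation doing something different on later attempts.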
Backoff and Asymmetry
Livelock mitigation often adds asymmetry. Randomized backoff, priority tie-breakers, leader election, single-owner queues, ordered resource acquisition or bounded retries can break repeated symmetry.
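For two actors with retry period:
$$T_{\text{retry}}$$
and jitter range:
$$\Delta T_{\text{jitter}}$$
the design should avoid deterministic alignment:
$$t_{\text{retry},A} = t_{\text{retry},B}$$
when synchronized retries caused the livelock. A useful release check measures escape time:
$$T_{\text{escape}}$$
and requires:
$$T_{\text{escape}} \le T_{\text{allowed}}$$
where $T_{\text{allowed}}$ comes from service, safety or recovery requirements.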
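The effect of jitter on escape time can be explored with a toy simulation. The sketch below assumes two actors that collide whenever their retry times fall within a fixed collision window and that re-align after every collision. The model, the function names and the seeds are illustrative assumptions, not a description of any particular protocol.

```python
import random


def retry_delay(period: float, jitter: float, rng: random.Random) -> float:
    """Base retry period plus a uniform random offset in [0, jitter]."""
    return period + rng.uniform(0.0, jitter)


def escape_time(period: float, jitter: float, collision_window: float,
                seed: int = 0, max_rounds: int = 1000):
    """Simulate two actors and return the time at which one first retries
    without colliding, or None if they never separate within max_rounds."""
    rng_a, rng_b = random.Random(seed), random.Random(seed + 1)
    t_a = t_b = 0.0  # Both actors start aligned after the initial collision.
    for _ in range(max_rounds):
        t_a += retry_delay(period, jitter, rng_a)
        t_b += retry_delay(period, jitter, rng_b)
        if abs(t_a - t_b) > collision_window:
            return min(t_a, t_b)
        # Collision: both abort and, in this toy model, re-align at the later time.
        t_a = t_b = max(t_a, t_b)
    return None
```

With zero jitter the two actors stay aligned and `escape_time(1.0, 0.0, 0.01)` returns `None`. With a nonzero jitter range the returned time can be compared against the allowed escape time in a release check.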
Worked Example
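Two workers process the same conflicting record. During a 20 second fault-injection run, telemetry records:
$$N_{\text{attempt}}$$
attempts and:
$$N_{\text{complete}}$$
successful commits. The progress efficiency is:
$$\eta = \frac{N_{\text{complete}}}{N_{\text{attempt}}} = 0.025$$
or 2.5 percent. Abort logs show:
$$N_{\text{abort}}$$
so:
$$R_{\text{abort}} = \frac{N_{\text{abort}}}{N_{\text{attempt}}} = 0.942$$
or 94.2 percent. The workers are active, but nearly all useful work is cancelled by repeated conflict.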
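After adding randomized backoff and a deterministic tie-breaker, the same test records:
$$N'_{\text{attempt}}$$
and:
$$N'_{\text{complete}}$$
The new progress efficiency is:
$$\eta' = \frac{N'_{\text{complete}}}{N'_{\text{attempt}}}$$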
The mitigation is credible only if traces show that the repeated conflict state disappeared, not merely that the test ran at a lower load.
Detection and Recovery
Detection should track useful completion, not just activity. Useful evidence includes retry count, abort cause, state-transition sequence, conflict key, queue age, lock owner, timeout reason, cancellation path, CPU usage, network traffic, watchdog behavior and final state.
Recovery can include randomized backoff, deterministic ordering, leader election, fencing, retry caps, fail-fast behavior, circuit breaking, dead-letter routing, single-writer ownership or safe-state transition. The recovery must preserve consistency. Breaking a livelock by cancelling both actors may stop the loop but still lose the command, duplicate work or leave equipment in the wrong state.
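Detection based on useful completion rather than activity can be sketched as a small progress watchdog. The class below is a hypothetical illustration; timestamps are passed in explicitly so the logic can be tested without a real clock, and the stall threshold is an assumed parameter.

```python
class ProgressWatchdog:
    """Trips when work keeps being attempted but nothing completes in time."""

    def __init__(self, max_stall_s: float, start: float = 0.0):
        self.max_stall_s = max_stall_s
        self.last_completion = start
        self.attempts_since_completion = 0

    def record_attempt(self) -> None:
        """Count one attempt made since the last useful completion."""
        self.attempts_since_completion += 1

    def record_completion(self, now: float) -> None:
        """Reset the stall timer when a useful operation completes."""
        self.last_completion = now
        self.attempts_since_completion = 0

    def stalled(self, now: float) -> bool:
        """True only for activity without completion beyond the threshold."""
        # Idle systems are not flagged: a stall needs attempts without completion.
        active = self.attempts_since_completion > 0
        return active and (now - self.last_completion) > self.max_stall_s
```

When the watchdog trips, the caller should capture the retry count, abort causes and recent state transitions before taking any recovery action, so that the evidence of the loop is preserved.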
Relationship To Neighbor Terms
Deadlock is waiting without progress. Livelock is action without useful progress. Task starvation is a runnable task missing service while other work progresses. Lock contention is ordinary waiting for a lock that eventually releases. A retry storm is excessive retry traffic; it can cause or accompany livelock, but livelock specifically requires repeated active behavior that prevents completion.
In real systems these patterns can combine. A service can enter a retry storm, saturate a thread pool, starve low-priority work and then livelock on repeated rollback. The diagnosis should separate the first trigger from the sustaining loop.
Common Mistakes
The most common mistake is treating high activity as health. Another is proving that no thread is blocked and assuming progress is guaranteed. A third is adding retries without a budget, backoff or state change. A fourth is using a watchdog reset as the only recovery and losing the evidence needed to identify the loop.
A strong livelock review states the progress metric, repeated transition, collision rule, retry or rollback policy, mitigation, maximum escape time and validation evidence.