Glossary term
Lock Convoy
Engineering definition of lock convoy covering mutex handoff, wakeup overhead, scheduler preemption, persistent wait queues, throughput loss and validation evidence.
Definition
A lock convoy is a performance failure in which threads or tasks repeatedly queue behind a contended lock, and lock release plus scheduler handoff preserves a persistent line of waiters.
Lock convoys appear in operating systems, concurrent services, embedded gateways and thread-pool software when a frequently used mutex combines with blocking wakeups, preemption, long critical sections, unfair handoff or bursty arrival. A useful analysis states the lock hold time, handoff delay, waiter count, queue persistence, context-switch cost, owner preemption, wakeup policy, throughput loss and validation evidence.
In a convoy, the lock is not only busy. The system has fallen into a pattern where each release feeds the next waiting thread slowly enough that the queue does not drain.
Lock convoys appear in operating systems, concurrent services, embedded gateways, database clients, packet-processing paths and thread-pool software. They are often triggered by a preempted lock owner, a long critical section, a blocking mutex implementation, a wakeup storm, a burst of synchronized workers or a scheduler policy that makes lock ownership handoff expensive.
Convoy Mechanism
Let the number of waiting tasks behind a lock be:
N_w
and the critical-section hold time be:
t_c
When a lock is released, a waiter must wake, be scheduled, acquire the lock, reload working data and enter the critical section. Let this handoff overhead be:
t_h
The convoy service cycle is:
t_cycle = t_c + t_h
The effective lock throughput becomes:
X_lock = 1 / (t_c + t_h)
If handoff overhead is large relative to useful critical-section work, the convoy can dominate throughput even when the protected code is short.
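The effect of handoff overhead on throughput can be sketched numerically. The snippet below is a minimal illustration, assuming throughput is the reciprocal of hold time plus handoff overhead; the function name and the microsecond values are hypothetical and chosen only to show the shape of the result.

```python
def lock_throughput(hold_s: float, handoff_s: float) -> float:
    """Lock acquisitions per second when each cycle costs hold + handoff."""
    cycle_s = hold_s + handoff_s
    if cycle_s <= 0:
        raise ValueError("cycle time must be positive")
    return 1.0 / cycle_s


# Hypothetical values: the same 5 us critical section under two handoff costs.
hold_s = 5e-6
fast = lock_throughput(hold_s, handoff_s=1e-6)   # cheap handoff
slow = lock_throughput(hold_s, handoff_s=45e-6)  # convoy-style handoff

print(f"cheap handoff:  {fast:,.0f} acquisitions/s")
print(f"costly handoff: {slow:,.0f} acquisitions/s")
print(f"throughput ratio: {fast / slow:.1f}x")
```

The protected code is identical in both cases, so the difference in throughput comes entirely from the handoff term.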
Queue Persistence
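For lock-entry demand, in lock acquisitions requested per second:
λ
convoy utilization, demand multiplied by the hold-plus-handoff cycle, is:
ρ = λ (t_c + t_h)
A necessary stability screen is:
ρ < 1
When:
ρ ≥ 1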
the waiter line persists or grows. A short critical section can still fail if every release forces a context switch, cache miss, priority handoff or wakeup delay before useful work resumes.
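A small deterministic simulation can make queue persistence visible. This is a sketch under simplifying assumptions that are not in the text above: arrivals are evenly spaced, every acquisition costs the same fixed cycle, and the function name and rates are hypothetical.

```python
def backlog_after(arrival_rate: float, cycle_s: float, horizon_s: float) -> int:
    """Waiters still in line at horizon_s for evenly spaced arrivals."""
    arrived = int(arrival_rate * horizon_s)
    lock_free_at = 0.0
    served = 0
    for i in range(arrived):
        arrival = i / arrival_rate
        # A task starts when it has arrived and the lock has been handed over.
        start = max(arrival, lock_free_at)
        finish = start + cycle_s
        if finish > horizon_s:
            break  # this task and everyone behind it are still waiting
        lock_free_at = finish
        served += 1
    return arrived - served


# Hypothetical demand of 20,000 entries/s observed for one second.
stable = backlog_after(20_000, cycle_s=40e-6, horizon_s=1.0)    # utilization 0.8
unstable = backlog_after(20_000, cycle_s=60e-6, horizon_s=1.0)  # utilization 1.2

print("backlog at utilization 0.8:", stable)
print("backlog at utilization 1.2:", unstable)
```

Real arrivals are bursty rather than evenly spaced, so a production queue can form below the utilization at which this idealized model starts to back up.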
Handoff and Preemption
Convoys often start when the lock owner is preempted while holding the mutex, or when many workers wake and contend for the same lock after a burst. The system can then spend too much time in scheduler transitions, lock arbitration and cache migration.
The extra convoy delay for a waiter behind:
k
other waiters is approximated by:
W_k ≈ k (t_c + t_h)
This estimate is simple, but it is useful because it ties tail latency to both hold time and scheduler handoff. Reducing only the critical-section code may not solve the problem if t_h remains dominant.
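The point about the dominant term can be checked with a few lines of arithmetic. The sketch below assumes the delay estimate is the number of waiters ahead multiplied by hold time plus handoff overhead; the function name and all values are hypothetical.

```python
def convoy_delay(waiters_ahead: int, hold_s: float, handoff_s: float) -> float:
    """Approximate extra wait for a task queued behind waiters_ahead tasks."""
    return waiters_ahead * (hold_s + handoff_s)


# Hypothetical convoy: 8 waiters ahead, 5 us hold, 45 us handoff.
baseline = convoy_delay(8, hold_s=5e-6, handoff_s=45e-6)
half_hold = convoy_delay(8, hold_s=2.5e-6, handoff_s=45e-6)
half_handoff = convoy_delay(8, hold_s=5e-6, handoff_s=22.5e-6)

for label, delay in [
    ("baseline", baseline),
    ("critical section halved", half_hold),
    ("handoff halved", half_handoff),
]:
    print(f"{label}: {delay * 1e6:.0f} us")
```

With these assumed numbers, halving the critical section barely moves the estimate, while halving the handoff overhead removes almost half of it.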
Worked Example
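A service enters a shared metrics lock at a measured rate:
λ
The measured useful critical-section time is:
t_c
With a lightweight uncontended path, handoff overhead is a small value:
t_h,light
so:
t_cycle = t_c + t_h,light
and:
ρ = λ (t_c + t_h,light) < 1
The lock is busy but stable in this screen.
During a burst, the owner is often preempted and waiters wake in a convoy. Handoff overhead rises to:
t_h,burst > t_h,light
The cycle becomes:
t_cycle = t_c + t_h,burst
and utilization becomes:
ρ = λ (t_c + t_h,burst) ≥ 1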
The queue is now unstable. The failure is not explained by critical-section code alone. It comes from the combined hold and handoff cycle.
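The same screen can be run with concrete figures. The numbers below are illustrative assumptions, not the measurements behind the example above: an entry rate of 20,000 per second, a 5 us critical section, and handoff overheads of 30 us and 60 us. The sketch also computes the largest handoff overhead the lock can absorb before utilization reaches 1, which follows from rearranging the utilization expression.

```python
# All values are hypothetical and chosen only to illustrate the screen.
ENTRY_RATE = 20_000       # lock entries per second
HOLD_S = 5e-6             # useful critical-section time
HANDOFF_LIGHT_S = 30e-6   # handoff on the uncontended path
HANDOFF_BURST_S = 60e-6   # handoff with owner preemption and convoy wakeups


def utilization(entry_rate: float, hold_s: float, handoff_s: float) -> float:
    """Demand multiplied by the hold-plus-handoff cycle."""
    return entry_rate * (hold_s + handoff_s)


def handoff_budget(entry_rate: float, hold_s: float) -> float:
    """Largest handoff overhead that keeps utilization below 1."""
    return 1.0 / entry_rate - hold_s


rho_light = utilization(ENTRY_RATE, HOLD_S, HANDOFF_LIGHT_S)
rho_burst = utilization(ENTRY_RATE, HOLD_S, HANDOFF_BURST_S)
budget_s = handoff_budget(ENTRY_RATE, HOLD_S)

print(f"light path utilization: {rho_light:.2f}")
print(f"burst utilization:      {rho_burst:.2f}")
print(f"handoff budget:         {budget_s * 1e6:.0f} us")
```

Under these assumptions the critical section is unchanged between the two cases, and the lock crosses from stable to unstable once handoff overhead exceeds the budget.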
Mitigation
Mitigations include shortening or removing the critical section, sharding the lock, replacing shared counters with per-thread accumulation, using lock-free or wait-free structures when appropriate, batching updates outside the lock, avoiding blocking calls inside the lock, reducing synchronized wakeups and checking whether the mutex policy is appropriate for the workload.
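Per-thread accumulation with batched flushes is one of the mitigations listed above, and a minimal sketch of it is shown here. The class name, the flush threshold and the worker counts are hypothetical. Each thread accumulates privately and takes the shared lock only once per batch, which trades freshness of the shared total for far fewer lock entries.

```python
import threading


class BatchedCounter:
    """Shared counter that threads update through private, batched deltas."""

    def __init__(self, flush_every: int = 256) -> None:
        self._lock = threading.Lock()
        self._total = 0
        self._local = threading.local()
        self._flush_every = flush_every

    def add(self, amount: int = 1) -> None:
        local = self._local
        # No lock here: the pending delta is private to the calling thread.
        local.pending = getattr(local, "pending", 0) + amount
        local.updates = getattr(local, "updates", 0) + 1
        if local.updates >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        """Publish the calling thread's pending delta under the shared lock."""
        local = self._local
        pending = getattr(local, "pending", 0)
        local.pending = 0
        local.updates = 0
        if pending:
            with self._lock:
                self._total += pending

    def total(self) -> int:
        with self._lock:
            return self._total


def worker(counter: BatchedCounter, updates: int) -> None:
    for _ in range(updates):
        counter.add()
    counter.flush()  # publish the remainder before the thread exits


counter = BatchedCounter(flush_every=256)
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("total:", counter.total())
```

A reader of `total()` sees only flushed values, so this design suits metrics and statistics better than counters that must be exact at every instant.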
Real-time systems may need priority inheritance, priority ceiling or nonblocking designs to avoid convoy behavior interacting with deadline tasks. Service systems may need admission control, worker isolation or backpressure so bursts do not wake too many workers into the same serialized path.
Validation Evidence
Validation should include lock owner traces, waiter count over time, hold-time distribution, handoff delay, context-switch count, run-queue length, owner preemption events, CPU migration, cache-miss evidence, p95 and p99 lock wait and throughput before and after mitigation.
A useful convoy test reproduces the burst or scheduling condition that creates the persistent waiter line. A microbenchmark with a quiet scheduler can miss the convoy because it does not reproduce the wakeup, preemption or thread-pool behavior of production.
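Some of the evidence listed above, such as hold-time and lock-wait distributions, can be collected with a thin wrapper around the lock. The sketch below is a hypothetical test harness, not a production profiler: `TimedLock`, `percentile` and `run_burst` are illustrative names, the sleep stands in for critical-section work, and a barrier releases all workers together to imitate a synchronized burst. It does not capture scheduler events such as owner preemption or CPU migration, which need operating-system tracing.

```python
import threading
import time


class TimedLock:
    """Mutex wrapper that records per-acquisition wait and hold times."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._acquired_at = 0.0
        self.wait_s: list[float] = []
        self.hold_s: list[float] = []

    def __enter__(self) -> "TimedLock":
        requested = time.perf_counter()
        self._lock.acquire()
        # Samples are appended while holding the lock, so the lists are safe.
        self._acquired_at = time.perf_counter()
        self.wait_s.append(self._acquired_at - requested)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.hold_s.append(time.perf_counter() - self._acquired_at)
        self._lock.release()


def percentile(samples: list[float], fraction: float) -> float:
    """Simple rank-based percentile over the recorded samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]


def run_burst(workers: int, iterations: int, hold_s: float) -> TimedLock:
    lock = TimedLock()
    barrier = threading.Barrier(workers)

    def worker() -> None:
        barrier.wait()  # release every worker at once, like a wakeup burst
        for _ in range(iterations):
            with lock:
                time.sleep(hold_s)  # stand-in for critical-section work

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock


lock = run_burst(workers=4, iterations=25, hold_s=0.0005)
print(f"acquisitions: {len(lock.wait_s)}")
print(f"p50 wait: {percentile(lock.wait_s, 0.50) * 1e3:.2f} ms")
print(f"p99 wait: {percentile(lock.wait_s, 0.99) * 1e3:.2f} ms")
print(f"p99 hold: {percentile(lock.hold_s, 0.99) * 1e3:.2f} ms")
```

Comparing the wait percentiles against the hold percentiles gives a first indication of how much of the lock wait is handoff and queueing rather than protected work.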
Relationship To Neighbor Terms
Lock contention is the general case of multiple tasks waiting for a lock. A lock convoy is a specific contention pattern where waiters remain lined up because release and scheduling handoff are costly or unfair. Task starvation can occur if some waiters are repeatedly skipped. Priority inversion can occur if a low-priority lock owner blocks higher-priority work. Livelock is active no-progress behavior, while a convoy still usually completes work, just with poor throughput and tail latency.
Deadlock is different: in a convoy, the lock eventually changes owner. In a deadlock, the waiting cycle cannot resolve under the same state.
Common Mistakes
The most common mistake is looking only at mean lock hold time. Another is adding worker threads, which can increase the number of waiters and make the convoy worse. A third is measuring lock performance without scheduler traces. A fourth is assuming a fair mutex always improves latency; fairness can reduce starvation while still increasing handoff cost.
A strong lock-convoy review states the lock, owner behavior, waiter count, hold time, handoff overhead, scheduler evidence, throughput loss, mitigation and regression test.