Glossary term

Deadlock

Engineering definition of deadlock covering wait-for graphs, circular wait, resource ordering, detection, recovery and validation evidence.

Definition

phenomenon

A deadlock is a condition in which tasks, threads or processes cannot make progress because each is waiting for a resource or event held by another member of the same waiting cycle.

Deadlock appears in operating systems, concurrent services, embedded firmware, distributed protocols, control gateways and resource schedulers when mutual exclusion, hold-and-wait, no preemption and circular wait combine. A useful analysis states the resources, owners, waiters, wait-for cycle, recovery behavior, timeout limits, watchdog behavior, state consistency and validation evidence.

A deadlock is a condition in which tasks, threads or processes cannot make progress because each is waiting for a resource or event held by another member of the same waiting cycle. The system may be busy, idle or partially responsive, but the blocked tasks cannot complete without external intervention or recovery logic.

Deadlocks appear in operating systems, concurrent services, embedded firmware, distributed protocols, control gateways and resource schedulers. They are correctness and availability failures, not only performance problems. A slow lock eventually releases; a true deadlock does not resolve by waiting under the same state.

Wait-For Graph

A deadlock can be represented with a wait-for graph:

G_w=(V,E)

where vertices are tasks, threads or processes, and an edge:

T_i\rightarrow T_j

means task T_i is waiting for a resource held by task T_j.

A deadlock exists when the wait-for graph contains a directed cycle:

T_1\rightarrow T_2\rightarrow \cdots \rightarrow T_k\rightarrow T_1

No task in the cycle can proceed because each required release depends on another blocked task.

Necessary Conditions

Classic deadlock requires four conditions to hold together:

  1. mutual exclusion: at least one resource cannot be shared at the same time;
  2. hold and wait: a task holds one resource while waiting for another;
  3. no preemption: the system cannot safely force release of the held resource;
  4. circular wait: tasks form a waiting cycle.

Breaking any one condition can prevent that class of deadlock. Engineering designs usually target circular wait with resource ordering, or hold-and-wait with acquire-all-or-release behavior.

Resource Ordering

If resources have a strict global order:

R_1<R_2<\cdots<R_n

and every task acquires resources only in increasing order, then circular wait through those resources is prevented.

For a task requesting resources:

R_a,\ R_b

the rule requires:

R_a<R_b

before acquiring R_b while holding R_a. If a task needs the reverse order, it should release, reorder, or use a design that avoids holding both.

Detection and Recovery

Detection scans the wait-for graph for cycles. If cycle count is:

C_w

then:

C_w>0

indicates at least one deadlock candidate. Recovery may abort one task, roll back a transaction, release a resource, restart a component, trip a watchdog, enter degraded mode or move to a safe state.

Recovery time should fit the system objective:

T_{detect}+T_{recover}\leq T_{allowed}

If recovery leaves state inconsistent, the deadlock handling may create a second failure.

Worked Example

Three tasks hold and wait for three resources:

TaskHoldsWaits for
T_1R_AR_B
T_2R_BR_C
T_3R_CR_A

The wait-for edges are:

T_1\rightarrow T_2
T_2\rightarrow T_3
T_3\rightarrow T_1

Together:

T_1\rightarrow T_2\rightarrow T_3\rightarrow T_1

This directed cycle is a deadlock.

Suppose the detection interval is:

T_{detect}=0.25\ \text{s}

and rollback plus restart takes:

T_{recover}=1.40\ \text{s}

The allowed degraded recovery time is:

T_{allowed}=3.00\ \text{s}

The recovery screen is:

0.25+1.40=1.65\ \text{s}\leq3.00\ \text{s}

The timing passes, but the release case still needs state-consistency evidence after rollback.

Prevention

Prevention techniques include global resource ordering, lock hierarchy checks, try-lock with release on failure, bounded wait with safe rollback, single-owner resource design, message passing, transactional isolation, lock-free data structures and eliminating nested resource acquisition.

Timeouts alone do not prove absence of deadlock. A timeout can break a wait, but it can also leave partial state, duplicate work or inconsistent ownership. A timeout policy must define what is rolled back, what is retried, what is cancelled and what evidence proves recovery.

Relationship To Neighbor Terms

Lock contention is waiting on a lock that eventually releases. Deadlock is cyclic waiting that cannot resolve under the same state. Race conditions are timing-dependent correctness failures caused by insufficient ordering; deadlock can happen even when locking is added to prevent races. Watchdog timers can detect stalled execution, but they are recovery mechanisms, not proof that deadlock cannot occur.

Validation Evidence

Validation should include lock-order review, static analysis where available, runtime deadlock detection, fault injection, cancellation tests, timeout tests, watchdog reset tests, rollback tests, retained diagnostic context and state-invariant checks after recovery. Evidence should identify the resources, owners, waiters, blocked stack traces and recovery path.

A useful deadlock test does not only wait for a timeout. It creates the risky acquisition pattern, proves the cycle is detected or prevented, and verifies that the system returns to a consistent state.

Common Mistakes

The most common mistake is treating every long wait as a deadlock. A slow dependency, saturated queue or contended lock may eventually release. Another mistake is relying on a watchdog reset without preserving enough context to identify the cycle. A third is adding locks to remove a race condition without checking resource ordering. A fourth is breaking a deadlock by killing work but not validating rollback.

A strong deadlock review states the resource graph, ordering rule, detection method, recovery action, timeout budget, safe-state behavior and validation evidence.

REF

See also