Glossary term

Distributed Lock

Engineering definition of a distributed lock covering exclusive ownership, leases, fencing tokens, stale owner rejection, lock-service capacity and validation.

Definition

A distributed lock is a coordination mechanism that grants one process, node or service instance temporary exclusive ownership of a shared resource across a distributed system.

Distributed locks appear in clustered schedulers, replicated services, databases, cache rebuilds, leader-adjacent coordination, migration jobs, singleton workers and operational automation. A useful design states the protected resource, owner identity, acquisition rule, lease duration, renewal rule, fencing-token enforcement, timeout behavior, retry behavior, failure consequence, clock assumptions and validation evidence.

A distributed lock is the distributed analogue of a mutex, but the failure model is much harder because messages, clocks, processes and networks can fail independently.

Distributed locks are used for singleton jobs, cache rebuilds, schema migrations, batch ownership, partition assignment, shared-device access and operational automation. They should not be treated as magic mutual exclusion. A lock is credible only if the protected resource rejects stale owners or the consequence of duplicate ownership is harmless.

Ownership Contract

For a protected resource:

R

and a set of owners:

O

the safety requirement is:

\left|\{\,o\in O:\mathrm{valid}(o,R,t)\,\}\right|\leq 1

at any time:

t

where valid means the owner is allowed to act on the protected resource. A lock record or status display that names an owner is not enough. The resource boundary must agree with the lock boundary.
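The requirement can be checked against recorded ownership intervals. The following is a minimal sketch, assuming each period of validity is logged as a half-open interval from start to end; the names OwnershipInterval, max_concurrent_owners and satisfies_contract are hypothetical and introduced only for illustration.

```python
from typing import List, NamedTuple


class OwnershipInterval(NamedTuple):
    """Hypothetical log record: one owner's period of validity for R."""
    owner: str
    start: float  # time at which the owner became valid
    end: float    # time at which validity ended (exclusive)


def max_concurrent_owners(intervals: List[OwnershipInterval]) -> int:
    """Return the largest number of owners valid at the same instant."""
    events = []
    for iv in intervals:
        events.append((iv.start, 1))   # validity begins
        events.append((iv.end, -1))    # validity ends
    # Sort by time. At equal times, ends (-1) sort before starts (+1),
    # so back-to-back intervals are not counted as overlapping.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def satisfies_contract(intervals: List[OwnershipInterval]) -> bool:
    """True when at most one owner was valid at any time."""
    return max_concurrent_owners(intervals) <= 1
```

Such a check is only as good as the intervals fed to it. Intervals taken from the lock service alone show what the service believed, not what the resource accepted.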

Acquisition Path

A lock acquisition usually has request, coordination, commit and reply phases. A first-pass estimate of acquisition latency is:

T_{acq}=T_{req}+T_{coord}+T_{commit}+T_{reply}

If a quorum-backed lock service is used, the coordination phase may include voting, log replication or conditional write latency. If a single cache key is used, acquisition may be faster but its safety depends on the cache failure semantics and the resource being protected.

The acquisition result should include an owner identity and a monotonic fencing token:

(owner,F)

The token is more important than the friendly owner name because it lets the resource reject stale commands.
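The shape of an acquisition result can be illustrated with a single-process stand-in for a lock service. This is a sketch under stated assumptions, not a distributed implementation: InMemoryLockService and LockGrant are hypothetical names, there is no lease expiry, and a real service would persist the token counter through its coordination layer.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LockGrant:
    """Hypothetical acquisition result: the pair (owner, F)."""
    owner: str
    token: int  # fencing token F, strictly increasing across grants


class InMemoryLockService:
    """Single-process stand-in for a lock service (illustration only)."""

    def __init__(self) -> None:
        self._tokens = itertools.count(1)  # monotonic token source
        self._holder: Optional[LockGrant] = None

    def acquire(self, owner: str) -> Optional[LockGrant]:
        if self._holder is not None:
            return None  # lock is held; the caller decides whether to retry
        self._holder = LockGrant(owner, next(self._tokens))
        return self._holder

    def release(self, grant: LockGrant) -> bool:
        # Release only when the caller presents the current grant, so a
        # previous owner cannot delete a lock that someone else now holds.
        if self._holder == grant:
            self._holder = None
            return True
        return False
```

Every successful acquisition yields a larger token than the previous one, which is the property the protected resource relies on.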

Lease Renewal

Many distributed locks use a lease. The owner holds authority for a bounded duration:

T_L

and attempts renewal every:

T_r

If maximum clock uncertainty is:

\epsilon

network delay allowance is:

d_n

and process pause allowance is:

t_p

a simple lease margin screen is:

M_L=T_L-(T_r+2\epsilon+d_n+t_p)

The screen is acceptable only when:

M_L>0

This is not a proof of safety. It is a check that the lease duration is not obviously shorter than the renewal and uncertainty budget.
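The screen is simple enough to encode directly. The sketch below assumes all quantities are in seconds; the function and parameter names are illustrative and not taken from any library.

```python
def lease_margin(lease_s: float, renew_s: float, clock_uncertainty_s: float,
                 net_delay_s: float, pause_s: float) -> float:
    """M_L = T_L - (T_r + 2*epsilon + d_n + t_p), all values in seconds."""
    budget = renew_s + 2 * clock_uncertainty_s + net_delay_s + pause_s
    return lease_s - budget


def passes_lease_screen(lease_s: float, renew_s: float,
                        clock_uncertainty_s: float, net_delay_s: float,
                        pause_s: float) -> bool:
    """True when the margin is strictly positive (M_L > 0)."""
    return lease_margin(lease_s, renew_s, clock_uncertainty_s,
                        net_delay_s, pause_s) > 0
```

A failing screen is a reason to revisit the lease or renewal interval. A passing screen is only a reason to proceed to testing.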

Fencing Tokens

A fencing token is a monotonic value issued with the lock. The protected resource stores the highest accepted token:

F_{accepted}

and accepts a command only when:

F_{cmd}\geq F_{accepted}

It rejects stale owners when:

F_{cmd}<F_{accepted}

This matters when an old owner pauses, loses connectivity, misses its renewal, and later resumes. Without token enforcement at the resource, the old owner may continue writing even though another process acquired the lock.
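The acceptance rule can be sketched at the resource side as follows. This is a minimal illustration, assuming the compare-and-store step is atomic within the resource; FencedResource and its fields are hypothetical names.

```python
class FencedResource:
    """Hypothetical protected resource that enforces fencing tokens."""

    def __init__(self) -> None:
        self.accepted_token = 0  # F_accepted: highest token accepted so far
        self.value = None

    def write(self, token: int, value) -> bool:
        """Apply the write only when F_cmd >= F_accepted."""
        if token < self.accepted_token:
            return False  # stale owner: reject the command
        # Assumption: this check-and-update is atomic in a real resource.
        self.accepted_token = token
        self.value = value
        return True


resource = FencedResource()
resource.write(33, "from old owner")     # accepted
resource.write(34, "from new owner")     # accepted, raises F_accepted to 34
resource.write(33, "old owner resumes")  # rejected, value is unchanged
```

Accepting equal tokens lets the current owner issue several commands under one grant, while any command carrying an older token is refused regardless of what the old owner believes about its lease.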

Capacity and Contention

If arrivals request the lock at rate:

\lambda

and the average protected hold time is:

T_h

the serial utilization screen is:

\rho=\lambda T_h

When:

\rho\geq1

requests arrive at least as fast as the lock can serve them, so the lock is saturated and wait time grows until load drops. Retry storms can make this worse because failed contenders may retry together after the same timeout.

For:

\lambda=180\ \text{requests/s},\quad T_h=0.004\ \text{s}

the utilization is:

\rho=180\cdot0.004=0.72

If a dependency slowdown raises hold time to:

T_h=0.008\ \text{s}

then:

\rho=180\cdot0.008=1.44

and the lock becomes an overload amplifier.
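The same screen can be computed in a few lines, which makes it easy to re-run against measured hold times. This is a sketch with illustrative function names; it assumes a single lock serving requests strictly one at a time.

```python
def serial_utilization(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """rho = lambda * T_h for a lock that serves one holder at a time."""
    return arrival_rate_per_s * hold_time_s


def is_saturated(arrival_rate_per_s: float, hold_time_s: float) -> bool:
    """True when rho >= 1."""
    return serial_utilization(arrival_rate_per_s, hold_time_s) >= 1.0


def saturation_hold_time_s(arrival_rate_per_s: float) -> float:
    """Hold time at which rho reaches 1 for a given arrival rate."""
    return 1.0 / arrival_rate_per_s


normal = serial_utilization(180, 0.004)    # about 0.72
degraded = serial_utilization(180, 0.008)  # about 1.44
limit = saturation_hold_time_s(180)        # about 0.0056 s
```

At 180 requests per second the lock saturates once the average hold time reaches roughly 5.6 ms, so the 4 ms baseline leaves little room for a dependency slowdown.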

Lease Example

A lock lease has:

T_L=10.0\ \text{s}

renewal interval:

T_r=4.0\ \text{s}

clock uncertainty:

\epsilon=0.2\ \text{s}

network allowance:

d_n=0.3\ \text{s}

and process pause allowance:

t_p=1.0\ \text{s}

The lease margin is:

M_L=10.0-(4.0+2(0.2)+0.3+1.0)=4.3\ \text{s}

The margin is positive, but production release would still require pause tests, clock-skew tests, partition tests and resource-side token rejection.

Boundary With Leader Election

Leader election chooses authority for a role or epoch. A distributed lock grants temporary ownership of a resource or task. Some systems implement leadership with a lock. In either case the safety requirements are the same: quorum or lease assumptions, monotonic terms or tokens, stale-owner rejection and clear behavior under partition.

Do not use a lock to hide a missing idempotency rule. If a locked operation can be retried after an ambiguous timeout, the command still needs a request identity, fencing token or compensating rule.
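A request identity can be illustrated with a small deduplicating executor. This is a sketch under stated assumptions: IdempotentExecutor is a hypothetical name, results are kept in process memory, and a real system would need to record the request identity durably and atomically with the effect of the operation.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Hypothetical executor that applies each request identity at most once."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # request_id -> stored result

    def execute(self, request_id: str, operation: Callable[[], Any]) -> Any:
        if request_id in self._results:
            # Retry of a request that already ran: return the stored result
            # instead of running the operation a second time.
            return self._results[request_id]
        result = operation()
        self._results[request_id] = result
        return result


applied = []
executor = IdempotentExecutor()
executor.execute("migrate-0042", lambda: applied.append("step") or "done")
# A retry after an ambiguous timeout reuses the same request identity.
executor.execute("migrate-0042", lambda: applied.append("step") or "done")
```

After both calls the operation has run once, so a retry that follows an ambiguous timeout does not repeat the side effect even if the lock was reacquired in between.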

Validation

Validation should include concurrent acquisition attempts, owner crash, owner pause, network partition, delayed renewal, duplicate request, stale owner resume, lock-service restart, clock skew, resource-side token rejection, client timeout, retry storm and manual recovery tests.

Useful evidence includes acquisition latency distribution, lease-renewal margin, token monotonicity logs, rejected stale commands, lock wait time, hold-time distribution, queue depth, timeout count, retry rate, failover traces and proof that the protected resource cannot be changed by an expired owner.

Failure Modes

Common failure modes include relying on wall-clock expiry without clock uncertainty, deleting a lock that another owner renewed, accepting stale writes because the resource never checks a fencing token, setting a lease shorter than realistic process pauses, using a single lock service as a hidden single point of failure, retrying all contenders at once, holding the lock while calling a slow dependency, and assuming that lock acquisition means the previous owner has actually stopped.
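One common mitigation for contenders retrying at once is randomized exponential backoff. The sketch below uses the "full jitter" variant, in which each delay is drawn uniformly between zero and a capped exponential ceiling; the function name and the default base and cap values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0,
                  rng=random) -> float:
    """Return a randomized retry delay in seconds for a 0-based attempt."""
    # The ceiling doubles with each attempt until it reaches the cap.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Drawing from [0, ceiling] spreads contenders out instead of having
    # them all retry after the same fixed timeout.
    return rng.uniform(0.0, ceiling)
```

A contender would sleep for backoff_delay(attempt) seconds before each new acquisition attempt and give up after a bounded number of attempts. Passing a seeded random.Random instance as rng makes the delays reproducible in tests.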

A distributed lock is a narrow coordination tool. It can be appropriate when the consequence of duplicate ownership is bounded and that bound has been validated. It is dangerous when used as a substitute for consensus, fencing, idempotency, compensation or resource-level safety checks.
