Glossary term
Fencing
Engineering definition of fencing covering stale leader isolation, storage fencing, output fencing, tokens, leases and split-brain prevention.
Definition
Fencing is the action that prevents an old, failed, partitioned or stale authority from continuing to write, command or own a function after another authority takes over.
Fencing is used in distributed systems, storage clusters, control systems, failover architectures and protection systems to prevent split-brain behavior. It can revoke storage access, disable outputs, remove credentials, isolate a network path, invalidate a lease, enforce a newer token, inhibit a command path or move equipment to a safe state. A fencing claim is credible only when the rejected side is proven unable to act.
Fencing is the operational mechanism that makes leader election and failover safe.
Without fencing, a new leader may start correctly while the old leader is still able to write to storage, command equipment, advertise a route, publish state or acknowledge requests. That is a split-brain risk even if the new leader was elected by quorum.
Fencing is separate from fault detection. Detection says that something may be wrong; fencing changes the authority boundary so the wrong or stale side cannot keep acting. A design that detects a failed node but leaves its outputs, credentials or write path active has not completed failover.
Authority Condition
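For a single-authority function, safe activation requires:
$$A_{\text{new}} \Rightarrow \neg A_{\text{old}}$$
where $A_{\text{new}}$ is true when the new authority is active and $A_{\text{old}}$ is true when the old authority is still able to act. A design that cannot prove this condition should not claim safe failover.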
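The condition can be sketched as a guard evaluated before activation. This is a minimal illustration, not a definitive implementation: `AuthorityState`, its fields and both function names are hypothetical, and how the "proven" flag is obtained (storage rejection, output inhibit readback, and so on) is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityState:
    # All field names are hypothetical and chosen for this sketch only.
    new_active: bool        # the new authority is active
    old_can_act: bool       # the old authority can still write or command
    old_fence_proven: bool  # evidence exists that the old side is rejected


def safe_to_activate(state: AuthorityState) -> bool:
    """Allow activation only when the old side is proven unable to act."""
    # Unproven fencing is treated the same as no fencing.
    return state.old_fence_proven and not state.old_can_act


def violates_single_authority(state: AuthorityState) -> bool:
    """Dual authority: the new side is active while the old side can still act."""
    return state.new_active and state.old_can_act
```

Treating missing evidence as a failed fence keeps the guard conservative: the new authority waits until rejection of the old side has been demonstrated.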
Fencing Types
Storage fencing blocks stale writes to shared storage. Output fencing disables physical outputs or command paths. Credential fencing revokes tokens, certificates, roles or write permissions. Network fencing removes a route, port, VLAN, session or address. Lease fencing invalidates the old authority once a bounded lease time, plus an allowance for clock uncertainty, has elapsed.
Control-system fencing may inhibit an output module, de-energize a relay, command a safe state, block a bus master or require manual authority transfer. The right method depends on what can cause harm if the old authority continues.
The protected resource should enforce the fence as close to the hazard as practical. A database should reject stale write tokens. An actuator path should reject stale command authority. A network should stop advertising the old route. A human procedure should make the rejected authority visible and unambiguous.
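Lease fencing in particular lends itself to a short sketch. The function below assumes integer millisecond timestamps and a known worst-case clock disagreement between nodes; the function name and parameters are hypothetical, and a real system would also need a trustworthy bound on that skew.

```python
def lease_safely_expired(now_ms: int, lease_expiry_ms: int, max_clock_skew_ms: int) -> bool:
    """Return True once the old lease has expired under worst-case clock skew.

    The new authority waits for the expiry time plus the skew allowance,
    because the old holder's clock may be running behind by up to
    max_clock_skew_ms and it may still believe its lease is valid.
    """
    if max_clock_skew_ms < 0:
        raise ValueError("clock skew bound must be non-negative")
    return now_ms >= lease_expiry_ms + max_clock_skew_ms
```

For example, with a lease expiring at 10 000 ms and a 200 ms skew bound, the old authority is only treated as fenced from 10 200 ms onward.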
Token Fencing
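Many systems use monotonic fencing tokens. A new authority receives a token $T_{\text{new}}$ and old commands carry a token $T_{\text{old}}$, with:
$$T_{\text{new}} > T_{\text{old}}$$
The receiver accepts a command carrying token $T_{\text{cmd}}$ only from the newer authority:
$$T_{\text{cmd}} \ge T_{\text{new}}$$
and rejects stale commands:
$$T_{\text{cmd}} < T_{\text{new}}$$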
The token must be checked by the resource that can be damaged or corrupted, not only by the failover coordinator.
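A minimal sketch of resource-side enforcement, assuming integer tokens that only ever increase. `FencedResource` and its `write` method are hypothetical names; a real store would persist the highest token and perform the comparison atomically with the write.

```python
class FencedResource:
    """Hypothetical protected resource that checks fencing tokens itself."""

    def __init__(self):
        self.highest_token = 0  # highest token accepted so far
        self.log = []           # payloads of accepted writes

    def write(self, token: int, data: str) -> bool:
        """Accept the write only if its token is not older than the newest seen."""
        if token < self.highest_token:
            return False  # stale authority: reject, even if the coordinator erred
        self.highest_token = token
        self.log.append(data)
        return True


resource = FencedResource()
resource.write(34, "from new leader")   # accepted, highest token becomes 34
resource.write(33, "from old leader")   # rejected as stale
```

Because the comparison happens inside the resource, a delayed write from the old leader is rejected even when the old leader never learned that it was replaced.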
Timing Rule
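If failover activation is scheduled at time $t_{\text{act}}$ and fencing plus verification finish at time $t_{\text{fc}}$, both measured from the same reference instant, then safe activation requires:
$$t_{\text{act}} \ge t_{\text{fc}}$$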
If the old side cannot be fenced in time, the system should wait, degrade or move to safe state instead of allowing dual authority.
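The rule can be sketched as a small decision function. The names `Decision` and `activation_decision` are hypothetical, times are assumed to be integer milliseconds on a shared reference, and `None` stands for "fencing has not been verified yet".

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ACTIVATE = "activate"
    HOLD = "hold"  # wait, degrade or move to safe state


def activation_decision(activation_time_ms: int,
                        fence_verified_time_ms: Optional[int]) -> Decision:
    """Activate only if fencing was verified at or before the activation time."""
    if fence_verified_time_ms is None:
        return Decision.HOLD  # fencing unconfirmed: never allow dual authority
    if activation_time_ms >= fence_verified_time_ms:
        return Decision.ACTIVATE
    return Decision.HOLD
```

What `HOLD` means in practice (waiting, degraded operation or a safe state) depends on the function being protected.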
Worked Example
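A controller cluster detects primary failure after $t_{\text{detect}}$. Leader election takes $t_{\text{elect}}$, output fencing takes $t_{\text{fence}}$ and verification takes $t_{\text{verify}}$. The new controller is scheduled to activate after $t_{\text{act}}$, measured from the failure.
The fencing completion time is:
$$t_{\text{fc}} = t_{\text{detect}} + t_{\text{elect}} + t_{\text{fence}} + t_{\text{verify}}$$
The activation margin is:
$$m = t_{\text{act}} - t_{\text{fc}}$$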
This is technically positive but weak. The team should either increase the activation delay, speed up fencing or provide stronger evidence that the old controller cannot command.
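Now consider stale command exposure. If the old controller could emit commands at a rate $r$ during the fencing and verification interval:
$$\Delta t = t_{\text{fence}} + t_{\text{verify}}$$
then commands at risk without effective fencing would be:
$$N_{\text{risk}} = r \cdot \Delta t$$
The purpose of fencing is to make accepted stale commands:
$$N_{\text{accepted}} = 0$$
not merely to reduce their probability.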
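The arithmetic of the example can be sketched in a few lines. Every number below is an illustrative assumption chosen for this snippet, not a value taken from the example, and the variable names are hypothetical. It assumes the four phases run one after another, starting at the failure.

```python
# Illustrative values only (assumptions for this sketch), in milliseconds.
detect_ms = 200      # time to detect primary failure
elect_ms = 300       # leader election
fence_ms = 150       # output fencing
verify_ms = 100      # verification that the old side is rejected
activate_ms = 800    # scheduled activation of the new controller

# Fencing is complete only after every sequential phase has finished.
fence_complete_ms = detect_ms + elect_ms + fence_ms + verify_ms
margin_ms = activate_ms - fence_complete_ms

# Stale command exposure while fencing and verification are in progress.
old_rate_per_s = 20                                   # assumed command rate of the old controller
exposure_ms = fence_ms + verify_ms
commands_at_risk = old_rate_per_s * exposure_ms / 1000

print(f"fencing complete at {fence_complete_ms} ms, margin {margin_ms} ms")
print(f"commands at risk without effective fencing: {commands_at_risk:g}")
```

With these assumed values the margin is 50 ms out of an 800 ms activation delay, which illustrates how a margin can be positive yet leave little room for a slow fence or a delayed verification.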
Validation Evidence
Useful evidence includes failed-node isolation tests, storage write rejection, output inhibit tests, token monotonicity logs, credential revocation logs, network isolation records, lease-expiry tests, clock-skew tests, operator transfer drills and proof that the rejected side cannot act after recovery.
The validation should test the side being rejected. A failover test that only shows the new leader working does not prove fencing.
Common Mistakes
Do not assume that stopping a process fences the node. The old authority may continue through a delayed write, cached credential, stuck output, network route, operator session or stale lease. Do not trust a coordinator-only token if the protected resource does not enforce it. Do not allow manual recovery to bypass fencing because the incident is urgent.
Fencing is a control boundary. A strong design states exactly what is fenced, who verifies it, how long it takes, what evidence proves rejection and what the system does when fencing cannot be confirmed.