Glossary term
Failover
Engineering definition of failover covering detection, switchover time, backup capacity, split-brain prevention, RTO, RPO and validation.
Definition
Failover is the controlled transfer of a required function from a failed or degraded primary path, component or service to a backup path, component or service.
Failover is used in distributed systems, control systems, telecommunications, protection architectures, power systems and operational resilience. It includes failure detection, decision logic, authority transfer, state transfer, backup-capacity checks, service recovery, alarm behavior, failback rules and validation under realistic load. A backup improves resilience only if it is sufficiently independent of the primary, adequately sized, and tested for the operating mode being claimed.
Failover is not the same as simply having a spare. The fault must be detected, and the backup selected, authorized, loaded and verified, fast enough for the requirement.
Failover appears in distributed services, PLC and controller architectures, telecom transport, route diversity, backup power, protection relays, data acquisition, timing systems and safety-related equipment. The engineering question is what users, operators, controllers or downstream systems observe during the transition.
Failover Time
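A useful failover-time model is:
$$T_{\mathrm{failover}} = T_{\mathrm{detect}} + T_{\mathrm{decide}} + T_{\mathrm{transfer}} + T_{\mathrm{recover}}$$
where $T_{\mathrm{detect}}$ is the time for detection to find the fault, $T_{\mathrm{decide}}$ is the time for decision logic to choose the backup, $T_{\mathrm{transfer}}$ is the time to change authority or routing, and $T_{\mathrm{recover}}$ is the time until service metrics are stable.
The recovery-time objective (RTO) is met only if:
$$T_{\mathrm{failover}} \le T_{\mathrm{RTO}}$$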
An alarm timestamp is not enough evidence. The measured value should include the interval in which traffic, control output, timing, data consistency or operator authority is actually affected.
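The timing model and the point about measuring the affected interval can be sketched in Python. This is a minimal illustration, not a reference implementation: the class `FailoverTiming`, the function `impact_interval_s` and the probe format (timestamp in seconds plus a pass/fail flag) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FailoverTiming:
    """Measured transition components in seconds (hypothetical field names)."""
    detect_s: float
    decide_s: float
    transfer_s: float
    recover_s: float

    def total_s(self) -> float:
        # Total failover time is the sum of the four components.
        return self.detect_s + self.decide_s + self.transfer_s + self.recover_s

    def meets_rto(self, rto_s: float) -> bool:
        # The objective is met only if the total does not exceed the RTO.
        return self.total_s() <= rto_s


def impact_interval_s(probes: Sequence[Tuple[float, bool]]) -> float:
    """Service-affecting interval from time-ordered (timestamp_s, ok) probes.

    Measured from the first failed probe to the first successful probe
    after the last failure, so it reflects what users observed rather
    than when an alarm was raised.
    """
    failed = [t for t, ok in probes if not ok]
    if not failed:
        return 0.0
    first_bad, last_bad = failed[0], failed[-1]
    recovered = [t for t, ok in probes if ok and t > last_bad]
    if not recovered:
        raise ValueError("service did not recover within the probe window")
    return recovered[0] - first_bad
```

The probe-based measurement is only as fine as the probe spacing, so the probe period should be small compared with the RTO being verified.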
Backup Capacity
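Failover can preserve reachability while still failing the service. The backup must have enough capacity for the protected demand:
$$C_{\mathrm{backup}} \ge D_{\mathrm{protected}}$$
If the backup is intentionally degraded, the reduced requirement should be explicit:
$$C_{\mathrm{backup}} \ge D_{\mathrm{degraded}}, \qquad D_{\mathrm{degraded}} < D_{\mathrm{protected}}$$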
This distinction matters in packet networks, standby pumps, backup controllers, emergency ventilation, cloud services and operational support teams. A backup path that carries alarms but not production load is a degraded-mode strategy, not full failover.
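The distinction between full failover and a degraded-mode strategy can be made explicit in a capacity check. The sketch below is illustrative only; the function name `classify_backup` and the three labels are assumptions, and capacity and demand can be in any unit as long as both use the same one.

```python
from typing import Optional


def classify_backup(
    capacity: float,
    protected_demand: float,
    degraded_demand: Optional[float] = None,
) -> str:
    """Classify a backup path by what it can actually carry.

    Returns "full" if the backup covers the protected demand,
    "degraded" if it only covers an explicitly declared reduced demand,
    and "insufficient" otherwise.
    """
    if capacity >= protected_demand:
        return "full"
    # A degraded mode is only credited when its reduced demand is stated.
    if degraded_demand is not None and capacity >= degraded_demand:
        return "degraded"
    return "insufficient"
```

Requiring `degraded_demand` to be passed explicitly mirrors the rule that a reduced requirement should be declared rather than discovered during an outage.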
State and Data Loss
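For stateful systems, failover also has a recovery-point objective (RPO). If updates arrive at rate:
$$\lambda \quad \text{(records per unit time)}$$
and replication lag is:
$$T_{\mathrm{lag}}$$
then the records at risk are:
$$N_{\mathrm{risk}} = \lambda \, T_{\mathrm{lag}}$$
The recovery-point objective is met when:
$$N_{\mathrm{risk}} \le N_{\mathrm{RPO}}$$
where $N_{\mathrm{RPO}}$ is the largest number of lost records the objective tolerates.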
The same idea applies outside databases: queued commands, unsent telemetry, unacknowledged protection events, unlatched alarms and historian gaps can all become hidden state loss.
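The records-at-risk estimate can be computed from measured replication lag. The sketch below is a hedged illustration: the function names are hypothetical, and using the worst observed lag rather than the average is a conservative assumption made here, not something the definition above prescribes.

```python
from typing import Iterable


def records_at_risk(update_rate_per_s: float, replication_lag_s: float) -> float:
    """Records not yet replicated: update rate multiplied by replication lag."""
    return update_rate_per_s * replication_lag_s


def meets_rpo(
    update_rate_per_s: float,
    lag_samples_s: Iterable[float],
    rpo_records: float,
) -> bool:
    """Check the RPO against the worst lag sample (conservative assumption)."""
    worst_lag_s = max(lag_samples_s)
    return records_at_risk(update_rate_per_s, worst_lag_s) <= rpo_records
```

The same calculation applies to queued commands or unsent telemetry if "records" is read as whatever unit of state can be lost.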
Split-Brain and Authority
Failover must prevent two active authorities from commanding the same function unless the design is explicitly multi-master. In software this can create split-brain writes. In control systems it can create competing commands. In power or telecom systems it can create unstable switching or route flapping.
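A simple authority rule is:
$$N_{\mathrm{active}} \le 1$$
for single-master control, where $N_{\mathrm{active}}$ is the number of authorities commanding the function at any instant. If more than one active path is allowed, the arbitration, merge rule and conflict resolution must be tested.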
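One common way to enforce single-master authority in software is a fencing token: each transfer of authority issues a higher epoch number, and the commanded resource rejects anything carrying an older epoch. The sketch below illustrates that technique under stated assumptions; it is not described in the text above, and the classes `Arbiter` and `Actuator` are hypothetical.

```python
from typing import List, Optional


class Arbiter:
    """Grants authority to one node at a time and issues increasing epochs."""

    def __init__(self) -> None:
        self._epoch = 0
        self.holder: Optional[str] = None

    def grant(self, node: str) -> int:
        # Each grant supersedes the previous holder by raising the epoch.
        self._epoch += 1
        self.holder = node
        return self._epoch


class Actuator:
    """Commanded resource that accepts only the newest authority."""

    def __init__(self) -> None:
        self._highest_epoch = 0
        self.applied: List[str] = []

    def command(self, epoch: int, value: str) -> bool:
        # Reject commands from an authority that has been superseded.
        if epoch < self._highest_epoch:
            return False
        self._highest_epoch = epoch
        self.applied.append(value)
        return True
```

With this arrangement a former primary that wakes up after a partition can still send commands, but they are refused at the point of effect, which is where split-brain would otherwise do damage.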
Worked Example
A service has a failover requirement:
Measured transition components are:
Total failover time is:
The timing margin is:
The backup path has capacity:
and protected demand is:
so capacity margin is:
Now check state loss. Updates arrive at:
and replication lag is:
Records at risk are:
If the recovery-point objective is:
then the failover passes timing and capacity but fails state-loss tolerance.
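The three checks in this example can be run together as a short Python sketch. The numeric values of the example are not available in this text, so every figure below is hypothetical, chosen only to reproduce the stated outcome (timing and capacity pass, state loss fails). The variable names and units are likewise assumptions.

```python
# All values are hypothetical placeholders, not figures from the example.
rto_ms = 5000                      # assumed failover requirement
detect_ms, decide_ms, transfer_ms, recover_ms = 1500, 200, 800, 1500

total_ms = detect_ms + decide_ms + transfer_ms + recover_ms
timing_margin_ms = rto_ms - total_ms          # positive means timing passes

backup_capacity = 120              # assumed, arbitrary capacity units
protected_demand = 100             # assumed, same units
capacity_margin = backup_capacity - protected_demand

update_rate_per_s = 500            # assumed update arrival rate
replication_lag_ms = 400           # assumed replication lag
records_at_risk = update_rate_per_s * replication_lag_ms / 1000
rpo_records = 100                  # assumed tolerable record loss

timing_ok = timing_margin_ms >= 0
capacity_ok = capacity_margin >= 0
state_ok = records_at_risk <= rpo_records

print(f"total failover time: {total_ms} ms, margin {timing_margin_ms} ms")
print(f"capacity margin: {capacity_margin}")
print(f"records at risk: {records_at_risk:.0f} vs RPO {rpo_records}")
print(f"timing={timing_ok} capacity={capacity_ok} state={state_ok}")
```

A design is credited only when all three results are true; passing two of three, as here, still means the failover does not meet its requirement.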
Validation Evidence
Useful evidence includes fault-injection tests, load tests, packet captures, controller traces, event logs, route-switching records, replication-lag measurements, alarm behavior, operator procedure checks, failback tests and proof that the backup was not sharing the failed dependency.
The test should include realistic load and realistic fault timing. A clean manual switchover during maintenance is not the same as automatic failover during overload, network partition, sensor disagreement or startup.
Failback
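Failback is the return from backup to primary. It needs its own rule:
$$T_{\mathrm{healthy}} \ge T_{\mathrm{hold}}$$
where $T_{\mathrm{healthy}}$ is the time the primary has been continuously healthy and $T_{\mathrm{hold}}$ is the required hold time, so the primary must remain healthy long enough before authority returns. Without hysteresis, systems can oscillate between paths and create more disruption than the original fault.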
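The stability rule for failback can be sketched as a hold timer that restarts whenever the primary is seen unhealthy. This is a minimal illustration; the class name `FailbackGuard` and the convention of passing the current time in seconds are assumptions.

```python
from typing import Optional


class FailbackGuard:
    """Permits failback only after the primary is continuously healthy."""

    def __init__(self, hold_s: float) -> None:
        self.hold_s = hold_s
        self._healthy_since: Optional[float] = None

    def observe(self, now_s: float, primary_healthy: bool) -> bool:
        """Record one health observation; return True if failback is allowed."""
        if not primary_healthy:
            # Any unhealthy observation restarts the hold timer.
            self._healthy_since = None
            return False
        if self._healthy_since is None:
            self._healthy_since = now_s
        return now_s - self._healthy_since >= self.hold_s
```

Restarting the timer on every unhealthy observation is what provides the hysteresis: a primary that flaps never accumulates enough healthy time to take authority back.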
Common Mistakes
Do not validate failover only in the lab with no load. Do not ignore state transfer, stale data, clock behavior, operator authority or backup capacity. Do not call a reduced service full recovery. Do not allow automatic failback without a stability rule. Do not assume protection switching in a telecom path solves application state, database consistency or control authority.
A good failover design states the trigger, backup, authority rule, state boundary, timing objective, capacity objective, failback condition, degraded-mode behavior and evidence required before it is credited.