Glossary term

Failover

Engineering definition of failover covering detection, switchover time, backup capacity, split-brain prevention, RTO, RPO and validation.

Definition

Failover is the controlled transfer of a required function from a failed or degraded primary path, component or service to a backup path, component or service.

Failover is used in distributed systems, control systems, telecommunications, protection architectures, power systems and operational resilience. It includes failure detection, decision logic, authority transfer, state transfer, backup-capacity check, service recovery, alarm behavior, failback rules and validation under realistic load. A backup only improves resilience if it is sufficiently independent of the primary, adequately sized and tested in the operating mode being claimed.

Failover is not the same as simply having a spare. The fault must be detected, and the backup must be selected, authorized, loaded and verified fast enough for the requirement.

Failover appears in distributed services, PLC and controller architectures, telecom transport, route diversity, backup power, protection relays, data acquisition, timing systems and safety-related equipment. The engineering question is what users, operators, controllers or downstream systems observe during the transition.

Failover Time

A useful failover-time model is:

t_{fo}=t_{detect}+t_{decide}+t_{transfer}+t_{recover}

where t_{detect} is the time to detect the fault, t_{decide} is the time for decision logic to choose the backup, t_{transfer} is the time to change authority or routing, and t_{recover} is the time until service metrics are stable.

The recovery-time objective is met only if:

t_{fo}\leq RTO

An alarm timestamp is not enough evidence. The measured value should include the interval in which traffic, control output, timing, data consistency or operator authority is actually affected.
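As a minimal sketch of measuring the affected interval from observed service behavior rather than from alarm timestamps, the function below scans time-stamped health samples. The function name `affected_interval`, the sample format and the `stable_count` rule for "stable" are assumptions made for illustration, not part of this entry.

```python
from typing import Optional


def affected_interval(
    samples: list[tuple[float, bool]], stable_count: int = 3
) -> Optional[float]:
    """Seconds from the first unhealthy sample to the start of the first run of
    `stable_count` consecutive healthy samples that follows it.

    Each sample is (timestamp_s, service_ok). Returns None if no fault is seen
    or if the service never becomes stable again within the samples.
    """
    fault_start: Optional[float] = None
    run_start = 0.0
    run_length = 0
    for timestamp, service_ok in samples:
        if fault_start is None:
            if not service_ok:
                fault_start = timestamp  # first moment the service is affected
            continue
        if service_ok:
            if run_length == 0:
                run_start = timestamp  # candidate start of stable service
            run_length += 1
            if run_length >= stable_count:
                return run_start - fault_start
        else:
            run_length = 0  # a relapse resets the stability run
    return None


# Hypothetical probe results: a fault at 1.0 s, a brief false recovery at 3.0 s.
probe = [
    (0.0, True), (1.0, False), (2.0, False), (3.0, True),
    (4.0, False), (5.0, True), (6.0, True), (7.0, True),
]
measured = affected_interval(probe)
```

The brief recovery at 3.0 s is not credited because the service relapses, so the measured interval runs until stable service actually begins.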

Backup Capacity

Failover can preserve reachability while still failing the service. The backup must have enough capacity for the protected demand:

C_{backup}\geq C_{required}

If the backup is intentionally degraded, the reduced requirement should be explicit:

C_{backup}\geq C_{degraded}

This distinction matters in packet networks, standby pumps, backup controllers, emergency ventilation, cloud services and operational support teams. A backup path that carries alarms but not production load is a degraded-mode strategy, not full failover.

State and Data Loss

For stateful systems, failover also has a recovery-point objective. If updates arrive at rate:

r_u

and replication lag is:

t_{lag}

then the records at risk are:

N_{risk}=r_u\,t_{lag}

The recovery-point objective, expressed here as a maximum number of lost records, is met when:

N_{risk}\leq RPO

The same idea applies outside databases: queued commands, unsent telemetry, unacknowledged protection events, unlatched alarms and historian gaps can all become hidden state loss.

Split-Brain and Authority

Failover must prevent two active authorities from commanding the same function unless the design is explicitly multi-master. In software this can create split-brain writes. In control systems it can create competing commands. In power or telecom systems it can create unstable switching or route flapping.

A simple authority rule is:

N_{active}=1

for single-master control. If more than one active path is allowed, the arbitration, merge rule and conflict resolution must be tested.
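One common way to enforce a single active authority at the point of action is a fencing token: each newly promoted authority receives a higher epoch number, and the controlled resource rejects commands carrying an older epoch. The sketch below illustrates that technique under stated assumptions. The class `FencedActuator` and its method names are hypothetical, and the entry itself does not prescribe this mechanism.

```python
class FencedActuator:
    """Resource that accepts a command only from the newest authority seen.

    The epoch is assumed to be issued by whatever mechanism grants authority
    and to increase by at least one on every failover.
    """

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.applied: list[str] = []

    def command(self, epoch: int, action: str) -> bool:
        """Apply `action` unless it comes from a superseded authority."""
        if epoch < self.highest_epoch:
            return False  # stale authority: reject instead of competing
        self.highest_epoch = epoch
        self.applied.append(action)
        return True


actuator = FencedActuator()
first = actuator.command(epoch=1, action="open")    # original primary
second = actuator.command(epoch=2, action="close")  # backup after failover
late = actuator.command(epoch=1, action="open")     # old primary, still running
```

The old primary can remain convinced it is active, but its late command is refused, so the function has only one effective authority.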

Worked Example

A service has a failover requirement:

RTO=5.0\ \text{s}

Measured transition components are:

t_{detect}=0.8\ \text{s},\quad t_{decide}=0.2\ \text{s}
t_{transfer}=1.4\ \text{s},\quad t_{recover}=2.1\ \text{s}

Total failover time is:

t_{fo}=0.8+0.2+1.4+2.1=4.5\ \text{s}

The timing margin is:

M_{RTO}=5.0-4.5=0.5\ \text{s}

The backup path has capacity:

C_{backup}=750\ \text{Mbit/s}

and protected demand is:

C_{required}=620\ \text{Mbit/s}

so capacity margin is:

M_C=750-620=130\ \text{Mbit/s}

Now check state loss. Updates arrive at:

r_u=250\ \text{records/s}

and replication lag is:

t_{lag}=0.6\ \text{s}

Records at risk are:

N_{risk}=250(0.6)=150\ \text{records}

If the recovery-point objective is:

RPO=100\ \text{records}

then N_{risk}=150 exceeds RPO=100, so the failover passes timing and capacity but fails state-loss tolerance.
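The three checks in this example can be sketched as one function. The name `assess_failover` and the shape of the returned dictionary are assumptions made for illustration; the input values are the ones given above.

```python
def assess_failover(phase_times_s, rto_s, c_backup, c_required,
                    update_rate, lag_s, rpo_records):
    """Evaluate the timing, capacity and state-loss checks for one test."""
    t_fo = sum(phase_times_s)     # t_detect + t_decide + t_transfer + t_recover
    n_risk = update_rate * lag_s  # records not yet replicated at the fault
    return {
        "t_fo_s": t_fo,
        "timing_ok": t_fo <= rto_s,
        "capacity_margin": c_backup - c_required,
        "capacity_ok": c_backup >= c_required,
        "n_risk": n_risk,
        "state_ok": n_risk <= rpo_records,
    }


result = assess_failover(
    phase_times_s=[0.8, 0.2, 1.4, 2.1], rto_s=5.0,
    c_backup=750, c_required=620,        # Mbit/s
    update_rate=250, lag_s=0.6, rpo_records=100,
)
# Credit the failover only if every check passes.
result["credited"] = all(
    result[key] for key in ("timing_ok", "capacity_ok", "state_ok")
)
for key, value in result.items():
    print(f"{key}: {value}")
```

With these inputs the timing and capacity checks pass and the state-loss check fails, so the failover is not credited as a whole.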

Validation Evidence

Useful evidence includes fault-injection tests, load tests, packet captures, controller traces, event logs, route-switching records, replication-lag measurements, alarm behavior, operator procedure checks, failback tests and proof that the backup was not sharing the failed dependency.

The test should include realistic load and realistic fault timing. A clean manual switchover during maintenance is not the same as automatic failover during overload, network partition, sensor disagreement or startup.

Failback

Failback is the return from backup to primary. It needs its own rule:

t_{stable}\geq t_{hold}

where t_{stable} is how long the primary has remained continuously healthy and t_{hold} is the minimum hold time required before authority returns. Without this hysteresis, systems can oscillate between paths and create more disruption than the original fault.
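A minimal sketch of the hold-time rule follows, assuming the caller supplies a monotonic timestamp and a health flag on each evaluation. The class name `FailbackGuard` and its interface are hypothetical.

```python
from typing import Optional


class FailbackGuard:
    """Allows failback only after the primary stays healthy for `hold_s`."""

    def __init__(self, hold_s: float) -> None:
        self.hold_s = hold_s
        self.healthy_since: Optional[float] = None

    def update(self, now_s: float, primary_healthy: bool) -> bool:
        """Record one health observation; return True if failback is allowed."""
        if not primary_healthy:
            self.healthy_since = None  # any relapse restarts the hold timer
            return False
        if self.healthy_since is None:
            self.healthy_since = now_s
        # t_stable >= t_hold
        return now_s - self.healthy_since >= self.hold_s


guard = FailbackGuard(hold_s=30.0)
decisions = [
    guard.update(0.0, True),    # healthy, timer starts
    guard.update(20.0, True),   # only 20 s stable
    guard.update(25.0, False),  # relapse, timer cleared
    guard.update(30.0, True),   # timer restarts
    guard.update(60.0, True),   # 30 s stable since restart
]
```

The relapse at 25 s clears the timer, so authority returns only after a full uninterrupted hold period.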

Common Mistakes

Do not validate failover only in the lab with no load. Do not ignore state transfer, stale data, clock behavior, operator authority or backup capacity. Do not call a reduced service full recovery. Do not allow automatic failback without a stability rule. Do not assume protection switching in a telecom path solves application state, database consistency or control authority.

A good failover design states the trigger, backup, authority rule, state boundary, timing objective, capacity objective, failback condition, degraded-mode behavior and evidence required before it is credited.
