Glossary term

Software Circuit Breaker

Engineering definition of software circuit breaker covering open, half-open and closed states, retry containment, cooldown and validation.

Definition

A software circuit breaker is a resilience mechanism that temporarily stops or limits calls to a failing dependency so the caller and dependency do not amplify the failure.

Software circuit breakers are used in distributed services, APIs, edge gateways, control platforms and operational systems to bound retry storms and dependency overload. They usually move between closed, open and half-open states. Signals such as failure ratio, timeout rate, latency and rejection rate trip the breaker, and a cooldown followed by recovery probes decides when it closes again. A software circuit breaker is distinct from an electrical circuit breaker even though both interrupt harmful flow.

The usual states are closed, open and half-open. Closed means calls are allowed. Open means most calls are blocked, rejected, failed fast or routed to a degraded response. Half-open means a small number of probe calls are allowed to test whether the dependency has recovered.
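The three states can be sketched as a small state machine. This is a minimal illustration, not a reference implementation: the class name, the `max_failures` and `cooldown_s` parameters and the injected `clock` are all assumptions, and it trips on a consecutive-failure count for brevity rather than on a failure ratio. It also allows only one probe at a time while half-open.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Minimal three-state breaker; names and thresholds are illustrative."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0           # consecutive failures seen while closed
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self):
        """Return True if a call may be sent to the dependency."""
        if (self.state is State.OPEN
                and self.clock() - self.opened_at >= self.cooldown_s):
            # Cooldown elapsed: start probing.
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        if self.state is State.CLOSED:
            return True
        if self.state is State.HALF_OPEN and not self.probe_in_flight:
            self.probe_in_flight = True   # let exactly one probe through
            return True
        return False                      # open, or a probe is already running

    def record(self, success):
        """Report the outcome of a call that allow() permitted."""
        if success:
            self.state = State.CLOSED
            self.failures = 0
            return
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.max_failures:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would check `allow()` before each dependency call and report the outcome through `record()`. When `allow()` returns False the caller fails fast or serves a degraded response.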

Trip Condition

For a monitoring window with:

N_{window}

calls and:

N_{fail}

failed or timed-out calls, the observed failure ratio is:

\displaystyle p_f=\frac{N_{fail}}{N_{window}}

A simple trip rule is:

N_{window}\geq N_{min}\quad \text{and}\quad p_f\geq p_{trip}

The minimum-window condition prevents one or two early failures from opening the breaker when the window holds too few calls for the failure ratio to be meaningful.
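The trip rule translates directly into code. The function and parameter names below are hypothetical and simply mirror the symbols above. An empty window is treated as "do not trip", which is an assumption added here to avoid dividing by zero.

```python
def should_trip(n_window, n_fail, n_min, p_trip):
    """Return True when the window is large enough and the failure ratio
    has reached the trip threshold."""
    if n_window <= 0 or n_window < n_min:
        return False              # too few samples to judge
    p_f = n_fail / n_window       # observed failure ratio
    return p_f >= p_trip
```

For instance, `should_trip(2, 2, 100, 0.30)` is False even though every call failed, because two calls do not meet the minimum window.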

Retry Containment

Retries can multiply load during partial failure. If an original request rate is:

\lambda_0

and each request can retry up to twice, with each attempt assumed to fail independently with probability:

p_f

then a simple expected-attempt multiplier is:

E[a]=1+p_f+p_f^2

Effective dependency arrival rate is:

\lambda_{eff}=\lambda_0E[a]

Opening the circuit breaker reduces dependency pressure by failing fast or using a degraded response instead of continuing to send retries into an overloaded dependency.
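The multiplier generalizes to any retry budget. The sketch below assumes, as a simplification, that every attempt fails independently with the same probability and that a retry is sent only after a failure. The function names are illustrative. With `max_retries` retries the multiplier is the finite geometric sum of `p_fail**k` for k from 0 to `max_retries`.

```python
def expected_attempts(p_fail, max_retries):
    """Expected attempts per request: sum of p_fail**k for k = 0..max_retries."""
    return sum(p_fail ** k for k in range(max_retries + 1))


def effective_rate(base_rate, p_fail, max_retries):
    """Arrival rate seen by the dependency once retries are included."""
    return base_rate * expected_attempts(p_fail, max_retries)
```

With `p_fail = 1.0` the multiplier reaches its worst case of `max_retries + 1`, which is why unbounded retry budgets are dangerous during a full outage.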

The breaker should be coordinated with retry policy. If callers retry the fast failure immediately, the dependency may be protected but the caller tier can still saturate its own workers, queues or network sockets. The release design should state whether rejected calls are retried, cached, queued, shed or surfaced to the user.

Half-Open Recovery

After cooldown:

t_{open}\geq t_{cooldown}

the breaker may allow:

N_{probe}

probe calls. If:

\displaystyle \frac{N_{success}}{N_{probe}}\geq p_{recover}

the breaker can close. If probes fail, it should reopen and extend or repeat cooldown according to the release policy.
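The probe decision can be sketched as follows. The doubling of the cooldown and the 300-second cap are assumptions chosen for illustration; the text above leaves the extension policy open, and all names are hypothetical.

```python
def evaluate_probes(n_success, n_probe, p_recover):
    """Return "closed" if the probe success ratio meets p_recover, else "open"."""
    if n_probe <= 0:
        return "open"             # no evidence of recovery yet
    return "closed" if n_success / n_probe >= p_recover else "open"


def next_cooldown(current_s, factor=2.0, max_s=300.0):
    """Extend the cooldown after failed probes (assumed policy: capped doubling)."""
    return min(current_s * factor, max_s)
```

A policy that repeats the same cooldown instead of extending it corresponds to `factor=1.0`.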

Worked Example

A service observes:

N_{window}=200

dependency calls, with:

N_{fail}=90

The failure ratio is:

\displaystyle p_f=\frac{90}{200}=0.45

The trip threshold is:

p_{trip}=0.30

and:

N_{min}=100

Since:

200\geq100

and:

0.45\geq0.30

the breaker should open.

Original request rate is:

\lambda_0=650\ \text{requests/s}

With two retries and:

p_f=0.45

the expected-attempt multiplier is:

E[a]=1+0.45+0.45^2=1.6525

so dependency attempt rate would be:

\lambda_{eff}=650(1.6525)=1074.1\ \text{requests/s}

If dependency capacity is:

C_{dep}=900\ \text{requests/s}

the overload is:

1074.1-900=174.1\ \text{requests/s}

Now open the breaker and allow:

N_{probe}=20

half-open probes every:

t_{probe}=30\ \text{s}

Probe load is:

\displaystyle \lambda_{probe}=\frac{20}{30}=0.667\ \text{requests/s}

At 0.667 requests/s the probe traffic is under 0.1% of the 900 requests/s dependency capacity, so it protects the dependency while still collecting recovery evidence.
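The arithmetic in this example can be checked with a short script. The variable names are illustrative and mirror the symbols used above.

```python
# Trip decision
n_window, n_fail = 200, 90
n_min, p_trip = 100, 0.30
p_f = n_fail / n_window
trips = n_window >= n_min and p_f >= p_trip

# Retry amplification with two retries
lambda_0 = 650.0                      # original requests/s
attempts = 1 + p_f + p_f ** 2         # expected attempts per request
lambda_eff = lambda_0 * attempts      # requests/s reaching the dependency
c_dep = 900.0                         # dependency capacity, requests/s
overload = lambda_eff - c_dep

# Half-open probe load
n_probe, t_probe = 20, 30.0
lambda_probe = n_probe / t_probe

print(f"p_f={p_f:.2f} trips={trips}")
print(f"E[a]={attempts:.4f} lambda_eff={lambda_eff:.1f} overload={overload:.1f}")
print(f"lambda_probe={lambda_probe:.3f} requests/s")
```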

Boundary With Load Shedding

Load shedding rejects or drops work so that the system stays within its own capacity. A software circuit breaker is more specific: it rejects or limits work because a named dependency is unhealthy. The two mechanisms often work together. The breaker can stop calls to the failing dependency, while load shedding protects the caller from excessive queued work after fast failures.

The distinction matters for telemetry. A spike in breaker-open rejections means dependency protection is active. A spike in load-shed requests means the caller itself is protecting capacity. Both should be visible in dashboards and post-incident analysis.

Degraded Response

Opening a software circuit breaker is not automatically a user-visible outage. The caller may return cached data, queue noncritical work, reject unsafe commands, serve read-only mode, shed low-priority traffic, use a backup dependency or enter degraded mode.

The degraded response must be honest. Returning stale data without marking it as stale, accepting commands without authorization or hiding a failed write can be worse than failing fast.
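One way to keep a cached fallback honest is to label it. The sketch below is an assumption-laden illustration: `Response`, `DependencyUnavailable` and `read_with_fallback` are hypothetical names, the cache is a plain dictionary, and the breaker state is passed in as a boolean. It serves cached data only for reads, marks it as stale, and fails fast when nothing safe is available.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class DependencyUnavailable(Exception):
    """Raised when the breaker is open and no safe fallback exists."""


@dataclass
class Response:
    value: Any
    stale: bool  # True when served from cache instead of the dependency


def read_with_fallback(key: str, breaker_open: bool,
                       fetch: Callable[[str], Any],
                       cache: Dict[str, Any]) -> Response:
    """Read through the dependency, or serve a clearly marked cached value."""
    if not breaker_open:
        value = fetch(key)
        cache[key] = value            # refresh the fallback copy
        return Response(value=value, stale=False)
    if key in cache:
        return Response(value=cache[key], stale=True)
    raise DependencyUnavailable(key)  # fail fast rather than guess
```

The `stale` flag lets the caller or the user interface show that the data may be out of date instead of presenting it as fresh.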

Validation Evidence

Useful evidence includes fault injection, threshold tests, timeout logs, retry counters, queue depth, open/half-open/closed state transitions, probe results, degraded-response tests, client behavior, alert behavior and recovery timing.

The breaker should be tested under load. A breaker that opens correctly in a unit test may still fail if clients retry the fast failure aggressively or if every instance probes at the same instant.
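A common way to keep instances from probing at the same instant is to add random jitter to each instance's cooldown. The sketch below assumes a uniform jitter of plus or minus 20 percent. The function name, the default fraction and the injectable `rng` are illustrative choices, not values taken from the text above.

```python
import random


def jittered_cooldown(base_s, jitter_fraction=0.2, rng=random):
    """Scale the base cooldown by a random factor in [1 - j, 1 + j]."""
    if not 0.0 <= jitter_fraction < 1.0:
        raise ValueError("jitter_fraction must be in [0, 1)")
    return base_s * (1.0 + rng.uniform(-jitter_fraction, jitter_fraction))
```

With a 30-second base, each instance waits between 24 and 36 seconds, so probes from a large fleet spread over a 12-second interval instead of arriving together.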

Common Mistakes

Do not use a software circuit breaker as a substitute for capacity planning. Do not set thresholds without a minimum sample size. Do not let half-open probes synchronize across thousands of clients. Do not return unsafe stale data. Do not confuse software dependency protection with an electrical circuit breaker.

A good breaker design states the protected dependency, failure definition, minimum window, trip threshold, cooldown, probe policy, degraded response, client retry behavior and validation evidence.
