Formula sheet

Operating Systems, Concurrency, and Distributed Systems Formula Sheet

OS and distributed-system formulas for thread capacity, locks, queues, retries, quorums, consensus, logical clocks, isolation, leases, rollout, and validation.

This formula sheet collects first-pass calculations used in operating-system, concurrency, and distributed-system engineering. Use it to screen thread-pool capacity, lock contention, queue stability, retry amplification, quorum consistency, replicated availability, fan-out tail latency, clock and lease margins, rollout exposure, and validation evidence.

The formulas do not replace profiling, tracing, failure injection, scheduler documentation, production telemetry, or formal protocol analysis. They are useful when the workload boundary, resource limit, failure mode, percentile target, and validation evidence are explicit.

Notation

SymbolMeaningTypical unit
\lambdaarrival raterequests/s, jobs/s
\muservice rate of one server or workerjobs/s
Smean service times
cnumber of equivalent workers, threads, replicas, or serverscount
\rhoutilizationdimensionless
Laverage work in systemjobs
Waverage time in systems
Qqueue lengthjobs
Ddeadline, timeout, or latency targets
Aavailabilitydimensionless
Nreplica count or quorum populationcount
R,W_qread and write quorum sizecount
pprobability of success or failure, as stateddimensionless
Clogical clock value, commit index, or count as statedcount
Vvector clockvector of counters
ilog index, retry index, or sequence indexcount
done-way network delay bound or estimates

State whether a metric is mean, percentile, maximum observed, engineered worst case, or contractual service objective. Averages are rarely enough for concurrency and distributed-system release decisions.

Worker capacity and utilization

For one worker with mean service time S:

\displaystyle \mu=\frac{1}{S}

For c equivalent workers:

\displaystyle \mu_{total}=c\mu=\frac{c}{S}

Utilization:

\displaystyle \rho=\frac{\lambda}{c\mu}=\frac{\lambda S}{c}

A stable first-pass condition is:

\rho<1

Engineering systems normally need a stricter operating target:

\rho\leq\rho_{target}

where \rho_{target} may be chosen from latency, burst, failure, or maintenance requirements. A thread pool that runs near \rho=1 has little margin for garbage collection, interrupts, lock contention, dependency slowdown, or retries.

Little’s Law

For a stable system:

L=\lambda W

Queueing consistency check:

\displaystyle W=\frac{L}{\lambda}

If an average queue depth is L=240 requests at throughput \lambda=120\ \text{requests/s}, the average time represented by that queue is:

\displaystyle W=\frac{240}{120}=2\ \text{s}

This is often more useful than a CPU graph. A growing queue is stored latency, not free buffering.

Time to fill a bounded queue

If arrivals exceed service rate for a burst:

\lambda>\mu_{total}

net queue growth is:

g=\lambda-\mu_{total}

Time to fill a remaining buffer of B_{free} jobs:

\displaystyle t_{fill}=\frac{B_{free}}{g}

If t_{fill} is shorter than detection, autoscaling, failover, or operator response time, the queue limit is not protective enough.

Context-switch overhead

CPU time consumed by context switches:

T_{sw}=f_{sw}t_{sw}

where f_{sw} is switches per second and t_{sw} is mean switch cost.

Fraction of one core:

U_{sw}=f_{sw}t_{sw}

For C cores, rough system fraction:

\displaystyle U_{sw,total}=\frac{f_{sw}t_{sw}}{C}

Direct switch time may be small while indirect cost is large. Cache disruption, TLB misses, scheduler locks, interrupt load, and lost locality can dominate tail latency.

Critical-section saturation

If each request enters a critical section m times and each entry holds the lock for t_{cs}:

\lambda_{cs}=m\lambda

Lock service rate:

\displaystyle \mu_{lock}=\frac{1}{t_{cs}}

Lock utilization:

\rho_{lock}=\lambda_{cs}t_{cs}=m\lambda t_{cs}

A necessary stability condition is:

\rho_{lock}<1

When \rho_{lock} approaches one, adding worker threads can make performance worse by increasing contention. Reduce critical-section duration, shard the state, batch updates, use per-core counters, change ownership to a queue, or redesign the shared invariant.

Parallel speedup with serial fraction

Amdahl-style speedup:

\displaystyle S_N=\frac{1}{f_s+\frac{1-f_s}{N}+o_N}

where f_s is serial fraction, N is parallel workers, and o_N is coordination overhead.

Maximum speedup as N becomes large:

\displaystyle S_{max}\approx\frac{1}{f_s}

If 8\% of the path is serialized by a lock, the best possible speedup is about:

\displaystyle S_{max}\approx\frac{1}{0.08}=12.5

This limit appears even before cache, memory bandwidth, scheduler, and network overhead are added.

Deadlock wait-for condition

A deadlock exists when the wait-for graph contains a directed cycle:

T_1\rightarrow T_2\rightarrow \cdots \rightarrow T_k\rightarrow T_1

A strict resource ordering prevents this class of deadlock when every task acquires resources only in increasing order:

L_1<L_2<\cdots<L_n

Timeouts can help recovery, but they do not prove the absence of deadlock. The release argument should identify lock ordering, resource ownership, cancellation behavior, and recovery state.

Timeout budget

End-to-end timeout budget:

T_{end}=T_{client}+T_{network}+T_{queue}+T_{service}+T_{dependency}+T_{margin}

For a call chain:

T_{chain}=\sum_i T_i

A service timeout should be shorter than the caller’s remaining deadline:

T_{service,timeout}<T_{caller,remaining}

If lower layers time out after upper layers have already abandoned the request, work continues uselessly and can amplify overload.

Retry amplification

If an initial request can retry up to r times and each attempt fails with probability p_f independently, expected attempts are:

\displaystyle E[a]=\sum_{i=0}^{r}p_f^i=\frac{1-p_f^{r+1}}{1-p_f}

for p_f\neq1.

Effective arrival rate:

\lambda_{eff}=\lambda_0E[a]

Worst-case attempt multiplier when every attempt is made:

A_{max}=r+1

Retries should be bounded, jittered, and tied to idempotency. A retry policy that is harmless under light load can cause a failure cascade when the dependency is already saturated.

Backoff timing

Exponential backoff delay for retry i:

b_i=\min(b_{max},b_0\alpha^i)

Total wait before the final retry:

B_r=\sum_{i=0}^{r-1} b_i

If B_r exceeds the caller deadline, the later retry cannot succeed within the service objective. If B_r is too small, clients may synchronize and create retry waves.

Quorum consistency

For N replicas, read quorum R, and write quorum W_q, read/write intersection is guaranteed when:

R+W_q>N

Write/write intersection is guaranteed when:

\displaystyle W_q>\frac{N}{2}

Majority quorum size:

\displaystyle Q_m=\left\lfloor\frac{N}{2}\right\rfloor+1

Faults tolerated by majority quorum:

\displaystyle f=\left\lfloor\frac{N-1}{2}\right\rfloor

These formulas screen quorum geometry. They do not prove protocol safety under clock errors, stale leaders, split brain, storage corruption, or incorrect membership changes.

Majority commit index

For a replicated log, let match_j be the highest log index known to be stored on replica j. A majority-committed index is the largest index i stored on at least Q_m replicas:

C_{commit}=\max\left\{i:\left|\{j:match_j\geq i\}\right|\geq Q_m\right\}

For a proposed entry at index i_e, acknowledgement is safe only if:

\left|\{j:match_j\geq i_e\}\right|\geq Q_m

Majority intersection:

Q_m+Q_m>N

For N=5:

Q_m=3,\qquad 3+3=6>5

The intersection property is what prevents a committed replicated-log entry from disappearing in a correct consensus algorithm. It does not by itself prove linearizability: election rules, term checks, log freshness, durable storage, client retry semantics, and read protocol still matter.

Logical clock update rules

For a Lamport clock, local event or send:

C'=C+1

On receive of a message with timestamp C_m:

C'=\max(C,C_m)+1

Lamport causality implication:

a\rightarrow b\Rightarrow C(a)<C(b)

The reverse is not guaranteed:

C(a)<C(b)\not\Rightarrow a\rightarrow b

For vector clocks with component k owned by process k, local event or send:

V'_k=V_k+1
V'_r=V_r\quad r\neq k

On receive of message vector V_m at process k:

V'_r=\max(V_r,V_{m,r})\quad \text{for all }r

then increment the receiver component:

V''_k=V'_k+1

Causal order between two vector timestamps:

V_a\leq V_b\ \text{for all components and}\ V_a\neq V_b

Concurrent events:

\neg(V_a\leq V_b)\ \text{and}\ \neg(V_b\leq V_a)

Logical clocks are evidence models. They do not replace durable ordering, transaction isolation, or consensus.

Transaction isolation screens

Let R_1,W_1 and R_2,W_2 be read and write sets for two concurrent transactions.

Write-write conflict:

W_1\cap W_2\neq\emptyset

Snapshot isolation normally rejects one transaction when this conflict is detected on the same versioned item.

Write-skew risk appears when the transactions read overlapping invariant state but write disjoint items:

R_1\cap R_2\neq\emptyset
W_1\cap W_2=\emptyset

and the invariant depends on the combination:

I(R_1,R_2,W_1,W_2)=\text{false after both commits}

Serialization graph cycle condition:

T_1\rightarrow T_2\rightarrow \cdots \rightarrow T_k\rightarrow T_1

If the dependency graph contains a cycle, the history is not conflict-serializable.

Approximate version retention for long snapshots:

V_{retained}\approx \lambda_w T_{snapshot}

where \lambda_w is the write rate of retained versions and T_{snapshot} is the maximum snapshot age. Long analytical reads can therefore become storage, garbage-collection, and vacuum pressure even when the transaction logic is correct.

Two-phase commit timing

For two-phase commit with parallel prepare requests, one-way network delay d, participant prepare flush time t_{p,i}, coordinator decision flush time t_c, and commit-message one-way delay d:

T_{prepare}=2d+\max_i(t_{p,i})

Decision durable at the coordinator:

T_{decision}=T_{prepare}+t_c

Earliest commit-message delivery:

T_{commit}=T_{decision}+d

If participants vote yes and enter prepared state, but no durable final decision is reachable, the transaction can block:

\text{prepared}=\text{true},\quad \text{decision reachable}=\text{false}

Two-phase commit is an atomicity protocol, not an availability proof. The blocking window should be compared with lock hold limits, version retention, user-visible timeout, recovery objective, and the availability of the coordinator’s decision record.

CRDT convergence checks

For a state-based conflict-free replicated data type with merge function M, convergence requires merge properties such as commutativity:

M(x,y)=M(y,x)

associativity:

M(M(x,y),z)=M(x,M(y,z))

and idempotence:

M(x,x)=x

If every update eventually reaches every non-failed replica and merge has these properties, replicas that have seen the same set of updates converge:

S_A=S_B\quad \text{after same delivered update set}

CRDT checks support eventual consistency, not arbitrary invariants. If the application needs linearizability, uniqueness, or cross-object constraints, a mergeable data type may need escrow, reservation, consensus, or a compensating workflow.

Replicated availability

Availability of a repairable component:

\displaystyle A=\frac{MTBF}{MTBF+MTTR}

Series availability:

A_{series}=\prod_i A_i

Parallel availability when any replica can serve:

A_{parallel}=1-\prod_i(1-A_i)

Probability that at least q of N identical independent replicas are available:

A_{\geq q}=\sum_{k=q}^{N}\binom{N}{k}A^k(1-A)^{N-k}

Independence is a strong assumption. Shared power, common software bugs, shared configuration, shared deployment, network partition, and correlated load can dominate real availability.

Fan-out tail latency

If a request fans out to n independent dependencies and completes only when all dependencies finish, then for dependency latency cumulative distribution F(t):

F_{max}(t)=F(t)^n

Tail probability:

P(T_{max}>t)=1-F(t)^n

If each dependency meets a latency target with probability p, all n dependencies meet it with probability:

p_{all}=p^n

Fan-out makes good single-service percentiles look worse at the aggregate boundary. Use hedging, caching, partial results, admission control, or dependency reduction only after checking side effects.

Error budget and burn rate

Allowed error fraction for an availability or success SLO:

e_{budget}=1-SLO

Allowed error count over a window with N requests:

E_{budget}=N(1-SLO)

Error-budget burn rate:

\displaystyle BR=\frac{e_{observed}}{e_{budget}}

where e_{observed} is the observed error fraction over the same basis.

If BR>1, the system is consuming error budget faster than allowed. Rollout, release, or capacity changes should have stop rules tied to burn rate, latency, saturation, and data integrity.

Lease and clock margin

If a lease duration is T_L, maximum clock uncertainty is \epsilon, network delay bound is d_n, and process pause bound is t_p, a simple safety margin is:

M_L=T_L-(2\epsilon+d_n+t_p)

The lease margin is acceptable only if:

M_L>0

This is a screen, not a consensus proof. Lease safety depends on monotonic clocks, clock synchronization assumptions, pause behavior, storage semantics, renewal rules, and what action is allowed while the lease is believed valid.

Rollout exposure

Canary exposure count:

N_c=f_cN

where f_c is canary fraction and N is total request or user exposure in the review window.

Expected failures at failure probability p_f:

E_f=N_cp_f

Observed failure fraction:

\displaystyle \hat{p}_f=\frac{x_f}{N_c}

Do not accept a canary only because no failures were observed when N_c is too small to exercise the risk. Define minimum exposure, representative traffic, stop metrics, rollback triggers, and data compatibility before rollout.

Validation guard margins

For an upper limit:

M=x_{limit}-x_{meas}-ku_x

For a lower limit:

M=x_{meas}-x_{min}-ku_x

where u_x is standard uncertainty and k is the selected guard factor.

For latency percentiles, state the sample size, percentile estimator, clock source, load condition, and whether coordinated omission is controlled. A green average latency does not validate a tail-latency SLO.

Worked screening example

A distributed service has:

QuantityValue
external arrival rate\lambda_0=180\ \text{requests/s}
worker countc=6
mean service time before retriesS=24\ \text{ms}
lock entries per requestm=2
lock hold timet_{cs}=0.8\ \text{ms}
retry limitr=2
dependency failure probability during incidentp_f=0.15
fan-out dependency countn=8
single dependency target success probabilityp=0.995
replica countN=3
quorum sizeq=2
replica availabilityA=0.995

Worker utilization before retries:

\displaystyle \rho=\frac{\lambda_0S}{c}=\frac{180(0.024)}{6}=0.72

The pool has usable headroom but is not lightly loaded.

Lock utilization:

\rho_{lock}=m\lambda_0t_{cs}=2(180)(0.0008)=0.288

The lock is not saturated at the base rate. Now include retries:

E[a]=1+0.15+0.15^2=1.1725

Effective arrival rate:

\lambda_{eff}=180(1.1725)=211.1\ \text{requests/s}

Worker utilization under retry load:

\displaystyle \rho_{eff}=\frac{211.1(0.024)}{6}=0.844

The retry policy consumes most of the remaining capacity. If the dependency slows down at the same time, the pool may cross into queue growth.

Fan-out target probability:

p_{all}=0.995^8=0.961

Even excellent single-dependency target probability gives only about 96.1\% all-dependency success at that boundary.

Quorum availability for at least two available replicas:

A_{\geq2}=\binom{3}{2}(0.995)^2(0.005)+(0.995)^3
A_{\geq2}=0.999925

This looks strong under independent failures, but it is not valid for a shared software bug, bad deployment, corrupted configuration, or region-wide network fault.

Engineering comment

The release risk is not a single formula. The service has acceptable base worker and lock utilization, but retries drive utilization close to saturation, fan-out weakens the aggregate target probability, and quorum availability assumes independence. The release decision should require retry backoff limits, dependency circuit breaking, tail-latency telemetry, quorum failure injection, rollback criteria, and validation under degraded dependency behavior.

Release checklist

Before accepting an OS, concurrency, or distributed-system calculation, confirm that:

  1. the workload boundary and arrival-rate basis are explicit;
  2. utilization includes retries, background work, interrupts, and dependency slowdown where relevant;
  3. queue limits are tied to time-to-fill and backpressure behavior;
  4. critical sections are bounded and lock ordering is enforced;
  5. timeout and retry budgets fit inside caller deadlines;
  6. quorum rules are checked against membership, failure, and stale-leader assumptions;
  7. logical-clock evidence is not confused with consensus or serializable ordering;
  8. isolation-level calculations include write skew, predicate invariants, and version retention;
  9. two-phase commit timing includes prepared-state blocking and decision-record recovery;
  10. CRDT or eventual-consistency designs state which invariants are merge-safe and which are not;
  11. availability calculations state correlated-failure limits;
  12. fan-out tail risk is evaluated at the service boundary;
  13. rollout exposure is large enough to test the claimed risk;
  14. validation uses percentiles, failure injection, telemetry, and rollback evidence, not only nominal tests.

Common mistakes

Common mistakes include treating CPU average as capacity, adding threads after a lock has saturated, setting retries without idempotency, using majority quorum formulas while ignoring membership changes, treating Lamport timestamps as proof of causality in both directions, assuming snapshot isolation is serializable, and multiplying independent availability numbers across components that share the same deployment or network.

Another frequent mistake is validating only the happy path. Distributed systems fail during partial dependency failure, clock drift, queue buildup, retry storms, version mismatch, prepared transaction recovery, merge conflicts, and leadership changes. The formulas are useful only when they drive explicit limits, tests, telemetry, and operating decisions.

REF

See also