Formula sheet
Operating Systems, Concurrency, and Distributed Systems Formula Sheet
OS and distributed-system formulas for thread capacity, locks, queues, retries, quorums, consensus, logical clocks, isolation, leases, rollout, and validation.
This formula sheet collects first-pass calculations used in operating-system, concurrency, and distributed-system engineering. Use it to screen thread-pool capacity, lock contention, queue stability, retry amplification, quorum consistency, replicated availability, fan-out tail latency, clock and lease margins, rollout exposure, and validation evidence.
The formulas do not replace profiling, tracing, failure injection, scheduler documentation, production telemetry, or formal protocol analysis. They are useful when the workload boundary, resource limit, failure mode, percentile target, and validation evidence are explicit.
Notation
| Symbol | Meaning | Typical unit |
|---|---|---|
| \lambda | arrival rate | requests/s, jobs/s |
| \mu | service rate of one server or worker | jobs/s |
| S | mean service time | s |
| c | number of equivalent workers, threads, replicas, or servers | count |
| \rho | utilization | dimensionless |
| L | average work in system | jobs |
| W | average time in system | s |
| Q | queue length | jobs |
| D | deadline, timeout, or latency target | s |
| A | availability | dimensionless |
| N | replica count or quorum population | count |
| R,W_q | read and write quorum size | count |
| p | probability of success or failure, as stated | dimensionless |
| C | logical clock value, commit index, or count as stated | count |
| V | vector clock | vector of counters |
| i | log index, retry index, or sequence index | count |
| d | one-way network delay bound or estimate | s |
State whether a metric is mean, percentile, maximum observed, engineered worst case, or contractual service objective. Averages are rarely enough for concurrency and distributed-system release decisions.
Worker capacity and utilization
For one worker with mean service time S:
For c equivalent workers:
Utilization:
A stable first-pass condition is:
Engineering systems normally need a stricter operating target:
where \rho_{target} may be chosen from latency, burst, failure, or maintenance requirements. A thread pool that runs near \rho=1 has little margin for garbage collection, interrupts, lock contention, dependency slowdown, or retries.
Little’s Law
For a stable system:
Queueing consistency check:
If an average queue depth is L=240 requests at throughput \lambda=120\ \text{requests/s}, the average time represented by that queue is:
This is often more useful than a CPU graph. A growing queue is stored latency, not free buffering.
Time to fill a bounded queue
If arrivals exceed service rate for a burst:
net queue growth is:
Time to fill a remaining buffer of B_{free} jobs:
If t_{fill} is shorter than detection, autoscaling, failover, or operator response time, the queue limit is not protective enough.
Context-switch overhead
CPU time consumed by context switches:
where f_{sw} is switches per second and t_{sw} is mean switch cost.
Fraction of one core:
For C cores, rough system fraction:
Direct switch time may be small while indirect cost is large. Cache disruption, TLB misses, scheduler locks, interrupt load, and lost locality can dominate tail latency.
Critical-section saturation
If each request enters a critical section m times and each entry holds the lock for t_{cs}:
Lock service rate:
Lock utilization:
A necessary stability condition is:
When \rho_{lock} approaches one, adding worker threads can make performance worse by increasing contention. Reduce critical-section duration, shard the state, batch updates, use per-core counters, change ownership to a queue, or redesign the shared invariant.
Parallel speedup with serial fraction
Amdahl-style speedup:
where f_s is serial fraction, N is parallel workers, and o_N is coordination overhead.
Maximum speedup as N becomes large:
If 8\% of the path is serialized by a lock, the best possible speedup is about:
This limit appears even before cache, memory bandwidth, scheduler, and network overhead are added.
Deadlock wait-for condition
A deadlock exists when the wait-for graph contains a directed cycle:
A strict resource ordering prevents this class of deadlock when every task acquires resources only in increasing order:
Timeouts can help recovery, but they do not prove the absence of deadlock. The release argument should identify lock ordering, resource ownership, cancellation behavior, and recovery state.
Timeout budget
For a call chain:
A service timeout should be shorter than the caller’s remaining deadline:
If lower layers time out after upper layers have already abandoned the request, work continues uselessly and can amplify overload.
Retry amplification
If an initial request can retry up to r times and each attempt fails with probability p_f independently, expected attempts are:
for p_f\neq1.
Effective arrival rate:
Worst-case attempt multiplier when every attempt is made:
Retries should be bounded, jittered, and tied to idempotency. A retry policy that is harmless under light load can cause a failure cascade when the dependency is already saturated.
Backoff timing
Exponential backoff delay for retry i:
Total wait before the final retry:
If B_r exceeds the caller deadline, the later retry cannot succeed within the service objective. If B_r is too small, clients may synchronize and create retry waves.
Quorum consistency
For N replicas, read quorum R, and write quorum W_q, read/write intersection is guaranteed when:
Write/write intersection is guaranteed when:
Majority quorum size:
Faults tolerated by majority quorum:
These formulas screen quorum geometry. They do not prove protocol safety under clock errors, stale leaders, split brain, storage corruption, or incorrect membership changes.
Majority commit index
For a replicated log, let match_j be the highest log index known to be stored on replica j. A majority-committed index is the largest index i stored on at least Q_m replicas:
For a proposed entry at index i_e, acknowledgement is safe only if:
Majority intersection:
For N=5:
The intersection property is what prevents a committed replicated-log entry from disappearing in a correct consensus algorithm. It does not by itself prove linearizability: election rules, term checks, log freshness, durable storage, client retry semantics, and read protocol still matter.
Logical clock update rules
For a Lamport clock, local event or send:
On receive of a message with timestamp C_m:
Lamport causality implication:
The reverse is not guaranteed:
For vector clocks with component k owned by process k, local event or send:
On receive of message vector V_m at process k:
then increment the receiver component:
Causal order between two vector timestamps:
Concurrent events:
Logical clocks are evidence models. They do not replace durable ordering, transaction isolation, or consensus.
Transaction isolation screens
Let R_1,W_1 and R_2,W_2 be read and write sets for two concurrent transactions.
Write-write conflict:
Snapshot isolation normally rejects one transaction when this conflict is detected on the same versioned item.
Write-skew risk appears when the transactions read overlapping invariant state but write disjoint items:
and the invariant depends on the combination:
Serialization graph cycle condition:
If the dependency graph contains a cycle, the history is not conflict-serializable.
Approximate version retention for long snapshots:
where \lambda_w is the write rate of retained versions and T_{snapshot} is the maximum snapshot age. Long analytical reads can therefore become storage, garbage-collection, and vacuum pressure even when the transaction logic is correct.
Two-phase commit timing
For two-phase commit with parallel prepare requests, one-way network delay d, participant prepare flush time t_{p,i}, coordinator decision flush time t_c, and commit-message one-way delay d:
Decision durable at the coordinator:
Earliest commit-message delivery:
If participants vote yes and enter prepared state, but no durable final decision is reachable, the transaction can block:
Two-phase commit is an atomicity protocol, not an availability proof. The blocking window should be compared with lock hold limits, version retention, user-visible timeout, recovery objective, and the availability of the coordinator’s decision record.
CRDT convergence checks
For a state-based conflict-free replicated data type with merge function M, convergence requires merge properties such as commutativity:
associativity:
and idempotence:
If every update eventually reaches every non-failed replica and merge has these properties, replicas that have seen the same set of updates converge:
CRDT checks support eventual consistency, not arbitrary invariants. If the application needs linearizability, uniqueness, or cross-object constraints, a mergeable data type may need escrow, reservation, consensus, or a compensating workflow.
Replicated availability
Availability of a repairable component:
Series availability:
Parallel availability when any replica can serve:
Probability that at least q of N identical independent replicas are available:
Independence is a strong assumption. Shared power, common software bugs, shared configuration, shared deployment, network partition, and correlated load can dominate real availability.
Fan-out tail latency
If a request fans out to n independent dependencies and completes only when all dependencies finish, then for dependency latency cumulative distribution F(t):
Tail probability:
If each dependency meets a latency target with probability p, all n dependencies meet it with probability:
Fan-out makes good single-service percentiles look worse at the aggregate boundary. Use hedging, caching, partial results, admission control, or dependency reduction only after checking side effects.
Error budget and burn rate
Allowed error fraction for an availability or success SLO:
Allowed error count over a window with N requests:
Error-budget burn rate:
where e_{observed} is the observed error fraction over the same basis.
If BR>1, the system is consuming error budget faster than allowed. Rollout, release, or capacity changes should have stop rules tied to burn rate, latency, saturation, and data integrity.
Lease and clock margin
If a lease duration is T_L, maximum clock uncertainty is \epsilon, network delay bound is d_n, and process pause bound is t_p, a simple safety margin is:
The lease margin is acceptable only if:
This is a screen, not a consensus proof. Lease safety depends on monotonic clocks, clock synchronization assumptions, pause behavior, storage semantics, renewal rules, and what action is allowed while the lease is believed valid.
Rollout exposure
Canary exposure count:
where f_c is canary fraction and N is total request or user exposure in the review window.
Expected failures at failure probability p_f:
Observed failure fraction:
Do not accept a canary only because no failures were observed when N_c is too small to exercise the risk. Define minimum exposure, representative traffic, stop metrics, rollback triggers, and data compatibility before rollout.
Validation guard margins
For an upper limit:
For a lower limit:
where u_x is standard uncertainty and k is the selected guard factor.
For latency percentiles, state the sample size, percentile estimator, clock source, load condition, and whether coordinated omission is controlled. A green average latency does not validate a tail-latency SLO.
Worked screening example
A distributed service has:
| Quantity | Value |
|---|---|
| external arrival rate | \lambda_0=180\ \text{requests/s} |
| worker count | c=6 |
| mean service time before retries | S=24\ \text{ms} |
| lock entries per request | m=2 |
| lock hold time | t_{cs}=0.8\ \text{ms} |
| retry limit | r=2 |
| dependency failure probability during incident | p_f=0.15 |
| fan-out dependency count | n=8 |
| single dependency target success probability | p=0.995 |
| replica count | N=3 |
| quorum size | q=2 |
| replica availability | A=0.995 |
Worker utilization before retries:
The pool has usable headroom but is not lightly loaded.
Lock utilization:
The lock is not saturated at the base rate. Now include retries:
Effective arrival rate:
Worker utilization under retry load:
The retry policy consumes most of the remaining capacity. If the dependency slows down at the same time, the pool may cross into queue growth.
Fan-out target probability:
Even excellent single-dependency target probability gives only about 96.1\% all-dependency success at that boundary.
Quorum availability for at least two available replicas:
This looks strong under independent failures, but it is not valid for a shared software bug, bad deployment, corrupted configuration, or region-wide network fault.
Engineering comment
The release risk is not a single formula. The service has acceptable base worker and lock utilization, but retries drive utilization close to saturation, fan-out weakens the aggregate target probability, and quorum availability assumes independence. The release decision should require retry backoff limits, dependency circuit breaking, tail-latency telemetry, quorum failure injection, rollback criteria, and validation under degraded dependency behavior.
Release checklist
Before accepting an OS, concurrency, or distributed-system calculation, confirm that:
- the workload boundary and arrival-rate basis are explicit;
- utilization includes retries, background work, interrupts, and dependency slowdown where relevant;
- queue limits are tied to time-to-fill and backpressure behavior;
- critical sections are bounded and lock ordering is enforced;
- timeout and retry budgets fit inside caller deadlines;
- quorum rules are checked against membership, failure, and stale-leader assumptions;
- logical-clock evidence is not confused with consensus or serializable ordering;
- isolation-level calculations include write skew, predicate invariants, and version retention;
- two-phase commit timing includes prepared-state blocking and decision-record recovery;
- CRDT or eventual-consistency designs state which invariants are merge-safe and which are not;
- availability calculations state correlated-failure limits;
- fan-out tail risk is evaluated at the service boundary;
- rollout exposure is large enough to test the claimed risk;
- validation uses percentiles, failure injection, telemetry, and rollback evidence, not only nominal tests.
Common mistakes
Common mistakes include treating CPU average as capacity, adding threads after a lock has saturated, setting retries without idempotency, using majority quorum formulas while ignoring membership changes, treating Lamport timestamps as proof of causality in both directions, assuming snapshot isolation is serializable, and multiplying independent availability numbers across components that share the same deployment or network.
Another frequent mistake is validating only the happy path. Distributed systems fail during partial dependency failure, clock drift, queue buildup, retry storms, version mismatch, prepared transaction recovery, merge conflicts, and leadership changes. The formulas are useful only when they drive explicit limits, tests, telemetry, and operating decisions.