Exercise set

Distributed Systems Consistency, Retry, and Reliability Exercises

Solved distributed-systems exercises for latency budgets, retries, quorums, consensus, failover, consistency models, transactions and release gates.

Branch: Computer Engineering
Content: Exercise set
Updated: Jul 03, 2026
Revision: v1.0.0 · reviewed

These exercises focus on distributed-system release calculations: latency budgets, fan-out, retries, circuit breakers, quorums, consensus, failover, consistency models, transactions and reliability gates. Operating-system scheduling and local concurrency are handled in separate specialist exercise sets.

Use the calculations as review aids. Distributed correctness requires failure injection, trace correlation, durable-log checks, partition tests, idempotency evidence, read-path validation, rollback plans and production observability.

Release Evidence Notes

Distributed evidence should show what happens when requests are delayed, duplicated, retried, reordered, partitioned or served by stale replicas. A normal load test is not enough for a system that claims strong consistency or safe failover.

Engineering Boundary Notes

These exercises assume crash faults and ordinary network delay unless stated otherwise. Byzantine behavior, malicious clients, clock synchronization failures, schema migrations and multi-region legal boundaries require separate analysis.

Common Release Mistakes

treating quorum size as a full consensus protocol;
retrying non-idempotent commands without a deduplication key;
validating failover with no concurrent writes;
proving write durability but not read-path freshness;
using two-phase commit where indefinite blocking is unacceptable;
ignoring common dependencies when multiplying availability.

Scenario Map

Scenario	Main calculation	Release decision
Latency	serial path, fan-out and timeout budgets	Tune, shed, cache or reject release.
Retries	amplification, backoff and idempotency	Set retry policy and circuit breaker.
Quorum and consensus	majority, intersection and fencing	Accept or block failover path.
Consistency	clocks, stale reads and isolation	Select read/write contract.
Reliability	availability and common dependencies	Remove shared failure or document risk.

Validation Package Checklist

service graph, timeout budget and retry budget;
idempotency key, deduplication store and replay test;
quorum membership, term, commit index and fencing evidence;
failover timeline under concurrent reads and writes;
consistency test for stale reads, write skew and causal order;
post-release alarms for retry storm, stale read, split brain and saturation.

Exercise 1: Distributed Request Latency Budget

A request path has API gateway $18\ \text{ms}$ , service A $35\ \text{ms}$ , service B $42\ \text{ms}$ and database $55\ \text{ms}$ . Network overhead adds $20\ \text{ms}$ . Target is $180\ \text{ms}$ . Does it pass?

Solution

t=18+35+42+55+20=170\ \text{ms}

Since $170<180\ \text{ms}$ , it passes with $10\ \text{ms}$ margin.

Engineering Comment

The margin is small. Tail latency, retries and queuing could erase it.

Plausibility Check

Five tens-of-milliseconds terms add to just under $180\ \text{ms}$ .

Exercise 2: Fan-Out Tail Latency

A service fans out to $8$ equal dependencies. Each dependency has probability $0.98$ of responding within $120\ \text{ms}$ . Estimate probability all dependencies respond within that time.

Solution

P_{all}=0.98^8=0.8508

Only about $85.1\%$ of aggregate requests meet that dependency threshold.

Engineering Comment

Fan-out makes ordinary per-service tail latency visible at the user boundary.

Plausibility Check

Eight chances to be late should reduce the all-pass probability noticeably.

Exercise 3: Retry Amplification

Original traffic is $900\ \text{requests/s}$ . A client retries failed requests twice. Failure fraction is $12\%$ . Estimate added retry traffic if every failed attempt is retried exactly twice.

Solution

\lambda_{retry}=900(0.12)(2)=216\ \text{requests/s}

Total attempted traffic:

\lambda_{total}=900+216=1116\ \text{requests/s}

Engineering Comment

This ignores retries that also fail. Real retry storms can amplify more severely.

Plausibility Check

About one tenth of traffic retried twice should add about one fifth.

Exercise 4: Jittered Backoff Peak Load

Without jitter, $5000$ clients retry in the same $1\ \text{s}$ window. Jitter spreads retries over $20\ \text{s}$ . Estimate average retry rate after jitter.

Solution

\lambda=\dfrac{5000}{20}=250\ \text{retries/s}

Engineering Comment

Jitter reduces synchronized bursts, but the service must still tolerate the average retry load.

Plausibility Check

Spreading over twenty seconds reduces the one-second spike by about $20\times$ .

Exercise 5: Timeout Budget Across Hops

End-to-end timeout is $800\ \text{ms}$ . Three downstream calls reserve $180$ , $250$ and $210\ \text{ms}$ . The caller needs $90\ \text{ms}$ for local processing. Compute remaining slack.

Solution

t_{used}=180+250+210+90=730\ \text{ms}

t_{slack}=800-730=70\ \text{ms}

Engineering Comment

Timeout budgets should be explicit; otherwise nested retries can outlive the client request.

Plausibility Check

The used budget is close to $800\ \text{ms}$ , leaving a small positive slack.

Exercise 6: Circuit-Breaker Open Threshold

A circuit breaker opens when the failure rate exceeds $35\%$ over a window of $200$ calls. The window has $78$ failures. Does it open?

Solution

f=\dfrac{78}{200}=0.39=39\%

Since $39\%>35\%$ , the breaker opens.

Engineering Comment

Breaker thresholds should be paired with half-open probes and user-visible fallback behavior.

Plausibility Check

Seventy-eight failures is more than one third of two hundred.

Exercise 7: Bulkhead Capacity

A service reserves $30$ workers for interactive traffic and $12$ for batch traffic. Batch receives a burst requiring $25$ workers. How many batch requests must wait or be shed if all workers are busy?

Solution

n_{excess}=25-12=13

Engineering Comment

Bulkheads protect interactive traffic by refusing to let batch consume the shared pool.

Plausibility Check

Batch demand is more than twice its reserved pool.

Exercise 8: Quorum Intersection

A replicated service has $N=5$ nodes and writes require quorum $Q=3$ . Show the minimum intersection between two write quorums.

Solution

For two quorums:

I_{min}=2Q-N=2(3)-5=1

Any two write quorums intersect in at least one node.

Engineering Comment

Intersection supports safety only when protocol rules also use term, log freshness and commit conditions correctly.

Plausibility Check

Two groups of three chosen from five must overlap.

Exercise 9: Majority Commit Under Partition

A five-node cluster is partitioned into groups of $3$ and $2$ . Majority commit requires $3$ acknowledgements. Which side can commit?

Solution

The group of $3$ can reach majority:

3\ge 3

The group of $2$ cannot:

2<3

Engineering Comment

The minority side must reject or redirect writes; otherwise split brain can occur.

Plausibility Check

Only the larger partition has enough nodes for majority.

Exercise 10: Fencing Token Check

Old leader has fencing token $41$ . New leader has token $42$ . Storage accepts writes only if the presented token is at least the stored token. Which leader can write after token $42$ is stored?

Solution

Old leader:

41<42

New leader:

42\ge 42

Only the new leader can write.

Engineering Comment

Leader election is not enough unless old leaders are fenced at every state-changing dependency.

Plausibility Check

The larger token represents newer authority.

Exercise 11: Replicated Availability With Shared Dependency

Two replicas each have availability $0.99$ , but both depend on one shared database with availability $0.98$ . Assuming replica independence and shared database dependency, estimate service availability.

Solution

Replica layer availability:

A_r=1-(1-0.99)^2=0.9999

Including shared database:

A=A_r(0.98)=0.9799

Engineering Comment

The shared dependency dominates. Replicating stateless nodes does not remove a single shared failure.

Plausibility Check

The result is close to database availability because the database is required.

Exercise 12: Leader-Election Failover Time

Failure detection takes $4.5\ \text{s}$ , leader election takes $1.2\ \text{s}$ and cache warm-up takes $3.0\ \text{s}$ . The service objective requires recovery under $10\ \text{s}$ . Does it pass?

Solution

t_{failover}=4.5+1.2+3.0=8.7\ \text{s}

Since $8.7<10\ \text{s}$ , it passes with $1.3\ \text{s}$ margin.

Engineering Comment

The margin should be tested under traffic and with durable-log replay included.

Plausibility Check

The three components sum to less than ten seconds.

Exercise 13: Vector-Clock Concurrency

Replica A observes vector clock $(3,2)$ and replica B observes $(2,4)$ . Are the events causally ordered?

Solution

For A to precede B, every component of A must be less than or equal to B:

3\le2\quad \text{is false}

For B to precede A:

4\le2\quad \text{is false}

The events are concurrent.

Engineering Comment

Concurrent updates require a merge rule, conflict resolution or rejection path.

Plausibility Check

Each vector is ahead in one component.

Exercise 14: Lamport Clock Ordering

Event X has Lamport timestamp $17$ and event Y has timestamp $23$ . Can the timestamps alone prove that X caused Y?

Solution

No. Lamport clocks provide an order compatible with causality:

X\rightarrow Y \Rightarrow L(X)<L(Y)

But $L(X)<L(Y)$ does not prove causation.

Engineering Comment

Use vector clocks, trace context or protocol evidence when causal dependency matters.

Plausibility Check

A smaller Lamport timestamp can occur without a direct causal path.

Exercise 15: Eventual-Consistency Stale Window

Replication lag $p99$ is $850\ \text{ms}$ . Product promise allows stale reads for at most $500\ \text{ms}$ . Does the read contract pass?

Solution

850\ \text{ms}>500\ \text{ms}

The stale-read contract fails at $p99$ .

Engineering Comment

Options include read-your-writes routing, stronger reads, lag hiding or a weaker product promise.

Plausibility Check

The measured lag tail is larger than the allowed stale window.

Exercise 16: Snapshot-Isolation Write Skew

Two doctors are on call. Two transactions each read that both are on call, then each turns one different doctor off call. Under snapshot isolation, what invariant can fail?

Solution

Each transaction sees the original state:

D_1=\text{on},\quad D_2=\text{on}

After both commit:

D_1=\text{off},\quad D_2=\text{off}

The invariant “at least one doctor is on call” fails.

Engineering Comment

Snapshot isolation is not serializability. Invariant-preserving transactions may need explicit locks, constraints or serializable isolation.

Plausibility Check

Each transaction changes a different row, so row conflict detection may not catch the global invariant violation.

Exercise 17: Two-Phase-Commit Blocking Window

A participant enters prepared state. Coordinator recovery takes $45\ \text{s}$ and participant lock timeout is $30\ \text{s}$ . Does the participant risk timing out before outcome is known?

Solution

45\ \text{s}>30\ \text{s}

Yes. The participant can time out before the coordinator recovers.

Engineering Comment

Two-phase commit can block. The design needs a recovery record, timeout semantics and business-safe compensation path.

Plausibility Check

Coordinator recovery is longer than the participant timeout.

Exercise 18: Distributed Release Gate

A release candidate has end-to-end latency $170\ \text{ms}$ against $180\ \text{ms}$ target, retry amplification to $1116\ \text{requests/s}$ against $1050\ \text{requests/s}$ safe capacity, quorum majority behavior correct, stale-read $p99$ $850\ \text{ms}$ against $500\ \text{ms}$ promise and two-phase-commit recovery $45\ \text{s}$ against $30\ \text{s}$ participant timeout. Decide release status.

Solution

Retry capacity fails:

1116>1050

Stale-read promise fails:

850>500\ \text{ms}

Transaction recovery timing fails:

45>30\ \text{s}

Release should be held.

Engineering Comment

Correct majority behavior does not compensate for retry overload, violated read promises or transaction blocking risk.

Plausibility Check

Three independent distributed-system gates fail, so the decision is unambiguous.

REF