Exercise set
Distributed Systems Consistency, Retry, and Reliability Exercises
Solved distributed-systems exercises for latency budgets, retries, quorums, consensus, failover, consistency models, transactions and release gates.
These exercises focus on distributed-system release calculations: latency budgets, fan-out, retries, circuit breakers, quorums, consensus, failover, consistency models, transactions and reliability gates. Operating-system scheduling and local concurrency are handled in separate specialist exercise sets.
Use the calculations as review aids. Distributed correctness requires failure injection, trace correlation, durable-log checks, partition tests, idempotency evidence, read-path validation, rollback plans and production observability.
Release Evidence Notes
Distributed evidence should show what happens when requests are delayed, duplicated, retried, reordered, partitioned or served by stale replicas. A normal load test is not enough for a system that claims strong consistency or safe failover.
Engineering Boundary Notes
These exercises assume crash faults and ordinary network delay unless stated otherwise. Byzantine behavior, malicious clients, clock synchronization failures, schema migrations and multi-region legal boundaries require separate analysis.
Common Release Mistakes
- treating quorum size as a full consensus protocol;
- retrying non-idempotent commands without a deduplication key;
- validating failover with no concurrent writes;
- proving write durability but not read-path freshness;
- using two-phase commit where indefinite blocking is unacceptable;
- ignoring common dependencies when multiplying availability.
Scenario Map
| Scenario | Main calculation | Release decision |
|---|---|---|
| Latency | serial path, fan-out and timeout budgets | Tune, shed, cache or reject release. |
| Retries | amplification, backoff and idempotency | Set retry policy and circuit breaker. |
| Quorum and consensus | majority, intersection and fencing | Accept or block failover path. |
| Consistency | clocks, stale reads and isolation | Select read/write contract. |
| Reliability | availability and common dependencies | Remove shared failure or document risk. |
Validation Package Checklist
- service graph, timeout budget and retry budget;
- idempotency key, deduplication store and replay test;
- quorum membership, term, commit index and fencing evidence;
- failover timeline under concurrent reads and writes;
- consistency test for stale reads, write skew and causal order;
- post-release alarms for retry storm, stale read, split brain and saturation.
Exercise 1: Distributed Request Latency Budget
A request path has API gateway 18\ \text{ms}, service A 35\ \text{ms}, service B 42\ \text{ms} and database 55\ \text{ms}. Network overhead adds 20\ \text{ms}. Target is 180\ \text{ms}. Does it pass?
Solution
Since 170<180\ \text{ms}, it passes with 10\ \text{ms} margin.
Engineering Comment
The margin is small. Tail latency, retries and queuing could erase it.
Plausibility Check
Five tens-of-milliseconds terms add to just under 180\ \text{ms}.
Exercise 2: Fan-Out Tail Latency
A service fans out to 8 equal dependencies. Each dependency has probability 0.98 of responding within 120\ \text{ms}. Estimate probability all dependencies respond within that time.
Solution
Only about 85.1\% of aggregate requests meet that dependency threshold.
Engineering Comment
Fan-out makes ordinary per-service tail latency visible at the user boundary.
Plausibility Check
Eight chances to be late should reduce the all-pass probability noticeably.
Exercise 3: Retry Amplification
Original traffic is 900\ \text{requests/s}. A client retries failed requests twice. Failure fraction is 12\%. Estimate added retry traffic if every failed attempt is retried exactly twice.
Solution
Total attempted traffic:
Engineering Comment
This ignores retries that also fail. Real retry storms can amplify more severely.
Plausibility Check
About one tenth of traffic retried twice should add about one fifth.
Exercise 4: Jittered Backoff Peak Load
Without jitter, 5000 clients retry in the same 1\ \text{s} window. Jitter spreads retries over 20\ \text{s}. Estimate average retry rate after jitter.
Solution
Engineering Comment
Jitter reduces synchronized bursts, but the service must still tolerate the average retry load.
Plausibility Check
Spreading over twenty seconds reduces the one-second spike by about 20\times.
Exercise 5: Timeout Budget Across Hops
End-to-end timeout is 800\ \text{ms}. Three downstream calls reserve 180, 250 and 210\ \text{ms}. The caller needs 90\ \text{ms} for local processing. Compute remaining slack.
Solution
Engineering Comment
Timeout budgets should be explicit; otherwise nested retries can outlive the client request.
Plausibility Check
The used budget is close to 800\ \text{ms}, leaving a small positive slack.
Exercise 6: Circuit-Breaker Open Threshold
A circuit breaker opens when the failure rate exceeds 35\% over a window of 200 calls. The window has 78 failures. Does it open?
Solution
Since 39\%>35\%, the breaker opens.
Engineering Comment
Breaker thresholds should be paired with half-open probes and user-visible fallback behavior.
Plausibility Check
Seventy-eight failures is more than one third of two hundred.
Exercise 7: Bulkhead Capacity
A service reserves 30 workers for interactive traffic and 12 for batch traffic. Batch receives a burst requiring 25 workers. How many batch requests must wait or be shed if all workers are busy?
Solution
Engineering Comment
Bulkheads protect interactive traffic by refusing to let batch consume the shared pool.
Plausibility Check
Batch demand is more than twice its reserved pool.
Exercise 8: Quorum Intersection
A replicated service has N=5 nodes and writes require quorum Q=3. Show the minimum intersection between two write quorums.
Solution
For two quorums:
Any two write quorums intersect in at least one node.
Engineering Comment
Intersection supports safety only when protocol rules also use term, log freshness and commit conditions correctly.
Plausibility Check
Two groups of three chosen from five must overlap.
Exercise 9: Majority Commit Under Partition
A five-node cluster is partitioned into groups of 3 and 2. Majority commit requires 3 acknowledgements. Which side can commit?
Solution
The group of 3 can reach majority:
The group of 2 cannot:
Engineering Comment
The minority side must reject or redirect writes; otherwise split brain can occur.
Plausibility Check
Only the larger partition has enough nodes for majority.
Exercise 10: Fencing Token Check
Old leader has fencing token 41. New leader has token 42. Storage accepts writes only if the presented token is at least the stored token. Which leader can write after token 42 is stored?
Solution
Old leader:
New leader:
Only the new leader can write.
Engineering Comment
Leader election is not enough unless old leaders are fenced at every state-changing dependency.
Plausibility Check
The larger token represents newer authority.
Exercise 11: Replicated Availability With Shared Dependency
Two replicas each have availability 0.99, but both depend on one shared database with availability 0.98. Assuming replica independence and shared database dependency, estimate service availability.
Solution
Replica layer availability:
Including shared database:
Engineering Comment
The shared dependency dominates. Replicating stateless nodes does not remove a single shared failure.
Plausibility Check
The result is close to database availability because the database is required.
Exercise 12: Leader-Election Failover Time
Failure detection takes 4.5\ \text{s}, leader election takes 1.2\ \text{s} and cache warm-up takes 3.0\ \text{s}. The service objective requires recovery under 10\ \text{s}. Does it pass?
Solution
Since 8.7<10\ \text{s}, it passes with 1.3\ \text{s} margin.
Engineering Comment
The margin should be tested under traffic and with durable-log replay included.
Plausibility Check
The three components sum to less than ten seconds.
Exercise 13: Vector-Clock Concurrency
Replica A observes vector clock (3,2) and replica B observes (2,4). Are the events causally ordered?
Solution
For A to precede B, every component of A must be less than or equal to B:
For B to precede A:
The events are concurrent.
Engineering Comment
Concurrent updates require a merge rule, conflict resolution or rejection path.
Plausibility Check
Each vector is ahead in one component.
Exercise 14: Lamport Clock Ordering
Event X has Lamport timestamp 17 and event Y has timestamp 23. Can the timestamps alone prove that X caused Y?
Solution
No. Lamport clocks provide an order compatible with causality:
But L(X)<L(Y) does not prove causation.
Engineering Comment
Use vector clocks, trace context or protocol evidence when causal dependency matters.
Plausibility Check
A smaller Lamport timestamp can occur without a direct causal path.
Exercise 15: Eventual-Consistency Stale Window
Replication lag p99 is 850\ \text{ms}. Product promise allows stale reads for at most 500\ \text{ms}. Does the read contract pass?
Solution
The stale-read contract fails at p99.
Engineering Comment
Options include read-your-writes routing, stronger reads, lag hiding or a weaker product promise.
Plausibility Check
The measured lag tail is larger than the allowed stale window.
Exercise 16: Snapshot-Isolation Write Skew
Two doctors are on call. Two transactions each read that both are on call, then each turns one different doctor off call. Under snapshot isolation, what invariant can fail?
Solution
Each transaction sees the original state:
After both commit:
The invariant “at least one doctor is on call” fails.
Engineering Comment
Snapshot isolation is not serializability. Invariant-preserving transactions may need explicit locks, constraints or serializable isolation.
Plausibility Check
Each transaction changes a different row, so row conflict detection may not catch the global invariant violation.
Exercise 17: Two-Phase-Commit Blocking Window
A participant enters prepared state. Coordinator recovery takes 45\ \text{s} and participant lock timeout is 30\ \text{s}. Does the participant risk timing out before outcome is known?
Solution
Yes. The participant can time out before the coordinator recovers.
Engineering Comment
Two-phase commit can block. The design needs a recovery record, timeout semantics and business-safe compensation path.
Plausibility Check
Coordinator recovery is longer than the participant timeout.
Exercise 18: Distributed Release Gate
A release candidate has end-to-end latency 170\ \text{ms} against 180\ \text{ms} target, retry amplification to 1116\ \text{requests/s} against 1050\ \text{requests/s} safe capacity, quorum majority behavior correct, stale-read p99 850\ \text{ms} against 500\ \text{ms} promise and two-phase-commit recovery 45\ \text{s} against 30\ \text{s} participant timeout. Decide release status.
Solution
Retry capacity fails:
Stale-read promise fails:
Transaction recovery timing fails:
Release should be held.
Engineering Comment
Correct majority behavior does not compensate for retry overload, violated read promises or transaction blocking risk.
Plausibility Check
Three independent distributed-system gates fail, so the decision is unambiguous.