Exercise set

Distributed Systems Consistency, Retry, and Reliability Exercises

Solved distributed-systems exercises for latency budgets, retries, quorums, consensus, failover, consistency models, transactions and release gates.

These exercises focus on distributed-system release calculations: latency budgets, fan-out, retries, circuit breakers, quorums, consensus, failover, consistency models, transactions and reliability gates. Operating-system scheduling and local concurrency are handled in separate specialist exercise sets.

Use the calculations as review aids. Distributed correctness requires failure injection, trace correlation, durable-log checks, partition tests, idempotency evidence, read-path validation, rollback plans and production observability.

Release Evidence Notes

Distributed evidence should show what happens when requests are delayed, duplicated, retried, reordered, partitioned or served by stale replicas. A normal load test is not enough for a system that claims strong consistency or safe failover.

Engineering Boundary Notes

These exercises assume crash faults and ordinary network delay unless stated otherwise. Byzantine behavior, malicious clients, clock synchronization failures, schema migrations and multi-region legal boundaries require separate analysis.

Common Release Mistakes

  • treating quorum size as a full consensus protocol;
  • retrying non-idempotent commands without a deduplication key;
  • validating failover with no concurrent writes;
  • proving write durability but not read-path freshness;
  • using two-phase commit where indefinite blocking is unacceptable;
  • ignoring common dependencies when multiplying availability.

Scenario Map

ScenarioMain calculationRelease decision
Latencyserial path, fan-out and timeout budgetsTune, shed, cache or reject release.
Retriesamplification, backoff and idempotencySet retry policy and circuit breaker.
Quorum and consensusmajority, intersection and fencingAccept or block failover path.
Consistencyclocks, stale reads and isolationSelect read/write contract.
Reliabilityavailability and common dependenciesRemove shared failure or document risk.

Validation Package Checklist

  • service graph, timeout budget and retry budget;
  • idempotency key, deduplication store and replay test;
  • quorum membership, term, commit index and fencing evidence;
  • failover timeline under concurrent reads and writes;
  • consistency test for stale reads, write skew and causal order;
  • post-release alarms for retry storm, stale read, split brain and saturation.

Exercise 1: Distributed Request Latency Budget

A request path has API gateway 18\ \text{ms}, service A 35\ \text{ms}, service B 42\ \text{ms} and database 55\ \text{ms}. Network overhead adds 20\ \text{ms}. Target is 180\ \text{ms}. Does it pass?

Solution

t=18+35+42+55+20=170\ \text{ms}

Since 170<180\ \text{ms}, it passes with 10\ \text{ms} margin.

Engineering Comment

The margin is small. Tail latency, retries and queuing could erase it.

Plausibility Check

Five tens-of-milliseconds terms add to just under 180\ \text{ms}.

Exercise 2: Fan-Out Tail Latency

A service fans out to 8 equal dependencies. Each dependency has probability 0.98 of responding within 120\ \text{ms}. Estimate probability all dependencies respond within that time.

Solution

P_{all}=0.98^8=0.8508

Only about 85.1\% of aggregate requests meet that dependency threshold.

Engineering Comment

Fan-out makes ordinary per-service tail latency visible at the user boundary.

Plausibility Check

Eight chances to be late should reduce the all-pass probability noticeably.

Exercise 3: Retry Amplification

Original traffic is 900\ \text{requests/s}. A client retries failed requests twice. Failure fraction is 12\%. Estimate added retry traffic if every failed attempt is retried exactly twice.

Solution

\lambda_{retry}=900(0.12)(2)=216\ \text{requests/s}

Total attempted traffic:

\lambda_{total}=900+216=1116\ \text{requests/s}

Engineering Comment

This ignores retries that also fail. Real retry storms can amplify more severely.

Plausibility Check

About one tenth of traffic retried twice should add about one fifth.

Exercise 4: Jittered Backoff Peak Load

Without jitter, 5000 clients retry in the same 1\ \text{s} window. Jitter spreads retries over 20\ \text{s}. Estimate average retry rate after jitter.

Solution

\lambda=\dfrac{5000}{20}=250\ \text{retries/s}

Engineering Comment

Jitter reduces synchronized bursts, but the service must still tolerate the average retry load.

Plausibility Check

Spreading over twenty seconds reduces the one-second spike by about 20\times.

Exercise 5: Timeout Budget Across Hops

End-to-end timeout is 800\ \text{ms}. Three downstream calls reserve 180, 250 and 210\ \text{ms}. The caller needs 90\ \text{ms} for local processing. Compute remaining slack.

Solution

t_{used}=180+250+210+90=730\ \text{ms}
t_{slack}=800-730=70\ \text{ms}

Engineering Comment

Timeout budgets should be explicit; otherwise nested retries can outlive the client request.

Plausibility Check

The used budget is close to 800\ \text{ms}, leaving a small positive slack.

Exercise 6: Circuit-Breaker Open Threshold

A circuit breaker opens when the failure rate exceeds 35\% over a window of 200 calls. The window has 78 failures. Does it open?

Solution

f=\dfrac{78}{200}=0.39=39\%

Since 39\%>35\%, the breaker opens.

Engineering Comment

Breaker thresholds should be paired with half-open probes and user-visible fallback behavior.

Plausibility Check

Seventy-eight failures is more than one third of two hundred.

Exercise 7: Bulkhead Capacity

A service reserves 30 workers for interactive traffic and 12 for batch traffic. Batch receives a burst requiring 25 workers. How many batch requests must wait or be shed if all workers are busy?

Solution

n_{excess}=25-12=13

Engineering Comment

Bulkheads protect interactive traffic by refusing to let batch consume the shared pool.

Plausibility Check

Batch demand is more than twice its reserved pool.

Exercise 8: Quorum Intersection

A replicated service has N=5 nodes and writes require quorum Q=3. Show the minimum intersection between two write quorums.

Solution

For two quorums:

I_{min}=2Q-N=2(3)-5=1

Any two write quorums intersect in at least one node.

Engineering Comment

Intersection supports safety only when protocol rules also use term, log freshness and commit conditions correctly.

Plausibility Check

Two groups of three chosen from five must overlap.

Exercise 9: Majority Commit Under Partition

A five-node cluster is partitioned into groups of 3 and 2. Majority commit requires 3 acknowledgements. Which side can commit?

Solution

The group of 3 can reach majority:

3\ge 3

The group of 2 cannot:

2<3

Engineering Comment

The minority side must reject or redirect writes; otherwise split brain can occur.

Plausibility Check

Only the larger partition has enough nodes for majority.

Exercise 10: Fencing Token Check

Old leader has fencing token 41. New leader has token 42. Storage accepts writes only if the presented token is at least the stored token. Which leader can write after token 42 is stored?

Solution

Old leader:

41<42

New leader:

42\ge 42

Only the new leader can write.

Engineering Comment

Leader election is not enough unless old leaders are fenced at every state-changing dependency.

Plausibility Check

The larger token represents newer authority.

Exercise 11: Replicated Availability With Shared Dependency

Two replicas each have availability 0.99, but both depend on one shared database with availability 0.98. Assuming replica independence and shared database dependency, estimate service availability.

Solution

Replica layer availability:

A_r=1-(1-0.99)^2=0.9999

Including shared database:

A=A_r(0.98)=0.9799

Engineering Comment

The shared dependency dominates. Replicating stateless nodes does not remove a single shared failure.

Plausibility Check

The result is close to database availability because the database is required.

Exercise 12: Leader-Election Failover Time

Failure detection takes 4.5\ \text{s}, leader election takes 1.2\ \text{s} and cache warm-up takes 3.0\ \text{s}. The service objective requires recovery under 10\ \text{s}. Does it pass?

Solution

t_{failover}=4.5+1.2+3.0=8.7\ \text{s}

Since 8.7<10\ \text{s}, it passes with 1.3\ \text{s} margin.

Engineering Comment

The margin should be tested under traffic and with durable-log replay included.

Plausibility Check

The three components sum to less than ten seconds.

Exercise 13: Vector-Clock Concurrency

Replica A observes vector clock (3,2) and replica B observes (2,4). Are the events causally ordered?

Solution

For A to precede B, every component of A must be less than or equal to B:

3\le2\quad \text{is false}

For B to precede A:

4\le2\quad \text{is false}

The events are concurrent.

Engineering Comment

Concurrent updates require a merge rule, conflict resolution or rejection path.

Plausibility Check

Each vector is ahead in one component.

Exercise 14: Lamport Clock Ordering

Event X has Lamport timestamp 17 and event Y has timestamp 23. Can the timestamps alone prove that X caused Y?

Solution

No. Lamport clocks provide an order compatible with causality:

X\rightarrow Y \Rightarrow L(X)<L(Y)

But L(X)<L(Y) does not prove causation.

Engineering Comment

Use vector clocks, trace context or protocol evidence when causal dependency matters.

Plausibility Check

A smaller Lamport timestamp can occur without a direct causal path.

Exercise 15: Eventual-Consistency Stale Window

Replication lag p99 is 850\ \text{ms}. Product promise allows stale reads for at most 500\ \text{ms}. Does the read contract pass?

Solution

850\ \text{ms}>500\ \text{ms}

The stale-read contract fails at p99.

Engineering Comment

Options include read-your-writes routing, stronger reads, lag hiding or a weaker product promise.

Plausibility Check

The measured lag tail is larger than the allowed stale window.

Exercise 16: Snapshot-Isolation Write Skew

Two doctors are on call. Two transactions each read that both are on call, then each turns one different doctor off call. Under snapshot isolation, what invariant can fail?

Solution

Each transaction sees the original state:

D_1=\text{on},\quad D_2=\text{on}

After both commit:

D_1=\text{off},\quad D_2=\text{off}

The invariant “at least one doctor is on call” fails.

Engineering Comment

Snapshot isolation is not serializability. Invariant-preserving transactions may need explicit locks, constraints or serializable isolation.

Plausibility Check

Each transaction changes a different row, so row conflict detection may not catch the global invariant violation.

Exercise 17: Two-Phase-Commit Blocking Window

A participant enters prepared state. Coordinator recovery takes 45\ \text{s} and participant lock timeout is 30\ \text{s}. Does the participant risk timing out before outcome is known?

Solution

45\ \text{s}>30\ \text{s}

Yes. The participant can time out before the coordinator recovers.

Engineering Comment

Two-phase commit can block. The design needs a recovery record, timeout semantics and business-safe compensation path.

Plausibility Check

Coordinator recovery is longer than the participant timeout.

Exercise 18: Distributed Release Gate

A release candidate has end-to-end latency 170\ \text{ms} against 180\ \text{ms} target, retry amplification to 1116\ \text{requests/s} against 1050\ \text{requests/s} safe capacity, quorum majority behavior correct, stale-read p99 850\ \text{ms} against 500\ \text{ms} promise and two-phase-commit recovery 45\ \text{s} against 30\ \text{s} participant timeout. Decide release status.

Solution

Retry capacity fails:

1116>1050

Stale-read promise fails:

850>500\ \text{ms}

Transaction recovery timing fails:

45>30\ \text{s}

Release should be held.

Engineering Comment

Correct majority behavior does not compensate for retry overload, violated read promises or transaction blocking risk.

Plausibility Check

Three independent distributed-system gates fail, so the decision is unambiguous.

REF

See also