Project
Replicated Log Failover Consistency Validation Project
Computer engineering project for validating replicated-log failover with quorum checks, linearizability history, fencing, partition tests, recovery replay, telemetry, and release evidence.
This project builds a validation package for a replicated-log service that must survive leader failure and network partition without losing committed commands or accepting split-brain writes. The final deliverable is a reviewable engineering package: consistency contract, quorum calculations, failover test matrix, linearizability history, fencing evidence, replay checks, telemetry requirements, residual risks, and release decision.
The project is not a tutorial for a specific consensus implementation. It treats distributed consistency as an engineering validation problem. A system can pass ordinary availability tests and still corrupt state if an old leader can write, if a committed log entry disappears during failover, if reads are served from stale replicas, or if client retries create duplicate commands.
Project Objective
Validate a five-node replicated-log cluster used by a command service. The release decision must answer:
- Can the service preserve committed log entries through leader crash, follower lag, restart, and partition?
- Can a stale leader be fenced before it writes to storage, actuators, or downstream queues?
- Do acknowledged writes and reads satisfy the claimed linearizability contract?
- Does recovery replay restore state without duplicate command execution?
- What telemetry and evidence are required before production rollout?
The deliverable is a failover consistency validation report with test data, operation histories, log excerpts, fencing-token evidence, a recovery replay summary, latency and error-budget impact, and a release acceptance statement.
System Under Test
Use the following simplified service.
| Item | Project value |
|---|---|
| replica count | $N=5$ |
| consensus role | one leader, four followers |
| command rate during validation | $\lambda_w=120\ \text{writes/s}$ |
| read rate during validation | $\lambda_r=180\ \text{reads/s}$ |
| target p99 write latency | $180\ \text{ms}$ |
| target p99 read latency | $90\ \text{ms}$ |
| maximum accepted committed-entry loss | zero |
| maximum accepted split-brain write count | zero |
| client retry limit | two retries with jittered backoff |
| recovery objective after leader crash | $T_{recover}\leq 3\ \text{s}$ |
| validation mode | staged fault injection before production rollout |
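The state machine applies each command exactly once by command id. Writing $\text{applied}(c)$ for the number of times a committed command $c$ has been applied:
$$\text{applied}(c) = 1 \quad \text{for every committed command } c$$
Duplicate execution is a release failure:
$$\exists\, c:\ \text{applied}(c) > 1$$
Missing committed execution is also a release failure:
$$\exists\, c \text{ committed}:\ \text{applied}(c) = 0$$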
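The exactly-once rule can be illustrated with a toy state machine that deduplicates by command id. This is a minimal sketch, not the service's implementation: the class name `CommandStateMachine`, the integer state, and the `delta` payload are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CommandStateMachine:
    """Toy state machine that applies each command id at most once (hypothetical)."""
    state: int = 0
    applied: dict = field(default_factory=dict)  # command id -> apply count

    def apply(self, command_id: str, delta: int) -> bool:
        """Apply a committed command; return False if the id was already applied."""
        if command_id in self.applied:
            return False  # duplicate delivery (client retry or replay): skip it
        self.state += delta
        self.applied[command_id] = 1
        return True


sm = CommandStateMachine()
first = sm.apply("cmd-1", 5)
retry = sm.apply("cmd-1", 5)  # same command id delivered a second time
print(first, retry, sm.state, sm.applied)
```

In a real service the `applied` table would have to be persisted together with the state, otherwise a restart would forget which ids were already executed.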
Consistency Contract
The validation contract is intentionally narrow and testable:
- a write is acknowledged only after the command is durably replicated to a majority;
- a read that claims linearizability must observe all writes that completed before the read began;
- a leader with an old term or fencing token cannot write to the protected resource;
- recovery replay is idempotent by command id;
- during a partition, the minority side must reject writes or enter degraded read-only behavior.
For operation history checking, if operation $a$ completes before operation $b$ starts:
$$t_{resp}(a) < t_{inv}(b)$$
then the linearized order $<_L$ must place $a$ before $b$:
$$a <_L b$$
This is stronger than eventual consistency. If the product chooses eventual consistency for a specific endpoint, the endpoint must say so and must not be validated against the same linearizable read contract.
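The real-time precedence rule used for history checking can be sketched as a small check over recorded operations. The `Op` record and the function name are hypothetical, and the sketch tests only the real-time ordering constraint; a full linearizability checker must also confirm that each result is legal for the sequential state machine.

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    invoked: float    # invocation timestamp
    responded: float  # response timestamp


def precedence_violations(history, linearized_order):
    """Return pairs (a, b) where a finished before b started,
    yet the proposed linearized order places b before a."""
    position = {name: i for i, name in enumerate(linearized_order)}
    violations = []
    for a in history:
        for b in history:
            if a.responded < b.invoked and position[a.name] > position[b.name]:
                violations.append((a.name, b.name))
    return violations


# Hypothetical history: w2 overlaps both w1 and r1; w1 completes before r1 starts.
history = [Op("w1", 0.0, 1.0), Op("r1", 2.0, 3.0), Op("w2", 0.5, 2.5)]
ok = precedence_violations(history, ["w1", "w2", "r1"])
bad = precedence_violations(history, ["r1", "w1", "w2"])
print(ok, bad)
```

Overlapping operations such as `w1` and `w2` may be ordered either way, which is why only completed-before-started pairs are checked.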
Step 1: Quorum and Failure Screen
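Majority quorum size:
$$Q = \left\lfloor \frac{N}{2} \right\rfloor + 1$$
With $N=5$:
$$Q = \left\lfloor \frac{5}{2} \right\rfloor + 1 = 3$$
Crash faults tolerated by majority quorum:
$$f = N - Q = 5 - 3 = 2$$
Any two majorities $M_1$ and $M_2$ intersect:
$$|M_1 \cap M_2| \geq 2Q - N$$
Substitute:
$$|M_1 \cap M_2| \geq 2(3) - 5 = 1$$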
The project should not stop at this calculation. Majority geometry supports safety, but real safety also depends on election restrictions, durable log persistence, fencing, client retry semantics, and membership-change rules.
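The quorum arithmetic can be cross-checked by brute force. The sketch below computes the majority size and crash tolerance, then enumerates every pair of majorities to find the smallest overlap; the function names are illustrative.

```python
from itertools import combinations


def quorum_size(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def crash_faults_tolerated(n: int) -> int:
    """Replicas that can crash while a majority remains."""
    return n - quorum_size(n)


def min_majority_overlap(n: int) -> int:
    """Smallest intersection between any two majorities, by enumeration."""
    q = quorum_size(n)
    replicas = range(n)
    return min(
        len(set(a) & set(b))
        for a in combinations(replicas, q)
        for b in combinations(replicas, q)
    )


print(quorum_size(5), crash_faults_tolerated(5), min_majority_overlap(5))
```

The enumeration is only practical for small clusters, which is the case here; it confirms the geometry and says nothing about the other safety conditions listed above.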
Step 2: Commit Index Evidence
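Let match_j be the highest replicated log index known to be stored on replica j. The commit index is the highest index stored on at least a majority of replicas ($Q=3$ for $N=5$):
$$\text{commitIndex} = \max\{\, i : |\{\, j : \text{match}_j \geq i \,\}| \geq Q \,\}$$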
Before fault injection, record:
| Replica | match_j |
|---|---|
| leader | 12480 |
| follower 1 | 12480 |
| follower 2 | 12480 |
| follower 3 | 12472 |
| follower 4 | 12450 |
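For index 12480, three replicas have the entry:
$$|\{\, j : \text{match}_j \geq 12480 \,\}| = 3 \geq Q, \quad Q = 3$$
Therefore:
$$\text{commitIndex} = 12480$$
The validation report must preserve this number across failover. After a new leader is elected, the test must prove:
$$\text{commitIndex}_{new} \geq 12480$$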
If the new leader exposes a lower committed state, the system has failed the release gate.
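The commit-index evidence can be recomputed from the recorded match values. This sketch takes the highest index held by at least a majority and adds a continuity gate for the post-failover comparison; the function names are hypothetical, and it deliberately ignores protocol-specific rules such as term restrictions on what a leader may commit.

```python
def commit_index(match: dict, quorum: int) -> int:
    """Highest log index stored on at least `quorum` replicas."""
    ordered = sorted(match.values(), reverse=True)
    return ordered[quorum - 1]


def continuity_holds(committed_before: int, committed_after: int) -> bool:
    """Release gate: the new leader must not expose a lower committed state."""
    return committed_after >= committed_before


# Values from the pre-fault table.
before = {
    "leader": 12480,
    "follower 1": 12480,
    "follower 2": 12480,
    "follower 3": 12472,
    "follower 4": 12450,
}
committed_before = commit_index(before, quorum=3)
print(committed_before)
```

The value passed as `committed_after` should come from what the new leader actually exposes to clients, not from a recomputation over the surviving replicas.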
Step 3: Partition Test Matrix
Run controlled partition tests under live traffic, not against idle nodes.
| Test | Fault | Expected result | Required evidence |
|---|---|---|---|
| P1 | isolate two followers | majority side continues | write acknowledgements include at least three replicas |
| P2 | isolate old leader with one follower | old leader rejects writes | client errors, no protected-resource writes, stale token rejected |
| P3 | isolate three-node side from two-node side | three-node side may elect leader | term change, commit index continuity, no minority writes |
| P4 | heal partition after divergent uncommitted entries | uncommitted minority entries are discarded or overwritten safely | log reconciliation trace and command-id audit |
| P5 | partition during read load | linearizable reads use leader/quorum path or are rejected | read-index evidence or explicit degraded response |
The critical observation is not only which side is available. It is whether the unavailable or stale side is proven unable to act.
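The expected outcome for each side of a partition follows from the quorum size, so it can be declared before the test runs. The sketch below labels each side of a split; the replica names and the two label strings are hypothetical.

```python
def classify_sides(sides, n: int = 5):
    """Label each side of a partition by whether it can still reach a majority."""
    quorum = n // 2 + 1
    return [
        "may commit" if len(side) >= quorum else "must reject writes"
        for side in sides
    ]


# P2 from the matrix: old leader isolated with one follower (names are illustrative).
p2 = classify_sides([{"old-leader", "f1"}, {"f2", "f3", "f4"}])
# P1 from the matrix: two followers isolated from the leader's side.
p1 = classify_sides([{"leader", "f1", "f2"}, {"f3", "f4"}])
print(p2, p1)
```

At most one side can ever be labelled "may commit", which is the property the test evidence has to confirm at the resource boundary rather than assume.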
Step 4: Failover Timing and Error Budget
A simplified failover time model is:
$$T_{failover} = T_{detect} + T_{elect} + T_{fence} + T_{catchup}$$
Use:
| Component | Value |
|---|---|
| failure detection | $T_{detect}=1.2\ \text{s}$ |
| election | $T_{elect}=0.45\ \text{s}$ |
| fencing confirmation | $T_{fence}=0.35\ \text{s}$ |
| log catch-up before serving writes | $T_{catchup}=0.60\ \text{s}$ |
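Then:
$$T_{failover} = 1.2 + 0.45 + 0.35 + 0.60 = 2.60\ \text{s}$$
Compare with the recovery objective:
$$T_{failover} = 2.60\ \text{s} \leq 3\ \text{s}$$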
The timing screen passes. The consistency screen still needs evidence that no stale authority wrote during those $2.60\ \text{s}$.
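The timing screen is simple enough to keep as an executable check next to the measured values. A minimal sketch using the component values from the table; the dictionary keys and variable names are illustrative.

```python
# Component times in seconds, taken from the failover timing table.
components = {
    "detect": 1.2,
    "elect": 0.45,
    "fence": 0.35,
    "catchup": 0.60,
}
RECOVERY_OBJECTIVE_S = 3.0

# Round to avoid floating-point noise in the printed sum.
t_failover = round(sum(components.values()), 2)
timing_screen_passes = t_failover <= RECOVERY_OBJECTIVE_S
print(t_failover, timing_screen_passes)
```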
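If the monthly error budget for the service is $B_{month}$ seconds of allowed downtime, then one failover consumes the fraction:
$$\frac{T_{failover}}{B_{month}} = \frac{2.60\ \text{s}}{B_{month}}$$
of the monthly downtime budget, assuming the failover is customer-visible downtime.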
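The budget arithmetic can be sketched as two small functions. The availability target used in the example (99.9% over a 30-day month) is a hypothetical placeholder chosen only to exercise the code; substitute the service's actual error budget.

```python
def monthly_budget_seconds(availability_target: float, days: int = 30) -> float:
    """Allowed downtime per month, in seconds, for a given availability target."""
    return (1.0 - availability_target) * days * 24 * 3600


def budget_fraction(failover_s: float, budget_s: float) -> float:
    """Share of the monthly downtime budget consumed by one failover."""
    return failover_s / budget_s


# HYPOTHETICAL target: 99.9% is an illustration, not the project's stated value.
budget_s = monthly_budget_seconds(0.999)
fraction = budget_fraction(2.60, budget_s)
print(f"budget {budget_s:.0f} s, one failover uses {fraction:.3%}")
```

The fraction is only meaningful under the stated assumption that the whole failover window is customer-visible downtime.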
Step 5: Fencing Validation
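Fencing prevents a stale leader from continuing to own the protected resource. Model leadership authority with a monotonically increasing fencing epoch, so that each new leader holds a strictly higher epoch than its predecessor:
$$e_{new} > e_{old}$$
The protected resource tracks the highest epoch it has seen, $e_{resource}$, and accepts a command carrying epoch $e_{cmd}$ only when:
$$e_{cmd} \geq e_{resource}$$
During the partition test, force the old leader to attempt a write with:
$$e_{cmd} = e_{old}$$
after the new leader has established:
$$e_{resource} = e_{new}$$
The expected outcome is:
$$e_{cmd} = e_{old} < e_{new} = e_{resource}$$
so the write is rejected.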
The validation evidence should include the rejected write, resource-side log, command id, epoch value, time source, leader id, and proof that the rejection happened at the resource boundary. A coordinator log alone is not enough if the stale side can still write directly.
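A resource-side fencing check can be sketched as follows. The class, its fields, and the epoch values 7 and 8 are hypothetical; the point is that the comparison and the rejection record live at the resource, not in the coordinator.

```python
class ProtectedResource:
    """Toy resource that rejects writes carrying a stale fencing epoch (hypothetical)."""

    def __init__(self):
        self.highest_epoch = 0
        self.accepted = []    # command ids written
        self.rejections = []  # resource-side evidence of fenced attempts

    def write(self, command_id: str, epoch: int, leader_id: str) -> bool:
        if epoch < self.highest_epoch:
            # Stale authority: record the evidence and refuse the write.
            self.rejections.append(
                {"command_id": command_id, "epoch": epoch, "leader_id": leader_id}
            )
            return False
        self.highest_epoch = epoch
        self.accepted.append(command_id)
        return True


resource = ProtectedResource()
old_ok = resource.write("c1", epoch=7, leader_id="A")  # old leader, still current
new_ok = resource.write("c2", epoch=8, leader_id="B")  # new leader raises the epoch
stale = resource.write("c3", epoch=7, leader_id="A")   # old leader tries again
print(old_ok, new_ok, stale, resource.rejections)
```

A production resource would also need the epoch comparison and the write to be atomic, and the rejection log to carry a timestamp from a stated time source.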
Step 6: Replay and Duplicate Command Check
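After failover and restart, replay all committed entries from the last durable snapshot. If the snapshot covers log indices up to $i_{snap}$ and the commit index is $i_{commit}$, then the replay range is:
$$i_{snap} < i \leq i_{commit}, \qquad n_{replay} = i_{commit} - i_{snap}$$
If the replay rate is $r_{replay}$ entries per second, the estimated replay time is:
$$T_{replay} = \frac{n_{replay}}{r_{replay}}$$
With the project's snapshot and replay figures, $T_{replay}$ exceeds the $0.60\ \text{s}$ catch-up allowance used in the failover timing screen. The engineering decision is not to hide the mismatch. The team must either reduce the snapshot interval, increase replay throughput, avoid serving until catch-up completes, or revise the recovery objective.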
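The replay estimate can be kept as a small calculation that is re-run whenever the snapshot interval or replay throughput changes. The snapshot index and replay rate in the example are hypothetical placeholders, chosen only so that the result exceeds the catch-up allowance as the text describes.

```python
CATCHUP_ALLOWANCE_S = 0.60  # from the failover timing table


def replay_time_seconds(snapshot_index: int, commit_index: int, rate_per_s: float) -> float:
    """Estimated time to replay entries in (snapshot_index, commit_index]."""
    entries = commit_index - snapshot_index
    return entries / rate_per_s


# HYPOTHETICAL inputs: snapshot at 12000 and 400 entries/s are illustrative only.
t_replay = replay_time_seconds(snapshot_index=12000, commit_index=12480, rate_per_s=400.0)
exceeds_allowance = t_replay > CATCHUP_ALLOWANCE_S
print(t_replay, exceeds_allowance)
```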
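For idempotent application, every command id $c$ should satisfy, with $\text{applied}(c)$ the number of times the state machine applied it:
$$\text{applied}(c) = 1$$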
for committed applied commands. For retried client commands, multiple submissions are acceptable only if the state machine still applies the command once.
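A duplicate-command audit can be sketched as a count over the apply log, compared against the set of committed command ids. The function name, the log shape (a flat list of applied ids), and the sample ids are hypothetical.

```python
from collections import Counter


def audit_applies(committed_ids, apply_log):
    """Compare committed command ids with the ids the state machine applied."""
    counts = Counter(apply_log)
    return {
        "duplicates": sorted(c for c in committed_ids if counts[c] > 1),
        "missing": sorted(c for c in committed_ids if counts[c] == 0),
        "uncommitted_applied": sorted(set(apply_log) - set(committed_ids)),
    }


# Hypothetical audit input: c2 applied twice, c3 never applied, x9 never committed.
report = audit_applies(
    committed_ids={"c1", "c2", "c3"},
    apply_log=["c1", "c2", "c2", "x9"],
)
print(report)
```

Any non-empty list in the report is a release failure under the contract above; the third list catches the case where an entry that was never committed still reached the state machine.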
Step 7: Linearizability History Check
Capture operation histories with invocation time, response time, client id, command id, leader id, term, commit index, result, and observed state.
Minimum fields:
| Field | Why it matters |
|---|---|
| invocation timestamp | orders real-time operation starts |
| response timestamp | detects real-time precedence |
| command id | detects duplicate or missing application |
| term and leader id | detects stale authority |
| commit index | ties response to replicated log state |
| read observed version | proves read freshness or exposes stale reads |
| result code | separates rejected degraded behavior from false success |
A compact consistency assertion for a read $r$ is:
$$\text{version}(r) \geq \text{version}(w)$$
for every write $w$ whose response completed before the read invocation:
$$t_{resp}(w) < t_{inv}(r)$$
If this condition fails and the endpoint claimed linearizability, the release fails. If the endpoint is explicitly eventually consistent, the report must state the staleness bound, repair mechanism, user-visible behavior, and conflict policy.
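The read-freshness assertion can be run directly over a captured history. This sketch assumes that versions increase monotonically with each acknowledged write and that the history contains only successfully acknowledged operations; the `Event` record and function name are hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    kind: str         # "write" or "read"
    invoked: float    # invocation timestamp
    responded: float  # response timestamp
    version: int      # version written, or version observed by a read


def stale_reads(history):
    """Return (read invocation time, observed version, required version)
    for each read that missed a write completed before it began."""
    writes = [e for e in history if e.kind == "write"]
    stale = []
    for r in history:
        if r.kind != "read":
            continue
        required = max(
            (w.version for w in writes if w.responded < r.invoked), default=0
        )
        if r.version < required:
            stale.append((r.invoked, r.version, required))
    return stale


# Hypothetical history: the first read overlaps write v2, the second starts after it.
history = [
    Event("write", 0.0, 1.0, 1),
    Event("write", 2.0, 3.0, 2),
    Event("read", 2.5, 3.5, 1),  # concurrent with write v2: version 1 is allowed
    Event("read", 4.0, 5.0, 1),  # starts after write v2 completed: stale
]
violations = stale_reads(history)
print(violations)
```

A read that overlaps a write may legally observe either version, so only writes that completed strictly before the read invocation set the required minimum.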
Step 8: Release Test Plan
The release test plan should include:
- baseline majority commit under steady load;
- leader crash during write traffic;
- follower lag and catch-up before leader election;
- minority partition write rejection;
- stale leader fencing at the protected resource;
- partition heal with uncommitted log reconciliation;
- client retry with duplicate command id;
- read freshness during and after failover;
- snapshot recovery and replay timing;
- telemetry review after canary rollout.
Every test must state expected behavior before it runs. A test that only observes what happened after the fact is weaker evidence than a test with a declared release gate.
Deliverable Template
The final package should contain:
- consistency contract and endpoints covered;
- cluster topology, quorum size, membership rule and fault assumptions;
- test matrix with fault injection method and expected outcome;
- operation-history file or summarized linearizability check;
- fencing-token evidence from the protected resource;
- commit-index continuity before and after failover;
- replay and duplicate-command audit;
- p95, p99 and worst observed latency during tests;
- error-budget impact and rollback trigger;
- residual risks and release decision.
Acceptance Criteria
Accept the release only if:
- acknowledged committed entries are not lost after failover;
- minority or stale leaders cannot write to the protected resource;
- linearizable reads observe completed writes or are explicitly rejected;
- retries do not duplicate command execution;
- recovery replay completes before the service advertises full recovery;
- p99 latency and error-budget impact remain inside release limits;
- telemetry can distinguish original requests, retries, rejections, stale-leader attempts, and replay work.
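The acceptance list behaves as a conjunction: one failed criterion blocks the release. A minimal sketch of that gate, with hypothetical criterion keys and sample results:

```python
def release_decision(criteria: dict):
    """Accept only if every criterion passed; also return the failed ones."""
    failed = [name for name, passed in criteria.items() if not passed]
    return (len(failed) == 0, failed)


# HYPOTHETICAL results for illustration; real values come from the test evidence.
results = {
    "no_committed_entry_loss": True,
    "stale_leader_fenced": True,
    "linearizable_reads_fresh_or_rejected": True,
    "no_duplicate_execution": True,
    "replay_completes_before_recovery_advertised": False,
    "latency_and_budget_within_limits": True,
    "telemetry_distinguishes_request_classes": True,
}
accepted, failed = release_decision(results)
print(accepted, failed)
```

Returning the failed names alongside the verdict keeps the release statement traceable to the specific evidence that blocked it.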
Common Failure Modes
Common failures include accepting writes on both sides of a partition, treating leader election as fencing, serving linearizable reads from a lagging follower, acknowledging writes before durable majority replication, retrying commands without idempotency keys, compacting logs before a slow follower has a safe snapshot, and validating failover with no traffic.
Another failure is assuming that a healthy canary proves consistency. Many consistency bugs appear only during a specific interleaving: write in flight, leader crash, stale retry, partition heal, or replay after snapshot. The validation package must therefore capture the history around the fault, not only aggregate success rate.
Engineering Limitations
This project is a validation framework, not a formal proof. It does not replace model checking, protocol review, security review, storage durability testing, membership-change analysis, or production incident learning. It gives the engineering team a practical release package: what must be measured, which histories must be preserved, and what evidence is strong enough to claim safe failover for a replicated-log service.