
Replicated Log Failover Consistency Validation Project

A computer engineering project that validates replicated-log failover using quorum checks, linearizability histories, fencing, partition tests, recovery replay, telemetry, and release evidence.

This project builds a validation package for a replicated-log service that must survive leader failure and network partition without losing committed commands or accepting split-brain writes. The final deliverable is a reviewable engineering package: consistency contract, quorum calculations, failover test matrix, linearizability history, fencing evidence, replay checks, telemetry requirements, residual risks, and release decision.

The project is not a tutorial for a specific consensus implementation. It treats distributed consistency as an engineering validation problem. A system can pass ordinary availability tests and still corrupt state if an old leader can write, if a committed log entry disappears during failover, if reads are served from stale replicas, or if client retries create duplicate commands.

Project Objective

Validate a five-node replicated-log cluster used by a command service. The release decision must answer:

  1. Can the service preserve committed log entries through leader crash, follower lag, restart, and partition?
  2. Can a stale leader be fenced before it writes to storage, actuators, or downstream queues?
  3. Do acknowledged writes and reads satisfy the claimed linearizability contract?
  4. Does recovery replay restore state without duplicate command execution?
  5. What telemetry and evidence are required before production rollout?

The deliverable is a failover consistency validation report with test data, operation histories, log excerpts, fencing-token evidence, recovery replay summary, latency and error-budget impact, and release acceptance statement.

System Under Test

Use the following simplified service.

| Item | Project value |
| --- | --- |
| replica count | N=5 |
| consensus role | one leader, four followers |
| command rate during validation | \lambda_w=120\ \text{writes/s} |
| read rate during validation | \lambda_r=180\ \text{reads/s} |
| target p99 write latency | 180\ \text{ms} |
| target p99 read latency | 90\ \text{ms} |
| maximum accepted committed-entry loss | zero |
| maximum accepted split-brain write count | zero |
| client retry limit | two retries with jittered backoff |
| recovery objective after leader crash | T_{recover}\leq 3\ \text{s} |
| validation mode | staged fault injection before production rollout |

The state machine applies each command exactly once by command id:

C_{applied}=\left|\{cmd\_id:\ cmd\_id\ \text{applied once}\}\right|

Duplicate execution is a release failure:

N_{duplicate}=0

Missing committed execution is also a release failure:

N_{committed}-C_{applied}=0

Consistency Contract

The validation contract is intentionally narrow and testable:

  1. a write is acknowledged only after the command is durably replicated to a majority;
  2. a read that claims linearizability must observe all writes that completed before the read began;
  3. a leader with an old term or fencing token cannot write to the protected resource;
  4. recovery replay is idempotent by command id;
  5. during a partition, the minority side must reject writes or enter degraded read-only behavior.

For operation history checking, if operation a completes before operation b starts:

response(a)<invoke(b)

then the linearized order must place a before b:

a<_{lin}b

This is stronger than eventual consistency. If the product chooses eventual consistency for a specific endpoint, the endpoint must say so and must not be validated against the same linearizable read contract.
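
The real-time precedence rule above can be checked mechanically against a candidate linearized order. The following is a minimal sketch, assuming each operation is recorded as an (invoke, response) timestamp pair keyed by a hypothetical operation id. It checks only the ordering constraint, not whether the order is a legal execution of the state machine.

```python
def respects_real_time(ops: dict[str, tuple[float, float]], order: list[str]) -> bool:
    """Return True if `order` places a before b whenever response(a) < invoke(b).

    ops maps an operation id to its (invoke, response) timestamps.
    order is a candidate linearization listing every operation id once.
    """
    position = {op_id: i for i, op_id in enumerate(order)}
    for a, (_, response_a) in ops.items():
        for b, (invoke_b, _) in ops.items():
            if a != b and response_a < invoke_b and position[a] > position[b]:
                return False
    return True


# Illustrative history (made-up timestamps): a finishes before b starts,
# while c overlaps both and is therefore unconstrained by real time.
example_ops = {"a": (0.0, 1.0), "b": (2.0, 3.0), "c": (0.5, 2.5)}
print(respects_real_time(example_ops, ["a", "c", "b"]))
print(respects_real_time(example_ops, ["b", "a", "c"]))
```

Overlapping operations may appear in either order, which is why a full linearizability checker must search over candidate orders rather than test a single one.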

Step 1: Quorum and Failure Screen

Majority quorum size:

\displaystyle Q=\left\lfloor\frac{N}{2}\right\rfloor+1

With N=5:

\displaystyle Q=\left\lfloor\frac{5}{2}\right\rfloor+1=3

Crash faults tolerated by majority quorum:

\displaystyle f=\left\lfloor\frac{N-1}{2}\right\rfloor=2

Any two majorities intersect:

Q+Q>N

Substitute:

3+3=6>5

The project should not stop at this calculation. Quorum intersection is necessary for safety but not sufficient: real safety also depends on election restrictions, durable log persistence, fencing, client retry semantics, and membership-change rules.
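
The quorum screen is small enough to keep as an executable check in the validation package. This sketch only restates the three formulas from this step; the function names are illustrative.

```python
def majority_quorum(n: int) -> int:
    """Q = floor(N/2) + 1."""
    return n // 2 + 1


def crash_faults_tolerated(n: int) -> int:
    """f = floor((N-1)/2): replicas that may crash while a majority survives."""
    return (n - 1) // 2


def majorities_intersect(n: int) -> bool:
    """Any two majorities share at least one replica when Q + Q > N."""
    q = majority_quorum(n)
    return q + q > n


for n in (3, 4, 5, 7):
    print(n, majority_quorum(n), crash_faults_tolerated(n), majorities_intersect(n))
```

Running the loop for even cluster sizes shows that adding a fourth node raises the quorum to three without tolerating an additional crash.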

Step 2: Commit Index Evidence

Let match_j be the highest replicated log index known to be stored on replica j. The commit index is:

C_{commit}=\max\left\{i:\left|\{j:match_j\geq i\}\right|\geq Q\right\}

Before fault injection, record:

| Replica | match_j |
| --- | --- |
| leader | 12480 |
| follower 1 | 12480 |
| follower 2 | 12480 |
| follower 3 | 12472 |
| follower 4 | 12450 |

For index 12480, three replicas have the entry:

\left|\{j:match_j\geq12480\}\right|=3

Therefore:

C_{commit}=12480

The validation report must preserve this number across failover. After a new leader is elected, the test must prove:

C_{commit,new}\geq C_{commit,old}

If the new leader exposes a lower committed state, the system has failed the release gate.
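
The commit-index definition and the continuity gate can be sketched as follows. The approach relies on the fact that the Q-th largest match value is the highest index stored on at least Q replicas. The sketch ignores any term-based commit restriction the protocol under test may impose, and the function names are assumptions for illustration.

```python
def commit_index(match: dict[str, int], quorum: int) -> int:
    """Highest log index stored on at least `quorum` replicas."""
    if not 1 <= quorum <= len(match):
        raise ValueError("quorum must be between 1 and the replica count")
    ordered = sorted(match.values(), reverse=True)
    return ordered[quorum - 1]


def continuity_holds(old_commit: int, new_commit: int) -> bool:
    """Release gate: the new leader must not expose a lower committed state."""
    return new_commit >= old_commit


# Values recorded before fault injection in this step.
match_before = {
    "leader": 12480,
    "follower 1": 12480,
    "follower 2": 12480,
    "follower 3": 12472,
    "follower 4": 12450,
}
old_commit = commit_index(match_before, quorum=3)
print(old_commit)
```

After failover, the same function would be applied to the match values reported by the new leader and the result passed to `continuity_holds`.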

Step 3: Partition Test Matrix

Run controlled partition tests with traffic, not idle nodes.

| Test | Fault | Expected result | Required evidence |
| --- | --- | --- | --- |
| P1 | isolate two followers | majority side continues | write acknowledgements include at least three replicas |
| P2 | isolate old leader with one follower | old leader rejects writes | client errors, no protected-resource writes, stale token rejected |
| P3 | isolate three-node side from two-node side | three-node side may elect leader | term change, commit index continuity, no minority writes |
| P4 | heal partition after divergent uncommitted entries | uncommitted minority entries are discarded or overwritten safely | log reconciliation trace and command-id audit |
| P5 | partition during read load | linearizable reads use leader/quorum path or are rejected | read-index evidence or explicit degraded response |

The critical observation is not only which side is available. It is whether the unavailable or stale side is proven unable to act.
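
Before running a partition test, the expected behavior of each side can be derived from the quorum size alone. The sketch below is a hedged illustration: side names and the two result strings are assumptions, and it states only what each side is permitted to do, not what the system actually did.

```python
def classify_sides(sides: dict[str, int], cluster_size: int) -> dict[str, str]:
    """Map each partition side to the behavior the contract expects of it."""
    if sum(sides.values()) > cluster_size:
        raise ValueError("sides contain more replicas than the cluster")
    quorum = cluster_size // 2 + 1
    return {
        name: "may commit writes" if size >= quorum else "must reject writes"
        for name, size in sides.items()
    }


# Test P2 from the matrix: the old leader is isolated with one follower.
p2 = classify_sides({"old leader + one follower": 2, "remaining three": 3}, cluster_size=5)
for side, expectation in p2.items():
    print(side, "->", expectation)
```

Because two majorities always intersect, at most one side can be classified as able to commit. The evidence column in the matrix then has to show that every other side was actually unable to act.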

Step 4: Failover Timing and Error Budget

A simplified failover time model is:

T_{failover}=T_{detect}+T_{elect}+T_{fence}+T_{catchup}

Use:

| Component | Value |
| --- | --- |
| failure detection | T_{detect}=1.2\ \text{s} |
| election | T_{elect}=0.45\ \text{s} |
| fencing confirmation | T_{fence}=0.35\ \text{s} |
| log catch-up before serving writes | T_{catchup}=0.60\ \text{s} |

Then:

T_{failover}=1.2+0.45+0.35+0.60=2.60\ \text{s}

Compare with the recovery objective:

2.60\ \text{s}\leq3.00\ \text{s}

The timing screen passes. The consistency screen still needs evidence that no stale authority wrote during those 2.60\ \text{s}.

If the monthly error budget for the service is:

T_{budget}=43.2\ \text{min}

then one failover consumes:

\displaystyle B_{used}=\frac{2.60}{43.2(60)}=0.00100

or about:

0.100\%

of the monthly downtime budget, assuming the failover is customer-visible downtime.
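
The timing screen and the budget fraction can be recomputed from measured components with a few lines. This sketch uses the values from this step; the constant and function names are illustrative.

```python
FAILOVER_COMPONENTS_S = {
    "detect": 1.20,
    "elect": 0.45,
    "fence": 0.35,
    "catchup": 0.60,
}
RECOVERY_OBJECTIVE_S = 3.0
MONTHLY_BUDGET_MIN = 43.2


def failover_time(components: dict[str, float]) -> float:
    """T_failover as the sum of its sequential components, in seconds."""
    return sum(components.values())


def budget_fraction(downtime_s: float, budget_min: float) -> float:
    """Share of the monthly downtime budget consumed by one outage."""
    return downtime_s / (budget_min * 60.0)


t_failover = failover_time(FAILOVER_COMPONENTS_S)
print(f"T_failover = {t_failover:.2f} s")
print(f"within objective: {t_failover <= RECOVERY_OBJECTIVE_S}")
print(f"budget used: {budget_fraction(t_failover, MONTHLY_BUDGET_MIN):.3%}")
```

Keeping the components in a dictionary makes it straightforward to substitute measured values from each test run and to see which component dominates.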

Step 5: Fencing Validation

Fencing prevents a stale leader from continuing to own the protected resource. Model leadership authority with a monotonically increasing fencing epoch:

E_{new}>E_{old}

The protected resource accepts a command only when:

E_{request}=E_{current}

During the partition test, force the old leader to attempt a write with:

E_{request}=41

after the new leader has established:

E_{current}=42

The expected outcome is:

41\neq42

so the write is rejected.

The validation evidence should include the rejected write, resource-side log, command id, epoch value, time source, leader id, and proof that the rejection happened at the resource boundary. A coordinator log alone is not enough if the stale side can still write directly.
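
A resource-side epoch check might look like the sketch below. It is an assumption-laden illustration, not the protected resource's real interface: the `ProtectedResource` class, the `advance_epoch` method, and the record fields are hypothetical, and the acceptance rule is the equality test stated in this step.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectedResource:
    """Hypothetical resource that enforces the fencing epoch at its own boundary."""
    current_epoch: int
    applied: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def advance_epoch(self, new_epoch: int) -> None:
        # Epochs must increase monotonically: E_new > E_old.
        if new_epoch <= self.current_epoch:
            raise ValueError("fencing epoch must increase")
        self.current_epoch = new_epoch

    def write(self, epoch: int, leader_id: str, cmd_id: str) -> bool:
        # Accept only when E_request == E_current; log every rejection as evidence.
        if epoch != self.current_epoch:
            self.rejected.append(
                {"cmd_id": cmd_id, "leader_id": leader_id,
                 "request_epoch": epoch, "current_epoch": self.current_epoch}
            )
            return False
        self.applied.append(cmd_id)
        return True


resource = ProtectedResource(current_epoch=41)
resource.advance_epoch(42)                       # new leader established
print(resource.write(41, "old-leader", "cmd-1"))  # stale attempt
print(resource.rejected)
```

The rejection record is written by the resource itself, which is the kind of resource-boundary evidence the report needs in addition to any coordinator log.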

Step 6: Replay and Duplicate Command Check

After failover and restart, replay all committed entries from the last durable snapshot. If the snapshot covers:

C_{snapshot}=10000

and the commit index is:

C_{commit}=12480

then the replay range is:

N_{replay}=C_{commit}-C_{snapshot}=2480\ \text{entries}

If replay rate is:

r_{replay}=900\ \text{entries/s}

estimated replay time is:

\displaystyle T_{replay}=\frac{2480}{900}=2.76\ \text{s}

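This exceeds the 0.60\ \text{s} catch-up allowance used in the failover timing screen. If replay sits on the failover critical path, substituting 2.76\ \text{s} for T_{catchup} gives 1.20+0.45+0.35+2.76=4.76\ \text{s}, which also exceeds the 3\ \text{s} recovery objective. The report must state this mismatch rather than hide it. The team must either shorten the snapshot interval, increase replay throughput, avoid serving until catch-up completes, or revise the recovery objective.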
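
The replay estimate and the size of the gap can be sketched as simple arithmetic. The helper names are illustrative, and the model assumes a constant replay rate, which a real system may not sustain.

```python
def replay_time_s(commit_index: int, snapshot_index: int, rate_entries_per_s: float) -> float:
    """Estimated time to replay entries between the snapshot and the commit index."""
    if rate_entries_per_s <= 0:
        raise ValueError("replay rate must be positive")
    return (commit_index - snapshot_index) / rate_entries_per_s


def max_entries_within(allowance_s: float, rate_entries_per_s: float) -> float:
    """Largest replay range that fits in the catch-up allowance at the given rate."""
    return allowance_s * rate_entries_per_s


def required_rate(entries: int, allowance_s: float) -> float:
    """Replay rate needed to finish `entries` within the allowance."""
    return entries / allowance_s


CATCHUP_ALLOWANCE_S = 0.60
t_replay = replay_time_s(12480, 10000, 900)
print(f"T_replay = {t_replay:.2f} s, fits allowance: {t_replay <= CATCHUP_ALLOWANCE_S}")
print(f"entries that fit at 900/s: {max_entries_within(CATCHUP_ALLOWANCE_S, 900):.0f}")
print(f"rate needed for 2480 entries: {required_rate(2480, CATCHUP_ALLOWANCE_S):.0f} entries/s")
```

The last two figures quantify two of the options: how close to the commit index the snapshot would need to be at the current replay rate, or how much faster replay would need to run with the current snapshot.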

For idempotent application, every command id should satisfy:

count(cmd\_id)=1

for committed applied commands. For retried client commands, multiple submissions are acceptable only if the state machine still applies the command once.
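
One way to make the command-id audit concrete is to pair a deduplicating apply step with a post-replay count check. This is a minimal sketch under the assumption that command ids are hashable strings; `IdempotentStateMachine` and `audit` are hypothetical names, not part of the system under test.

```python
from collections import Counter


class IdempotentStateMachine:
    """Applies each command id at most once, however often it is submitted."""

    def __init__(self) -> None:
        self.seen: set[str] = set()
        self.apply_log: list[str] = []

    def apply(self, cmd_id: str) -> bool:
        if cmd_id in self.seen:
            return False          # retry or replayed duplicate: no second execution
        self.seen.add(cmd_id)
        self.apply_log.append(cmd_id)
        return True


def audit(committed_ids: list[str], apply_log: list[str]) -> tuple[list[str], list[str]]:
    """Return (duplicates, missing) so that both must be empty to pass the gate."""
    counts = Counter(apply_log)
    duplicates = sorted(c for c, n in counts.items() if n > 1)
    missing = sorted(set(committed_ids) - set(counts))
    return duplicates, missing


sm = IdempotentStateMachine()
for cmd in ["cmd-1", "cmd-2", "cmd-1"]:   # cmd-1 is retried by the client
    sm.apply(cmd)
print(audit(["cmd-1", "cmd-2"], sm.apply_log))
```

The audit is deliberately independent of the state machine so that it can be run against an apply log captured from the real service.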

Step 7: Linearizability History Check

Capture operation histories with invocation time, response time, client id, command id, leader id, term, commit index, result, and observed state.

Minimum fields:

| Field | Why it matters |
| --- | --- |
| invocation timestamp | orders real-time operation starts |
| response timestamp | detects real-time precedence |
| command id | detects duplicate or missing application |
| term and leader id | detects stale authority |
| commit index | ties response to replicated log state |
| read observed version | proves read freshness or exposes stale reads |
| result code | separates rejected degraded behavior from false success |

A compact consistency assertion for a read r is:

C_{observed}(r)\geq C_{commit}(w)

for every write w whose response completed before the read invocation:

response(w)<invoke(r)

If this condition fails and the endpoint claimed linearizability, the release fails. If the endpoint is explicitly eventually consistent, the report must state the staleness bound, repair mechanism, user-visible behavior, and conflict policy.
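
The read-freshness assertion can be evaluated directly over a captured history. The sketch below assumes a simplified record with only the fields the assertion needs; the `Op` type and its field names are illustrative, and timestamps are assumed to come from a single comparable time source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str        # "write" or "read"
    invoke: float    # invocation timestamp
    response: float  # response timestamp
    index: int       # commit index of a write, or index observed by a read


def stale_reads(history: list[Op]) -> list[tuple[Op, int]]:
    """Return each read that observed less than a write completed before it began."""
    writes = [op for op in history if op.kind == "write"]
    violations = []
    for read in (op for op in history if op.kind == "read"):
        # Highest commit index among writes with response(w) < invoke(r).
        required = max((w.index for w in writes if w.response < read.invoke), default=0)
        if read.index < required:
            violations.append((read, required))
    return violations


# Made-up history: the second read starts after both writes completed
# but still observes index 100.
history = [
    Op("write", 0.0, 1.0, 100),
    Op("write", 1.5, 2.5, 101),
    Op("read", 2.0, 2.2, 100),
    Op("read", 3.0, 3.1, 100),
]
for read, required in stale_reads(history):
    print(f"stale read at t={read.invoke}: observed {read.index}, required {required}")
```

This check covers only the freshness condition stated above. It does not detect other linearizability violations, such as a read observing a write that was never acknowledged or later lost.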

Step 8: Release Test Plan

The release test plan should include:

  1. baseline majority commit under steady load;
  2. leader crash during write traffic;
  3. follower lag and catch-up before leader election;
  4. minority partition write rejection;
  5. stale leader fencing at the protected resource;
  6. partition heal with uncommitted log reconciliation;
  7. client retry with duplicate command id;
  8. read freshness during and after failover;
  9. snapshot recovery and replay timing;
  10. telemetry review after canary rollout.

Every test must state expected behavior before it runs. A test that only observes what happened after the fact is weaker evidence than a test with a declared release gate.

Deliverable Template

The final package should contain:

  1. consistency contract and endpoints covered;
  2. cluster topology, quorum size, membership rule and fault assumptions;
  3. test matrix with fault injection method and expected outcome;
  4. operation-history file or summarized linearizability check;
  5. fencing-token evidence from the protected resource;
  6. commit-index continuity before and after failover;
  7. replay and duplicate-command audit;
  8. p95, p99 and worst observed latency during tests;
  9. error-budget impact and rollback trigger;
  10. residual risks and release decision.

Acceptance Criteria

Accept the release only if:

  1. acknowledged committed entries are not lost after failover;
  2. minority or stale leaders cannot write to the protected resource;
  3. linearizable reads observe completed writes or are explicitly rejected;
  4. retries do not duplicate command execution;
  5. recovery replay completes before the service advertises full recovery;
  6. p99 latency and error-budget impact remain inside release limits;
  7. telemetry can distinguish original requests, retries, rejections, stale-leader attempts, and replay work.

Common Failure Modes

Common failures include accepting writes on both sides of a partition, treating leader election as fencing, serving linearizable reads from a lagging follower, acknowledging writes before durable majority replication, retrying commands without idempotency keys, compacting logs before a slow follower has a safe snapshot, and validating failover with no traffic.

Another failure is assuming that a healthy canary proves consistency. Many consistency bugs appear only during a specific interleaving: write in flight, leader crash, stale retry, partition heal, or replay after snapshot. The validation package must therefore capture the history around the fault, not only aggregate success rate.

Engineering Limitations

This project is a validation framework, not a formal proof. It does not replace model checking, protocol review, security review, storage durability testing, membership-change analysis, or production incident learning. It gives the engineering team a practical release package: what must be measured, which histories must be preserved, and what evidence is strong enough to claim safe failover for a replicated-log service.
