Concurrency Load Test and Race Condition Debugging Project

Computer engineering project for validating a concurrent service with load testing, race reproduction, instrumentation, queue checks, mitigation, retest evidence, and release decision.

This project builds a concurrency load-test and race-condition debugging package for a service that updates shared state under high request concurrency. The final deliverable is a reviewable engineering report: workload model, test harness, invariants, traces, queue checks, race reproduction, root cause, mitigation, retest evidence, and release decision.

The project is not a generic programming tutorial. It treats concurrency as an engineering validation problem: what state can be corrupted, how the failure is reproduced, which measurements prove the mechanism, and which retest evidence is strong enough for release.

Project Objective

Validate a reservation service that decrements shared inventory when concurrent clients reserve the same resource. The release decision must answer:

  1. Can the service preserve the inventory invariant under peak concurrent load?
  2. Can the test harness reproduce the suspected race condition deterministically enough for debugging?
  3. Which metrics distinguish a race from queueing overload, retry duplication, or a storage transaction problem?
  4. Does the mitigation preserve correctness without violating latency and throughput targets?
  5. What evidence is required before release?

The final deliverable should be a concurrency validation report with raw data, trace examples, failure timeline, mitigation design, retest summary, residual risks, and operational monitoring triggers.

System Under Test

The simplified service has:

| Item | Project value |
|---|---|
| service function | reserve one unit from a finite inventory pool |
| state store | transactional key-value store with version field |
| workers | stateless API workers behind a load balancer |
| peak request rate | 900\ \text{requests/s} |
| target p99 latency | 150\ \text{ms} |
| maximum accepted invariant error | zero lost, duplicated, or negative inventory updates |
| allowed 5xx error rate during peak | below 0.1\% |
| release mode | staged rollout with monitoring and rollback trigger |

The core invariant is:

I_0=I_{remaining}+N_{confirmed}+N_{cancelled\ after\ restore}

where I_0 is starting inventory. Any mismatch means the system has lost, duplicated, or over-applied a state transition.
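The reconciliation can be expressed as a small check that runs after every test phase. The following is a minimal sketch; the `InventorySnapshot` type and its field names are hypothetical and stand in for whatever counts the harness actually collects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventorySnapshot:
    """Counts reconciled after a test phase (hypothetical field names)."""
    starting: int                      # I_0
    remaining: int                     # I_remaining
    confirmed: int                     # N_confirmed
    cancelled_after_restore: int = 0   # N_cancelled after restore


def invariant_residual(s: InventorySnapshot) -> int:
    """Return I_0 - (I_remaining + N_confirmed + N_cancelled after restore).

    Zero means the invariant holds. A negative value means more units were
    accounted for than the starting inventory; a positive value means units
    were lost.
    """
    return s.starting - (s.remaining + s.confirmed + s.cancelled_after_restore)


if __name__ == "__main__":
    consistent = InventorySnapshot(starting=100, remaining=60, confirmed=40)
    oversold = InventorySnapshot(starting=1, remaining=0, confirmed=2)
    print(invariant_residual(consistent))  # 0
    print(invariant_residual(oversold))    # -1
```

Returning the signed residual rather than a boolean keeps the direction of the error visible in the report.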

Baseline Test Plan

The test plan should include:

  1. a single-thread baseline to prove the functional path;
  2. a controlled two-client race reproducer for the same inventory key;
  3. a load test with realistic key popularity and burst profile;
  4. failure injection for retries, worker restart, storage timeout and duplicate request;
  5. trace capture for request id, inventory key, read version, write version, result and latency;
  6. invariant reconciliation after every test phase;
  7. a retest after mitigation under the same workload.

Do not begin with maximum load. First prove that the failure mechanism can be isolated, observed and repeated.

Step 1: Capacity Screen

The service has:

c=6\ \text{workers}

Mean service time excluding storage contention is:

S=4.0\ \text{ms}=0.004\ \text{s}

Peak arrival rate:

\lambda=900\ \text{requests/s}

Worker utilization screen:

\displaystyle \rho=\frac{\lambda S}{c}
\displaystyle \rho=\frac{900(0.004)}{6}=0.60

Engineering Comment

With \rho=0.60 the worker pool has nominal capacity margin, so if the system still fails at peak load, the bottleneck is unlikely to be average worker capacity. Because this screen excludes storage contention, the next checks should focus on hot keys, storage transaction conflicts, retry amplification, queue growth, lock contention, and tail latency.
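The utilization screen in this step can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above; the function names are illustrative, and the saturation rate c/S is added for context, not as a measured limit.

```python
def worker_utilization(arrival_rate: float, service_time_s: float, workers: int) -> float:
    """rho = lambda * S / c, ignoring storage contention."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return arrival_rate * service_time_s / workers


def saturation_rate(service_time_s: float, workers: int) -> float:
    """Arrival rate at which rho reaches 1.0 (c / S)."""
    return workers / service_time_s


rho = worker_utilization(arrival_rate=900.0, service_time_s=0.004, workers=6)
print(f"rho = {rho:.2f}")                                           # rho = 0.60
print(f"saturation = {saturation_rate(0.004, 6):.0f} requests/s")   # saturation = 1500 requests/s
```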

Step 2: Define the Race Reproducer

Use a two-client reproducer against one inventory item with starting value:

I_0=1

Both clients issue a reserve request for the same key. The unsafe implementation does:

  1. read current inventory;
  2. if inventory is greater than zero, compute new value;
  3. write the new value;
  4. return success.

The race timeline is:

| Time | Client A | Client B |
|---|---|---|
| t_1 | read I=1 | |
| t_2 | | read I=1 |
| t_3 | write I=0 and confirm | |
| t_4 | | write I=0 and confirm |

Observed confirmed reservations:

N_{confirmed}=2

Remaining inventory:

I_{remaining}=0

Invariant check:

I_0-I_{remaining}-N_{confirmed}=1-0-2=-1

Engineering Comment

The invariant fails by one unit. The service did not create a negative inventory value, but it still oversold the resource because two confirmations were issued for one unit. This is a race condition in the read-modify-write sequence.
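The interleaving above can be forced deterministically in a test harness. The sketch below is an in-process illustration using Python threads, not the service itself: the `UnsafeInventory` class and the barrier hook are hypothetical, and the barrier exists only to hold both clients between their read and their write.

```python
import threading


class UnsafeInventory:
    """Read-modify-write without any version check (the unsafe path)."""

    def __init__(self, units: int) -> None:
        self.units = units
        self.confirmed = 0
        self._count_lock = threading.Lock()  # protects only the confirmation counter

    def reserve(self, after_read) -> bool:
        observed = self.units            # 1. read current inventory
        after_read()                     # test hook: wait until both clients have read
        if observed > 0:                 # 2. decide using the (now stale) value
            self.units = observed - 1    # 3. unconditional write
            with self._count_lock:
                self.confirmed += 1
            return True                  # 4. return success
        return False


def reproduce_race() -> tuple:
    inventory = UnsafeInventory(units=1)
    barrier = threading.Barrier(2, timeout=5)
    threads = [
        threading.Thread(target=inventory.reserve, args=(barrier.wait,))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return inventory.units, inventory.confirmed


if __name__ == "__main__":
    remaining, confirmed = reproduce_race()
    print(remaining, confirmed)            # 0 2
    print(1 - remaining - confirmed)       # -1
```

Neither client can pass the barrier until both have read, so both always observe I=1 and the oversell does not depend on scheduler timing.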

Step 3: Add Observability

Record one trace row per reservation attempt.

| Field | Purpose |
|---|---|
| request id | identify duplicates and retries |
| inventory key | detect hot keys |
| worker id | correlate with process, thread and deployment version |
| read version | prove which state was observed |
| write version | prove whether the update was conditional |
| result | success, conflict, rejected, timeout or retry |
| latency | mean, p95, p99 and maximum |
| queue delay | separate waiting from service time |
| retry count | detect retry amplification |

Without version and result fields, the team may see only successful responses and miss that the success count violates the state invariant.
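One way to make the trace row concrete is a small record type plus a query that looks for the race signature: more than one successful write based on the same read version of the same key. This is a sketch only; the `TraceRow` field names and the `stale_successes` helper are hypothetical, not an existing schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class TraceRow:
    request_id: str
    inventory_key: str
    worker_id: str
    read_version: int
    write_version: Optional[int]   # None when no write was attempted
    result: str                    # success, conflict, rejected, timeout or retry
    latency_ms: float
    queue_delay_ms: float
    retry_count: int


def stale_successes(rows: Iterable[TraceRow]) -> Dict[Tuple[str, int], int]:
    """Return (key, read version) pairs that produced more than one success.

    With a conditional update, each version of a key can be consumed by at
    most one successful write, so any entry here points at an unsafe write.
    """
    counts = Counter(
        (r.inventory_key, r.read_version) for r in rows if r.result == "success"
    )
    return {pair: n for pair, n in counts.items() if n > 1}


if __name__ == "__main__":
    rows = [
        TraceRow("req-A", "item-1", "w1", 7, 8, "success", 12.0, 1.5, 0),
        TraceRow("req-B", "item-1", "w2", 7, 8, "success", 13.0, 2.0, 0),
        TraceRow("req-C", "item-2", "w1", 3, 4, "success", 9.0, 0.5, 0),
    ]
    print(stale_successes(rows))  # {('item-1', 7): 2}
```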

Step 4: Load Test the Baseline

Run the unsafe implementation for:

T=10\ \text{min}=600\ \text{s}

Request rate:

\lambda=900\ \text{requests/s}

Total attempts:

N_{attempts}=\lambda T=900(600)=540{,}000

The test finds:

| Metric | Observed value |
|---|---|
| confirmed reservations | 38{,}420 |
| expected confirmed reservations from inventory records | 38{,}000 |
| invariant mismatch | 420 |
| p99 latency | 132\ \text{ms} |
| 5xx error rate | 0.03\% |

Invariant error rate per confirmed reservation:

\displaystyle E_I=\frac{420}{38{,}420}\times100=1.09\%

Engineering Comment

Latency and server errors pass, but correctness fails. This is a release blocker. A load test that reports only p99 latency and 5xx rate would miss a severe data-integrity defect.
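The baseline figures can be recomputed from the raw counts so the report does not depend on hand arithmetic. A minimal sketch, with an illustrative function name:

```python
from typing import Tuple


def invariant_error(confirmed: int, expected: int) -> Tuple[int, float]:
    """Return (mismatch, mismatch as a percentage of confirmed reservations)."""
    mismatch = confirmed - expected
    return mismatch, 100.0 * mismatch / confirmed


attempts = 900 * 600                                 # lambda * T
mismatch, rate_pct = invariant_error(38_420, 38_000)
print(attempts)                                      # 540000
print(mismatch, f"{rate_pct:.2f}%")                  # 420 1.09%
```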

Step 5: Mitigate with Conditional Update

Replace the unsafe write with a conditional update:

  1. read inventory value and version;
  2. compute reservation only if inventory is available;
  3. write only if the stored version still equals the read version;
  4. if the version changed, retry with a bounded retry policy or return a conflict;
  5. make request ids idempotent so client retries do not duplicate reservations.

The corrected update should satisfy:

\text{write succeeds} \Rightarrow version_{stored}=version_{read}

and:

I_{new}=I_{old}-1

only for one successful version transition.

Engineering Comment

The fix is not simply “add retries.” Retries without idempotency can duplicate side effects. The mitigation must protect the shared state transition and the client-visible confirmation record.
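The mitigation steps can be sketched against an in-memory stand-in for the store. Everything here is an assumption made for illustration: `VersionedStore`, `conditional_decrement`, and `reserve` are hypothetical names, a lock simulates the store's transaction, and only successful outcomes are recorded against the request id.

```python
import threading
from typing import Dict, Optional, Tuple


class VersionedStore:
    """In-memory stand-in for a transactional key-value store with versions."""

    def __init__(self, initial: Dict[str, int]) -> None:
        self._lock = threading.Lock()
        self._items: Dict[str, Tuple[int, int]] = {k: (v, 1) for k, v in initial.items()}
        self._applied: Dict[str, str] = {}  # request id -> recorded outcome

    def read(self, key: str) -> Tuple[int, int]:
        with self._lock:
            return self._items[key]         # (value, version)

    def outcome_for(self, request_id: str) -> Optional[str]:
        with self._lock:
            return self._applied.get(request_id)

    def conditional_decrement(self, key: str, read_version: int, request_id: str) -> str:
        """Decrement only if the stored version still equals read_version.

        The idempotency record is written in the same critical section as the
        inventory update, so one request id can confirm at most once.
        """
        with self._lock:
            if request_id in self._applied:
                return self._applied[request_id]
            value, version = self._items[key]
            if version != read_version:
                return "conflict"
            self._items[key] = (value - 1, version + 1)
            self._applied[request_id] = "success"
            return "success"


def reserve(store: VersionedStore, request_id: str, key: str, max_retries: int = 2) -> str:
    for _ in range(1 + max_retries):        # first attempt plus bounded retries
        prior = store.outcome_for(request_id)
        if prior is not None:
            return prior                    # replayed request: no second decrement
        value, version = store.read(key)
        if value <= 0:
            return "rejected"
        outcome = store.conditional_decrement(key, version, request_id)
        if outcome != "conflict":
            return outcome
    return "conflict"


if __name__ == "__main__":
    store = VersionedStore({"item-1": 1})
    print(reserve(store, "req-A", "item-1"))  # success
    print(reserve(store, "req-A", "item-1"))  # success (replay, inventory unchanged)
    print(reserve(store, "req-B", "item-1"))  # rejected
    print(store.read("item-1"))               # (0, 2)
```

Keeping the idempotency record and the inventory write in one critical section is the design point: if they were separate steps, two concurrent copies of the same request could both pass the duplicate check.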

Step 6: Check Retry Amplification

After mitigation, conflict rate under hot-key load is:

p_c=0.18

The policy permits at most:

r=2\ \text{retries}

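A simple expected-attempts screen, assuming each attempt conflicts independently with probability p_c, counts the first attempt (always made), a second attempt with probability p_c, and a third with probability p_c^2: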

E[A]=1+p_c+p_c^2
E[A]=1+0.18+0.18^2=1.212

Effective storage attempt rate at peak:

\lambda_{store}=900(1.212)\approx 1091\ \text{attempts/s}

Engineering Comment

The mitigation adds about 21.2\% storage attempt load under this conflict rate. If the storage system cannot absorb that load, the project needs backpressure, hot-key sharding, admission control, queue limits, or a different reservation design.
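The retry screen generalizes to any retry bound, which makes it easy to see how the storage load moves if the conflict rate or the policy changes. A sketch under the same independence assumption; the function name is illustrative.

```python
def expected_attempts(conflict_prob: float, max_retries: int) -> float:
    """E[A] = sum of p_c**k for k = 0..r, assuming independent conflicts."""
    if not 0.0 <= conflict_prob <= 1.0:
        raise ValueError("conflict_prob must be in [0, 1]")
    return sum(conflict_prob ** k for k in range(max_retries + 1))


attempts_per_request = expected_attempts(conflict_prob=0.18, max_retries=2)
store_rate = 900 * attempts_per_request
print(f"{attempts_per_request:.4f}")          # 1.2124
print(f"{store_rate:.0f} attempts/s")         # 1091 attempts/s
```

Real conflicts on a hot key are usually correlated, so this figure is better read as a lower-bound screen than as a prediction.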

Step 7: Retest Evidence

Run the same 10-minute peak test after the conditional update and idempotency change.

| Metric | Baseline | After mitigation | Criterion |
|---|---|---|---|
| invariant mismatch | 420 | 0 | 0 |
| p99 latency | 132\ \text{ms} | 141\ \text{ms} | below 150\ \text{ms} |
| 5xx error rate | 0.03\% | 0.04\% | below 0.1\% |
| conflict responses | not measured | 6.8\% | reported and bounded |
| duplicate confirmations | not measured | 0 | 0 |
| maximum queue delay | 28\ \text{ms} | 35\ \text{ms} | below 50\ \text{ms} |

The corrected test passes the stated release criteria.
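The pass/fail judgment can be automated so each retest is scored against the same criteria. The sketch below encodes the criteria from the retest table; the metric key names are hypothetical, and because the table gives no numeric bound for conflict responses, that check only verifies the metric was reported.

```python
from typing import Dict, Optional


def release_checks(m: Dict[str, Optional[float]]) -> Dict[str, bool]:
    """Evaluate one test run against the stated release criteria."""
    return {
        "invariant_mismatch": m["invariant_mismatch"] == 0,
        "p99_latency_ms": m["p99_latency_ms"] < 150.0,
        "error_5xx_pct": m["error_5xx_pct"] < 0.1,
        # No numeric bound is stated, so only presence is checked here.
        "conflict_responses_pct": m["conflict_responses_pct"] is not None,
        "duplicate_confirmations": m["duplicate_confirmations"] == 0,
        "max_queue_delay_ms": m["max_queue_delay_ms"] < 50.0,
    }


baseline = {
    "invariant_mismatch": 420, "p99_latency_ms": 132.0, "error_5xx_pct": 0.03,
    "conflict_responses_pct": None, "duplicate_confirmations": None,
    "max_queue_delay_ms": 28.0,
}
after_mitigation = {
    "invariant_mismatch": 0, "p99_latency_ms": 141.0, "error_5xx_pct": 0.04,
    "conflict_responses_pct": 6.8, "duplicate_confirmations": 0,
    "max_queue_delay_ms": 35.0,
}

failed_baseline = [name for name, ok in release_checks(baseline).items() if not ok]
print(failed_baseline)
# ['invariant_mismatch', 'conflict_responses_pct', 'duplicate_confirmations']
print(all(release_checks(after_mitigation).values()))  # True
```

Treating "not measured" as a failure is deliberate: an unmeasured correctness metric should not count as a pass.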

Engineering Comment

The retest should use the same workload seed or a documented equivalent workload. Otherwise a lower hot-key collision rate could make the fix look better than it is.
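A seeded generator is one way to make the baseline and the retest hit the same key sequence. The sketch below assumes a Zipf-like popularity curve, which is an illustrative choice and not a measured profile; the function name, key naming, and `skew` parameter are hypothetical.

```python
import random
from collections import Counter
from typing import List


def hot_key_workload(seed: int, n_requests: int, n_keys: int, skew: float = 1.1) -> List[str]:
    """Return a reproducible key sequence with Zipf-like popularity.

    Key rank k is drawn with weight 1 / k**skew, so low-rank keys are hot.
    """
    rng = random.Random(seed)  # private generator: unaffected by other random use
    keys = [f"item-{i}" for i in range(n_keys)]
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    return rng.choices(keys, weights=weights, k=n_requests)


if __name__ == "__main__":
    baseline_run = hot_key_workload(seed=42, n_requests=10_000, n_keys=100)
    retest_run = hot_key_workload(seed=42, n_requests=10_000, n_keys=100)
    print(baseline_run == retest_run)                 # True
    print(Counter(baseline_run).most_common(1)[0][0])  # item-0
```

Recording the seed and the skew in the report lets a reviewer confirm that the retest faced the same hot-key collision pressure as the baseline.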

Step 8: Release Decision

Prepare the release package.

| Release item | Status | Evidence |
|---|---|---|
| race reproduction | closed | two-client trace shows unsafe interleaving |
| root cause | closed | read-modify-write without conditional version check |
| mitigation | closed | conditional update, idempotency key and bounded retry |
| invariant under load | pass | zero mismatch in repeated peak tests |
| latency target | pass | p99 below 150\ \text{ms} after mitigation |
| retry amplification | conditional pass | storage attempt load within observed margin |
| observability | pass | request id, key, version, result, retry and queue metrics present |
| rollout | conditional pass | canary with invariant mismatch rollback trigger |

Recommended decision:

Release to canary only. Expand rollout if invariant mismatch remains zero, p99 latency stays below target, storage conflict rate remains within the tested envelope, and duplicate confirmations remain zero.

Monitoring Triggers

Production monitoring should alert on:

  • invariant mismatch greater than zero;
  • duplicate request id with multiple confirmations;
  • inventory key conflict rate above the tested envelope;
  • retry attempts per request above the validated limit;
  • p99 latency above target;
  • queue delay above target;
  • storage conditional-write failures that rise faster than traffic;
  • canary error rate or rollback metric breach.
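The triggers above can be evaluated from one metrics sample per window. This is a sketch with hypothetical metric names; the latency, queue, retry, and 5xx limits come from the project targets, while the conflict-rate envelope is a placeholder that should be replaced by whatever the retest actually validated.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Envelope:
    p99_ms: float = 150.0
    queue_delay_ms: float = 50.0
    max_retries: int = 2
    error_5xx_pct: float = 0.1
    conflict_rate: float = 0.18   # placeholder: set from the validated retest


def fired_alerts(m: Dict[str, float], env: Envelope = Envelope()) -> List[str]:
    """Return the names of triggers that fire for one metrics window."""
    # "Rising faster than traffic": compare growth ratios over the same window.
    cas_growth = m["cas_failures_now"] / max(m["cas_failures_prev"], 1)
    traffic_growth = m["requests_now"] / max(m["requests_prev"], 1)
    checks = [
        ("invariant_mismatch", m["invariant_mismatch"] > 0),
        ("duplicate_confirmations", m["duplicate_confirmations"] > 0),
        ("conflict_rate", m["conflict_rate"] > env.conflict_rate),
        ("retries_per_request", m["max_retries_seen"] > env.max_retries),
        ("p99_latency", m["p99_ms"] > env.p99_ms),
        ("queue_delay", m["queue_delay_ms"] > env.queue_delay_ms),
        ("cas_failures_outpacing_traffic", cas_growth > traffic_growth),
        ("canary_5xx", m["error_5xx_pct"] >= env.error_5xx_pct),
    ]
    return [name for name, fired in checks if fired]


healthy = {
    "invariant_mismatch": 0, "duplicate_confirmations": 0, "conflict_rate": 0.12,
    "max_retries_seen": 2, "p99_ms": 141.0, "queue_delay_ms": 35.0,
    "cas_failures_prev": 100, "cas_failures_now": 105,
    "requests_prev": 50_000, "requests_now": 55_000, "error_5xx_pct": 0.04,
}
degraded = dict(healthy, invariant_mismatch=1, p99_ms=162.0)

print(fired_alerts(healthy))    # []
print(fired_alerts(degraded))   # ['invariant_mismatch', 'p99_latency']
```

In a canary, the first two triggers are the natural rollback conditions because they indicate a correctness failure, not a performance regression.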

Common Failure Modes

Common failures in concurrency validation include:

  • testing average throughput but not state invariants;
  • reproducing the bug manually but not capturing the interleaving trace;
  • fixing one code path while retries or cancellation still duplicate side effects;
  • treating conflict retries as free load;
  • using random load that does not create hot-key collisions;
  • validating a single worker but releasing multiple workers;
  • missing version, request id or queue delay in telemetry;
  • rolling out without a rollback trigger tied to correctness.

Limitations

This project uses a simplified reservation service. Real systems may need stronger transaction semantics, distributed locks, consensus, partition handling, schema migration checks, clock-drift review, storage failover tests, or formal verification for critical invariants. The core workflow still transfers: define invariants, reproduce the race, instrument causality, mitigate the shared-state transition, retest under representative load, and release only with monitoring tied to the failure mode.
