Concurrency Load Test and Race Condition Debugging Project
Computer engineering project for validating a concurrent service with load testing, race reproduction, instrumentation, queue checks, mitigation, retest evidence, and release decision.
This project builds a concurrency load-test and race-condition debugging package for a service that updates shared state under high request concurrency. The final deliverable is a reviewable engineering report: workload model, test harness, invariants, traces, queue checks, race reproduction, root cause, mitigation, retest evidence, and release decision.
The project is not a generic programming tutorial. It treats concurrency as an engineering validation problem: what state can be corrupted, how the failure is reproduced, which measurements prove the mechanism, and which retest evidence is strong enough for release.
Project Objective
Validate a reservation service that decrements shared inventory when concurrent clients reserve the same resource. The release decision must answer:
- Can the service preserve the inventory invariant under peak concurrent load?
- Can the test harness reproduce the suspected race condition deterministically enough for debugging?
- Which metric proves whether the failure is a race, a queueing overload, a retry duplication, or a storage transaction problem?
- Does the mitigation preserve correctness without violating latency and throughput targets?
- What evidence is required before release?
The final deliverable should be a concurrency validation report with raw data, trace examples, failure timeline, mitigation design, retest summary, residual risks, and operational monitoring triggers.
System Under Test
The simplified service has:
| Item | Project value |
|---|---|
| service function | reserve one unit from a finite inventory pool |
| state store | transactional key-value store with version field |
| workers | stateless API workers behind a load balancer |
| peak request rate | 900\ \text{requests/s} |
| target p99 latency | 150\ \text{ms} |
| maximum accepted invariant error | zero lost, duplicated, or negative inventory updates |
| allowed 5xx error rate during peak | below 0.1\% |
| release mode | staged rollout with monitoring and rollback trigger |
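The core invariant is:
I_{\text{remaining}} + N_{\text{confirmed}} = I_0
where I_0 is starting inventory, I_{\text{remaining}} is the inventory left in the store, and N_{\text{confirmed}} is the number of confirmed reservations. Any mismatch means the system has lost, duplicated, or over-applied a state transition.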
Baseline Test Plan
The test plan should include:
- a single-thread baseline to prove the functional path;
- a controlled two-client race reproducer for the same inventory key;
- a load test with realistic key popularity and burst profile;
- failure injection for retries, worker restart, storage timeout and duplicate request;
- trace capture for request id, inventory key, read version, write version, result and latency;
- invariant reconciliation after every test phase;
- a retest after mitigation under the same workload.
Do not begin with maximum load. First prove that the failure mechanism can be isolated, observed and repeated.
Step 1: Capacity Screen
The service has:
Mean service time excluding storage contention is:
Peak arrival rate:
\lambda = 900\ \text{requests/s}
Worker utilization screen:
Engineering Comment
The worker pool has nominal capacity margin. If the system still fails at peak load, the likely bottleneck is not average worker capacity. The next checks should focus on hot keys, storage transaction conflicts, retry amplification, queue growth, lock contention, and tail latency.
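The utilization screen can be sketched as a one-line calculation. This is a minimal illustration, not the project's own figures: only the 900 requests/s peak rate comes from the system table, while the worker count (24) and mean service time (20 ms) below are hypothetical placeholders to be replaced with measured values.

```python
def utilization(arrival_rate_per_s: float, service_time_s: float, workers: int) -> float:
    """Offered load divided by nominal worker capacity: rho = lambda * S / c."""
    if workers <= 0 or service_time_s <= 0:
        raise ValueError("workers and service_time_s must be positive")
    return arrival_rate_per_s * service_time_s / workers


# Hypothetical inputs: 24 workers and 20 ms service time are assumptions.
rho = utilization(arrival_rate_per_s=900.0, service_time_s=0.020, workers=24)
print(f"worker utilization = {rho:.2f}")
```

A value well below 1 only says that average worker capacity is not the constraint. It says nothing about hot keys or storage conflicts, which is why the later steps check state invariants directly.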
Step 2: Define the Race Reproducer
Use a two-client reproducer against one inventory item with starting value:
I_0 = 1
Both clients issue a reserve request for the same key. The unsafe implementation does:
- read current inventory;
- if inventory is greater than zero, compute new value;
- write the new value;
- return success.
The race timeline is:
| Time | Client A | Client B |
|---|---|---|
| t_1 | read I=1 | |
| t_2 | | read I=1 |
| t_3 | write I=0 and confirm | |
| t_4 | | write I=0 and confirm |
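Observed confirmed reservations:
N_{\text{confirmed}} = 2
Remaining inventory:
I_{\text{remaining}} = 0
Invariant check:
I_{\text{remaining}} + N_{\text{confirmed}} = 0 + 2 = 2 \neq I_0 = 1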
Engineering Comment
The invariant fails by one unit. The service did not create a negative inventory value, but it still oversold the resource because two confirmations were issued for one unit. This is a race condition in the read-modify-write sequence.
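The interleaving in the timeline can be forced rather than waited for. The sketch below is one possible reproducer, assuming an in-memory stand-in for the state store (`UnsafeInventory` and `run_two_client_race` are hypothetical names, not part of the service). A barrier holds both clients after their read, so both writes are computed from the same stale value on every run.

```python
import threading


class UnsafeInventory:
    """Hypothetical in-memory stand-in for the store, with no version check."""

    def __init__(self, units: int) -> None:
        self.units = units


def run_two_client_race(initial_units: int = 1) -> tuple[int, int]:
    """Return (confirmed reservations, remaining inventory) after the forced race."""
    store = UnsafeInventory(initial_units)
    both_have_read = threading.Barrier(2)
    confirmations: list[str] = []
    record_lock = threading.Lock()

    def reserve(client: str) -> None:
        observed = store.units            # read current inventory
        both_have_read.wait()             # hold until the other client has also read
        if observed > 0:
            store.units = observed - 1    # write computed from the stale read
            with record_lock:
                confirmations.append(client)

    threads = [threading.Thread(target=reserve, args=(c,)) for c in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(confirmations), store.units


confirmed, remaining = run_two_client_race(initial_units=1)
print(confirmed, remaining)  # prints: 2 0
```

The barrier does not fix the order of the two writes, but the outcome is the same either way: two confirmations against one unit, with the stored value ending at zero.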
Step 3: Add Observability
Record one trace row per reservation attempt.
| Field | Purpose |
|---|---|
| request id | identify duplicates and retries |
| inventory key | detect hot keys |
| worker id | correlate with process, thread and deployment version |
| read version | prove which state was observed |
| write version | prove whether the update was conditional |
| result | success, conflict, rejected, timeout or retry |
| latency | compute mean, p95, p99 and maximum |
| queue delay | separate waiting from service time |
| retry count | detect retry amplification |
Without version and result fields, the team may see only successful responses and miss that the success count violates the state invariant.
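A trace row with these fields is enough to detect the lost update mechanically. The sketch below is illustrative: the `TraceRow` type and `stale_read_successes` helper are hypothetical, and the check assumes that every successful write advances the version, so two successes that read the same version of the same key must have overwritten each other.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceRow:
    request_id: str
    inventory_key: str
    worker_id: str
    read_version: int
    write_version: Optional[int]  # None when no write was committed
    result: str                   # "success", "conflict", "rejected", "timeout" or "retry"
    latency_ms: float
    queue_delay_ms: float
    retry_count: int


def stale_read_successes(rows: list[TraceRow]) -> dict[tuple[str, int], list[str]]:
    """Group successes by (key, read version) and keep groups with more than one request."""
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for row in rows:
        if row.result == "success":
            groups[(row.inventory_key, row.read_version)].append(row.request_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


rows = [
    TraceRow("req-a", "sku-1", "w1", 7, 8, "success", 12.0, 1.0, 0),
    TraceRow("req-b", "sku-1", "w2", 7, 8, "success", 14.0, 2.0, 0),
    TraceRow("req-c", "sku-1", "w1", 8, None, "conflict", 9.0, 1.0, 0),
]
print(stale_read_successes(rows))  # {('sku-1', 7): ['req-a', 'req-b']}
```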
Step 4: Load Test the Baseline
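Run the unsafe implementation for:
T = 10\ \text{min} = 600\ \text{s}
Request rate:
\lambda = 900\ \text{requests/s}
Total attempts:
N_{\text{attempts}} = \lambda T = 900 \times 600 = 540{,}000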
The test finds:
| Metric | Observed value |
|---|---|
| confirmed reservations | 38{,}420 |
| expected confirmed reservations from inventory records | 38{,}000 |
| invariant mismatch | 420 |
| p99 latency | 132\ \text{ms} |
| 5xx error rate | 0.03\% |
Invariant error rate per confirmed reservation:
\frac{420}{38{,}420} \approx 1.09\%
Engineering Comment
Latency and server errors pass, but correctness fails. This is a release blocker. A load test that reports only p99 latency and 5xx rate would miss a severe data-integrity defect.
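The reconciliation itself is simple arithmetic and is worth automating so it runs after every test phase. A minimal sketch follows, using the baseline figures from the table. The `reconcile` helper is a hypothetical name.

```python
def reconcile(confirmed: int, expected_from_inventory: int) -> tuple[int, float]:
    """Return (mismatch, mismatch rate per confirmed reservation)."""
    mismatch = confirmed - expected_from_inventory
    rate = abs(mismatch) / confirmed if confirmed else 0.0
    return mismatch, rate


mismatch, rate = reconcile(confirmed=38_420, expected_from_inventory=38_000)
release_blocked = mismatch != 0  # the criterion is zero mismatch, not a small rate
print(mismatch, f"{rate:.2%}", release_blocked)  # prints: 420 1.09% True
```

The rate is reported for context only. The acceptance criterion in the system table is zero, so any non-zero mismatch blocks release regardless of how small the percentage looks.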
Step 5: Mitigate with Conditional Update
Replace the unsafe write with a conditional update:
- read inventory value and version;
- compute reservation only if inventory is available;
- write only if the stored version still equals the read version;
- if the version changed, retry with a bounded retry policy or return a conflict;
- make request ids idempotent so client retries do not duplicate reservations.
The corrected update should satisfy:
and:
only for one successful version transition.
Engineering Comment
The fix is not simply “add retries.” Retries without idempotency can duplicate side effects. The mitigation must protect the shared state transition and the client-visible confirmation record.
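The steps above can be sketched against an in-memory stand-in for the versioned store. Everything named here (`VersionedStore`, `snapshot`, `conditional_write`, `reserve`) is hypothetical. A real store would provide the atomic conditional write itself, and the lock below only simulates that atomicity.

```python
import threading


class VersionedStore:
    """Hypothetical stand-in for a transactional key-value store with a version field."""

    def __init__(self, units: int) -> None:
        self.units = units
        self.version = 0
        self.confirmed: set[str] = set()   # request ids already confirmed
        self._lock = threading.Lock()      # simulates the store's atomicity

    def snapshot(self, request_id: str) -> tuple[int, int, bool]:
        with self._lock:
            return self.units, self.version, request_id in self.confirmed

    def conditional_write(self, request_id: str, expected_version: int, new_units: int) -> str:
        with self._lock:
            if request_id in self.confirmed:
                return "success"           # idempotent replay, no second decrement
            if self.version != expected_version:
                return "conflict"          # another writer got there first
            self.units = new_units
            self.version += 1
            self.confirmed.add(request_id)
            return "success"


def reserve(store: VersionedStore, request_id: str, max_retries: int = 2) -> str:
    """Reserve one unit with a bounded number of retries on version conflict."""
    for _ in range(max_retries + 1):
        units, version, already_confirmed = store.snapshot(request_id)
        if already_confirmed:
            return "success"
        if units <= 0:
            return "rejected"
        outcome = store.conditional_write(request_id, version, units - 1)
        if outcome != "conflict":
            return outcome
    return "conflict"
```

Recording the request id inside the same atomic step as the inventory change is the design choice that matters: it ties the client-visible confirmation to exactly one version transition, so a replayed request id cannot decrement inventory twice.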
Step 6: Check Retry Amplification
After mitigation, conflict rate under hot-key load is:
The policy permits at most:
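A simple expected-attempts screen for independent retry conflicts, with per-attempt conflict probability p and at most r retries, is:
E[A] = \sum_{k=0}^{r} p^k = \frac{1 - p^{r+1}}{1 - p}
Effective storage attempt rate at peak:
\lambda_{\text{storage}} = 900 \times E[A] \approx 900 \times 1.212 \approx 1{,}091\ \text{attempts/s}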
Engineering Comment
The mitigation adds about 21.2\% storage attempt load under this conflict rate. If the storage system cannot absorb that load, the project needs backpressure, hot-key sharding, admission control, queue limits, or a different reservation design.
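The amplification figure can be checked with a short calculation. The sketch below assumes each attempt conflicts independently with the same probability and that a request stops after a fixed number of retries. The inputs 0.18 and 2 are assumptions chosen because they reproduce the roughly 21.2% overhead quoted above. The project's measured conflict rate and configured retry limit should be substituted.

```python
def expected_attempts(conflict_probability: float, max_retries: int) -> float:
    """Expected storage attempts per request: sum of p**k for k = 0..max_retries."""
    if not 0.0 <= conflict_probability < 1.0:
        raise ValueError("conflict_probability must be in [0, 1)")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(conflict_probability ** k for k in range(max_retries + 1))


# Assumed inputs (not stated in the text): p = 0.18, at most 2 retries.
attempts = expected_attempts(conflict_probability=0.18, max_retries=2)
storage_rate = 900 * attempts  # 900 requests/s is the peak rate from the system table
print(f"{attempts:.4f} attempts/request, {storage_rate:.0f} storage attempts/s")
```

Real conflicts on a hot key are usually correlated rather than independent, so this screen is a lower bound on what a burst can produce, not a prediction.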
Step 7: Retest Evidence
Run the same 10-minute peak test after the conditional update and idempotency change.
| Metric | Baseline | After mitigation | Criterion |
|---|---|---|---|
| invariant mismatch | 420 | 0 | 0 |
| p99 latency | 132\ \text{ms} | 141\ \text{ms} | below 150\ \text{ms} |
| 5xx error rate | 0.03\% | 0.04\% | below 0.1\% |
| conflict responses | not measured | 6.8\% | reported and bounded |
| duplicate confirmations | not measured | 0 | 0 |
| maximum queue delay | 28\ \text{ms} | 35\ \text{ms} | below 50\ \text{ms} |
The corrected test passes the stated release criteria.
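The pass or fail judgment can be encoded so that it is applied identically to every run. This is a sketch under stated assumptions: the metric names are hypothetical, the thresholds are the criteria from the table, and a metric that was not measured is treated as a failure. The conflict-response criterion is omitted because the table gives no numeric bound for it.

```python
from typing import Callable, Mapping, Optional

# Thresholds taken from the Criterion column; key names are illustrative.
CRITERIA: dict[str, Callable[[float], bool]] = {
    "invariant_mismatch": lambda v: v == 0,
    "p99_latency_ms": lambda v: v < 150,
    "error_5xx_pct": lambda v: v < 0.1,
    "duplicate_confirmations": lambda v: v == 0,
    "max_queue_delay_ms": lambda v: v < 50,
}


def failed_criteria(metrics: Mapping[str, Optional[float]]) -> list[str]:
    """Return the names of criteria that fail or were not measured."""
    failures = []
    for name, passes in CRITERIA.items():
        value = metrics.get(name)
        if value is None or not passes(value):
            failures.append(name)
    return failures


baseline = {"invariant_mismatch": 420, "p99_latency_ms": 132, "error_5xx_pct": 0.03,
            "duplicate_confirmations": None, "max_queue_delay_ms": 28}
after = {"invariant_mismatch": 0, "p99_latency_ms": 141, "error_5xx_pct": 0.04,
         "duplicate_confirmations": 0, "max_queue_delay_ms": 35}

print(failed_criteria(baseline))  # ['invariant_mismatch', 'duplicate_confirmations']
print(failed_criteria(after))     # []
```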
Engineering Comment
The retest should use the same workload seed or a documented equivalent workload. Otherwise a lower hot-key collision rate could make the fix look better than it is.
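One way to make the workload repeatable is to generate the key sequence from a fixed seed. The sketch below is an assumption about how such a generator might look, not the project's harness: `hot_key_workload`, the key naming, and the power-law skew are all illustrative choices.

```python
import random


def hot_key_workload(seed: int, n_requests: int, n_keys: int = 100, skew: float = 1.2) -> list[str]:
    """Return a reproducible sequence of inventory keys with power-law popularity."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the output
    keys = [f"item-{i}" for i in range(n_keys)]
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    return rng.choices(keys, weights=weights, k=n_requests)


baseline_run = hot_key_workload(seed=42, n_requests=1000)
retest_run = hot_key_workload(seed=42, n_requests=1000)
print(baseline_run == retest_run)  # prints: True
```

Recording the seed, key count and skew alongside the results lets a reviewer confirm that the baseline and the retest faced the same hot-key collision pressure.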
Step 8: Release Decision
Prepare the release package.
| Release item | Status | Evidence |
|---|---|---|
| race reproduction | closed | two-client trace shows unsafe interleaving |
| root cause | closed | read-modify-write without conditional version check |
| mitigation | closed | conditional update, idempotency key and bounded retry |
| invariant under load | pass | zero mismatch in repeated peak tests |
| latency target | pass | p99 below 150\ \text{ms} after mitigation |
| retry amplification | conditional pass | storage attempt load within observed margin |
| observability | pass | request id, key, version, result, retry and queue metrics present |
| rollout | conditional pass | canary with invariant mismatch rollback trigger |
Recommended decision:
Release to canary only. Expand rollout if invariant mismatch remains zero, p99 latency stays below target, storage conflict rate remains within the tested envelope, and duplicate confirmations remain zero.
Monitoring Triggers
Production monitoring should alert on:
- invariant mismatch greater than zero;
- duplicate request id with multiple confirmations;
- inventory key conflict rate above the tested envelope;
- retry attempts per request above the validated limit;
- p99 latency above target;
- queue delay above target;
- storage conditional-write failures that rise faster than traffic;
- canary error rate or rollback metric breach.
Common Failure Modes
Common failures in concurrency validation include:
- testing average throughput but not state invariants;
- reproducing the bug manually but not capturing the interleaving trace;
- fixing one code path while retries or cancellation still duplicate side effects;
- treating conflict retries as free load;
- using random load that does not create hot-key collisions;
- validating a single worker but releasing multiple workers;
- missing version, request id or queue delay in telemetry;
- rolling out without a rollback trigger tied to correctness.
Limitations
This project uses a simplified reservation service. Real systems may need stronger transaction semantics, distributed locks, consensus, partition handling, schema migration checks, clock-drift review, storage failover tests, or formal verification for critical invariants. The core workflow still transfers: define invariants, reproduce the race, instrument causality, mitigate the shared-state transition, retest under representative load, and release only with monitoring tied to the failure mode.