
Concurrency Load Test and Race Condition Debugging Project

Computer engineering project for validating a concurrent service through load testing, race reproduction, instrumentation, queue checks, mitigation, retest evidence, and a release decision.

Branch: Computer Engineering
Content: Project
Updated: Jun 23, 2026
Revision: v1.0.0 · reviewed

This project builds a concurrency load-test and race-condition debugging package for a service that updates shared state under high request concurrency. The final deliverable is a reviewable engineering report: workload model, test harness, invariants, traces, queue checks, race reproduction, root cause, mitigation, retest evidence, and release decision.

The project is not a generic programming tutorial. It treats concurrency as an engineering validation problem: what state can be corrupted, how the failure is reproduced, which measurements prove the mechanism, and which retest evidence is strong enough for release.

Project Objective

Validate a reservation service that decrements shared inventory when concurrent clients reserve the same resource. The release decision must answer:

Can the service preserve the inventory invariant under peak concurrent load?
Can the test harness reproduce the suspected race condition deterministically enough for debugging?
Which metric proves whether the failure is a race, a queueing overload, a retry duplication, or a storage transaction problem?
Does the mitigation preserve correctness without violating latency and throughput targets?
What evidence is required before release?

The final deliverable should be a concurrency validation report with raw data, trace examples, failure timeline, mitigation design, retest summary, residual risks, and operational monitoring triggers.

System Under Test

The simplified service has:

Item	Project value
service function	reserve one unit from a finite inventory pool
state store	transactional key-value store with version field
workers	stateless API workers behind a load balancer
peak request rate	$900\ \text{requests/s}$
target p99 latency	$150\ \text{ms}$
maximum accepted invariant error	zero lost, duplicated, or negative inventory updates
allowed 5xx error rate during peak	below $0.1\%$
release mode	staged rollout with monitoring and rollback trigger

The core invariant is:

I_0=I_{remaining}+N_{confirmed}+N_{cancelled\ after\ restore}

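where $I_0$ is the starting inventory, $I_{remaining}$ is the stored inventory, and the $N$ terms are counted from the reservation records at reconciliation time. Any mismatch means the system has lost, duplicated, or over-applied a state transition.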
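The reconciliation can be sketched as a small residual check. This is a minimal illustration, not the project's actual tooling: the `InventorySnapshot` type, its field names, and the sample counts are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventorySnapshot:
    """Counts collected after a test phase (hypothetical field names)."""
    initial: int                       # I_0
    remaining: int                     # I_remaining
    confirmed: int                     # N_confirmed
    cancelled_after_restore: int = 0   # N_cancelled after restore


def invariant_residual(s: InventorySnapshot) -> int:
    """Return I_0 minus the right-hand side of the invariant; 0 means it holds."""
    return s.initial - (s.remaining + s.confirmed + s.cancelled_after_restore)


# Illustrative counts only, not measured data.
healthy = InventorySnapshot(initial=100, remaining=60, confirmed=40)
oversold = InventorySnapshot(initial=1, remaining=0, confirmed=2)

print(invariant_residual(healthy))   # 0
print(invariant_residual(oversold))  # -1
```

A nonzero residual is treated as a failure regardless of sign: a negative value indicates more confirmations than units, a positive value indicates units that disappeared without a matching record.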

Baseline Test Plan

The test plan should include:

a single-thread baseline to prove the functional path;
a controlled two-client race reproducer for the same inventory key;
a load test with realistic key popularity and burst profile;
failure injection for retries, worker restart, storage timeout and duplicate request;
trace capture for request id, inventory key, read version, write version, result and latency;
invariant reconciliation after every test phase;
a retest after mitigation under the same workload.

Do not begin with maximum load. First prove that the failure mechanism can be isolated, observed and repeated.

Step 1: Capacity Screen

The service has:

c=6\ \text{workers}

Mean service time excluding storage contention is:

S=4.0\ \text{ms}=0.004\ \text{s}

Peak arrival rate:

\lambda=900\ \text{requests/s}

Worker utilization screen:

\displaystyle \rho=\frac{\lambda S}{c}

\displaystyle \rho=\frac{900(0.004)}{6}=0.60

Engineering Comment

At $\rho=0.60$ the worker pool has nominal capacity margin. If the system still fails at peak load, the likely bottleneck is not average worker capacity. The next checks should focus on hot keys, storage transaction conflicts, retry amplification, queue growth, lock contention, and tail latency.
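The utilization screen is simple enough to keep as a helper in the test harness so it can be rerun when the worker count or service time changes. The function name below is an assumption; the inputs are the values stated in this step.

```python
def worker_utilization(arrival_rate: float, service_time_s: float, workers: int) -> float:
    """Nominal utilization rho = lambda * S / c for a pool of identical workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return arrival_rate * service_time_s / workers


rho = worker_utilization(arrival_rate=900.0, service_time_s=0.004, workers=6)
print(f"{rho:.2f}")  # 0.60
```

The screen uses mean service time only, so it says nothing about contention on a single key or about tail latency.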

Step 2: Define the Race Reproducer

Use a two-client reproducer against one inventory item with starting value:

I_0=1

Both clients issue a reserve request for the same key. The unsafe implementation does:

read current inventory;
if inventory is greater than zero, compute new value;
write the new value;
return success.

The race timeline is:

Time	Client A	Client B
$t_1$	read $I=1$
$t_2$		read $I=1$
$t_3$	write $I=0$ and confirm
$t_4$		write $I=0$ and confirm

Observed confirmed reservations:

N_{confirmed}=2

Remaining inventory:

I_{remaining}=0

Invariant check:

I_0-I_{remaining}-N_{confirmed}=1-0-2=-1

Engineering Comment

The invariant fails by one unit. The service did not create a negative inventory value, but it still oversold the resource because two confirmations were issued for one unit. This is a race condition in the read-modify-write sequence.
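The interleaving in the timeline can be forced rather than hoped for. The sketch below assumes an in-memory stand-in for the service (`UnsafeInventory` is a hypothetical class, not the real implementation) and uses a `threading.Barrier` so that both clients finish their read before either one writes.

```python
import threading


class UnsafeInventory:
    """In-memory stand-in for the unsafe read-modify-write path (hypothetical)."""

    def __init__(self, units: int) -> None:
        self.units = units
        self.confirmed = 0
        self._count_lock = threading.Lock()  # protects only the test counter

    def reserve(self, both_have_read: threading.Barrier) -> bool:
        observed = self.units                # t1 / t2: read current inventory
        both_have_read.wait(timeout=5)       # hold until the other client has read too
        if observed > 0:
            self.units = observed - 1        # t3 / t4: unconditional write
            with self._count_lock:
                self.confirmed += 1
            return True
        return False


def run_reproducer() -> tuple:
    inventory = UnsafeInventory(units=1)
    barrier = threading.Barrier(2)
    clients = [threading.Thread(target=inventory.reserve, args=(barrier,)) for _ in range(2)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    return inventory.units, inventory.confirmed


remaining, confirmed = run_reproducer()
print(remaining, confirmed, 1 - remaining - confirmed)  # 0 2 -1
```

Because the barrier releases only after both reads have happened, the oversell occurs on every run, which is what makes the reproducer usable as a regression test.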

Step 3: Add Observability

Record one trace row per reservation attempt.

Field	Purpose
request id	identify duplicates and retries
inventory key	detect hot keys
worker id	correlate with process, thread and deployment version
read version	prove which state was observed
write version	prove whether the update was conditional
result	success, conflict, rejected, timeout or retry
latency	per-request duration, summarized as mean, p95, p99 and maximum
queue delay	separate waiting from service time
retry count	detect retry amplification

Without version and result fields, the team may see only successful responses and miss that the success count violates the state invariant.
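One possible shape for the trace row, together with a query that the version field makes possible, is sketched below. The `TraceRow` type, field names, and the `stale_read_successes` helper are illustrative assumptions; a real harness would emit these rows to its own log or metrics pipeline.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Optional, Tuple

ALLOWED_RESULTS = {"success", "conflict", "rejected", "timeout", "retry"}


@dataclass(frozen=True)
class TraceRow:
    request_id: str
    inventory_key: str
    worker_id: str
    read_version: int
    write_version: Optional[int]  # None when no write was attempted
    result: str
    latency_ms: float
    queue_delay_ms: float
    retry_count: int


def to_json_line(row: TraceRow) -> str:
    """Serialize one attempt as a JSON line, rejecting unknown result labels."""
    if row.result not in ALLOWED_RESULTS:
        raise ValueError(f"unknown result: {row.result}")
    return json.dumps(asdict(row), sort_keys=True)


def stale_read_successes(rows: Iterable[TraceRow]) -> Dict[Tuple[str, int], int]:
    """Find (key, read_version) pairs that more than one success was based on."""
    counts = Counter(
        (r.inventory_key, r.read_version) for r in rows if r.result == "success"
    )
    return {pair: n for pair, n in counts.items() if n > 1}


rows = [
    TraceRow("req-a", "item-1", "w1", 7, 8, "success", 12.0, 1.5, 0),
    TraceRow("req-b", "item-1", "w2", 7, 8, "success", 13.0, 2.0, 0),
]
print(stale_read_successes(rows))  # {('item-1', 7): 2}
```

Two successes that share a key and a read version are direct evidence of the unsafe interleaving, even when every individual response looked healthy.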

Step 4: Load Test the Baseline

Run the unsafe implementation for:

T=10\ \text{min}=600\ \text{s}

Request rate:

\lambda=900\ \text{requests/s}

Total attempts:

N_{attempts}=\lambda T=900(600)=540{,}000

The test finds:

Metric	Observed value
confirmed reservations	$38{,}420$
expected confirmed reservations from inventory records	$38{,}000$
invariant mismatch	$420$
p99 latency	$132\ \text{ms}$
5xx error rate	$0.03\%$

Invariant error rate per confirmed reservation:

\displaystyle E_I=\frac{420}{38{,}420}\times100=1.09\%

Engineering Comment

Latency and server errors pass, but correctness fails. This is a release blocker. A load test that reports only p99 latency and 5xx rate would miss a severe data-integrity defect.
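A phase summary that evaluates correctness before performance makes this ordering explicit. The sketch below reuses the figures from this step; the function names and the verdict strings are assumptions for illustration.

```python
def baseline_summary(rate_per_s: int, duration_s: int,
                     confirmed: int, expected_confirmed: int) -> dict:
    """Derive attempts, invariant mismatch, and error rate per confirmed reservation."""
    mismatch = confirmed - expected_confirmed
    return {
        "attempts": rate_per_s * duration_s,
        "mismatch": mismatch,
        "invariant_error_rate_pct": 100.0 * mismatch / confirmed,
    }


def phase_verdict(mismatch: int, p99_ms: float, error_5xx_pct: float) -> str:
    """Check the invariant first; latency and 5xx targets only matter if it holds."""
    if mismatch != 0:
        return "fail: invariant"
    if p99_ms >= 150.0 or error_5xx_pct >= 0.1:
        return "fail: performance"
    return "pass"


summary = baseline_summary(900, 600, confirmed=38_420, expected_confirmed=38_000)
print(summary["attempts"], summary["mismatch"])       # 540000 420
print(f"{summary['invariant_error_rate_pct']:.2f}")   # 1.09
print(phase_verdict(summary["mismatch"], p99_ms=132.0, error_5xx_pct=0.03))  # fail: invariant
```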

Step 5: Mitigate with Conditional Update

Replace the unsafe write with a conditional update:

read inventory value and version;
compute reservation only if inventory is available;
write only if the stored version still equals the read version;
if the version changed, retry with a bounded retry policy or return a conflict;
make request ids idempotent so client retries do not duplicate reservations.

The corrected update should satisfy:

write\ succeeds \Rightarrow version_{stored}=version_{read}

and:

I_{new}=I_{old}-1

exactly once per successful version transition, so two writers can never both decrement from the same read version.

Engineering Comment

The fix is not simply “add retries.” Retries without idempotency can duplicate side effects. The mitigation must protect the shared state transition and the client-visible confirmation record.
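The steps above can be sketched against an in-memory stand-in for the store. Everything here is an assumption made for illustration: `VersionedStore`, `reserve_if_version`, and `reserve` are hypothetical names, and a lock emulates the atomicity that a real transactional store would provide. The confirmation record is written in the same atomic step as the decrement, which is what keeps a replayed request id from reserving twice.

```python
import threading
from typing import Dict, Tuple


class VersionedStore:
    """In-memory stand-in for a transactional key-value store (hypothetical API)."""

    def __init__(self, initial: Dict[str, int]) -> None:
        self._lock = threading.Lock()
        self._items = {key: (value, 0) for key, value in initial.items()}
        self._confirmed: Dict[str, str] = {}  # request_id -> inventory key

    def read(self, key: str) -> Tuple[int, int]:
        with self._lock:
            return self._items[key]  # (value, version)

    def reserve_if_version(self, key: str, expected_version: int, request_id: str) -> str:
        """Atomically decrement and record the confirmation if the version matches."""
        with self._lock:
            if request_id in self._confirmed:
                return "success"      # idempotent replay, no second decrement
            value, version = self._items[key]
            if version != expected_version:
                return "conflict"     # another writer won since our read
            if value <= 0:
                return "rejected"     # nothing left to reserve
            self._items[key] = (value - 1, version + 1)
            self._confirmed[request_id] = key
            return "success"


def reserve(store: VersionedStore, request_id: str, key: str, max_retries: int = 2) -> str:
    """Read, attempt a conditional write, and retry a bounded number of times."""
    for _ in range(1 + max_retries):
        _, version = store.read(key)
        outcome = store.reserve_if_version(key, version, request_id)
        if outcome != "conflict":
            return outcome
    return "conflict"


store = VersionedStore({"item-1": 1})
outcomes = []
outcomes_lock = threading.Lock()


def client(i: int) -> None:
    result = reserve(store, f"req-{i}", "item-1")
    with outcomes_lock:
        outcomes.append(result)


threads = [threading.Thread(target=client, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(outcomes.count("success"), store.read("item-1"))  # 1 (0, 1)
```

Eight clients compete for one unit and exactly one is confirmed; the rest receive either a conflict or a rejection. Calling `reserve` again with an already confirmed request id returns success without changing the stored value.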

Step 6: Check Retry Amplification

After mitigation, conflict rate under hot-key load is:

p_c=0.18

The policy permits at most:

r=2\ \text{retries}

A simple expected-attempts screen, assuming each attempt conflicts independently with probability $p_c$, is:

E[A]=1+p_c+p_c^2

E[A]=1+0.18+0.18^2=1.2124\approx 1.212

Effective storage attempt rate at peak:

\lambda_{store}=900(1.212)\approx 1091\ \text{attempts/s}

Engineering Comment

The mitigation adds about $21.2\%$ storage attempt load under this conflict rate. If the storage system cannot absorb that load, the project needs backpressure, hot-key sharding, admission control, queue limits, or a different reservation design.
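The same screen generalizes to any retry limit as a truncated geometric sum, which is convenient when comparing retry policies. The function name is an assumption, and the independence of conflicts across attempts is the same simplification used in the screen above.

```python
def expected_attempts(conflict_prob: float, max_retries: int) -> float:
    """Expected storage attempts per request: 1 + p + p^2 + ... + p^r."""
    if not 0.0 <= conflict_prob <= 1.0:
        raise ValueError("conflict_prob must be in [0, 1]")
    return sum(conflict_prob ** k for k in range(max_retries + 1))


attempts_per_request = expected_attempts(conflict_prob=0.18, max_retries=2)
print(f"{attempts_per_request:.3f}")        # 1.212
print(round(900 * attempts_per_request))    # 1091
```

Under hot-key load, conflicts on successive attempts are often correlated, so the measured retry count per request from the trace should be compared against this estimate rather than replaced by it.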

Step 7: Retest Evidence

Run the same 10-minute peak test after the conditional update and idempotency change.

Metric	Baseline	After mitigation	Criterion
invariant mismatch	$420$	$0$	$0$
p99 latency	$132\ \text{ms}$	$141\ \text{ms}$	below $150\ \text{ms}$
5xx error rate	$0.03\%$	$0.04\%$	below $0.1\%$
conflict responses	not measured	$6.8\%$	reported and bounded
duplicate confirmations	not measured	$0$	$0$
maximum queue delay	$28\ \text{ms}$	$35\ \text{ms}$	below $50\ \text{ms}$

The corrected test passes the stated release criteria.
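The numeric criteria in the table can be encoded so the pass or fail decision is computed rather than read off by eye. This is a sketch with assumed metric names; the conflict-response row is left out because its criterion ("reported and bounded") is not a fixed threshold. A metric that was not measured is treated as a failure.

```python
CRITERIA = {
    "invariant_mismatch": lambda v: v == 0,
    "p99_latency_ms": lambda v: v < 150,
    "error_5xx_pct": lambda v: v < 0.1,
    "duplicate_confirmations": lambda v: v == 0,
    "max_queue_delay_ms": lambda v: v < 50,
}


def failed_criteria(metrics: dict) -> list:
    """Return the sorted names of criteria that fail or were not measured."""
    failed = []
    for name, passes in CRITERIA.items():
        value = metrics.get(name)
        if value is None or not passes(value):
            failed.append(name)
    return sorted(failed)


baseline = {
    "invariant_mismatch": 420,
    "p99_latency_ms": 132,
    "error_5xx_pct": 0.03,
    "duplicate_confirmations": None,  # not measured in the baseline run
    "max_queue_delay_ms": 28,
}
after_mitigation = {
    "invariant_mismatch": 0,
    "p99_latency_ms": 141,
    "error_5xx_pct": 0.04,
    "duplicate_confirmations": 0,
    "max_queue_delay_ms": 35,
}

print(failed_criteria(baseline))          # ['duplicate_confirmations', 'invariant_mismatch']
print(failed_criteria(after_mitigation))  # []
```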

Engineering Comment

The retest should use the same workload seed or a documented equivalent workload. Otherwise a lower hot-key collision rate could make the fix look better than it is.
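A seeded generator is one way to make the workload repeatable between the baseline and the retest. The sketch below is an assumption about how key popularity might be modeled (a fixed share of requests aimed at a small hot set); the function name, parameters, and key format are hypothetical.

```python
import random
from typing import List


def hot_key_workload(seed: int, n_requests: int, n_keys: int,
                     hot_fraction: float = 0.2, hot_share: float = 0.8) -> List[str]:
    """Generate a reproducible key sequence where a small hot set gets most traffic."""
    n_hot = max(1, int(n_keys * hot_fraction))
    if n_hot >= n_keys:
        raise ValueError("need at least one cold key")
    rng = random.Random(seed)  # private generator, independent of global state
    keys = []
    for _ in range(n_requests):
        if rng.random() < hot_share:
            keys.append(f"item-{rng.randrange(n_hot)}")          # hot key
        else:
            keys.append(f"item-{rng.randrange(n_hot, n_keys)}")  # cold key
    return keys


baseline_keys = hot_key_workload(seed=42, n_requests=1000, n_keys=50)
retest_keys = hot_key_workload(seed=42, n_requests=1000, n_keys=50)
print(baseline_keys == retest_keys)  # True
```

Recording the seed and the generator parameters in the report, or persisting the generated sequence itself, lets a reviewer confirm that both runs saw the same hot-key collision pattern.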

Step 8: Release Decision

Prepare the release package.

Release item	Status	Evidence
race reproduction	closed	two-client trace shows unsafe interleaving
root cause	closed	read-modify-write without conditional version check
mitigation	closed	conditional update, idempotency key and bounded retry
invariant under load	pass	zero mismatch in the 10-minute peak retest
latency target	pass	p99 below $150\ \text{ms}$ after mitigation
retry amplification	conditional pass	storage attempt load within observed margin
observability	pass	request id, key, version, result, retry and queue metrics present
rollout	conditional pass	canary with invariant mismatch rollback trigger

Recommended decision:

Release to canary only. Expand rollout if invariant mismatch remains zero, p99 latency stays below target, storage conflict rate remains within the tested envelope, and duplicate confirmations remain zero.

Monitoring Triggers

Production monitoring should alert on:

invariant mismatch greater than zero;
duplicate request id with multiple confirmations;
inventory key conflict rate above the tested envelope;
retry attempts per request above the validated limit;
p99 latency above target;
queue delay above target;
storage conditional-write failures that rise faster than traffic;
canary error rate or rollback metric breach.
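Most of these triggers are fixed thresholds, but "rise faster than traffic" needs a ratio comparison between two observation windows. One hedged way to express it is shown below; the function name and the `tolerance` factor are assumptions, and the right tolerance would have to come from the tested envelope.

```python
def failures_outpacing_traffic(prev_requests: int, prev_failures: int,
                               cur_requests: int, cur_failures: int,
                               tolerance: float = 1.25) -> bool:
    """True when failures per request grew by more than `tolerance` times."""
    if prev_requests <= 0 or cur_requests <= 0:
        raise ValueError("request counts must be positive")
    prev_ratio = prev_failures / prev_requests
    cur_ratio = cur_failures / cur_requests
    if prev_ratio == 0:
        return cur_ratio > 0  # any failure is new when the previous window had none
    return cur_ratio / prev_ratio > tolerance


# Traffic doubles and failures double: ratio unchanged, no alert.
print(failures_outpacing_traffic(10_000, 500, 20_000, 1_000))  # False
# Traffic doubles but failures quadruple: ratio doubled, alert.
print(failures_outpacing_traffic(10_000, 500, 20_000, 2_000))  # True
```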

Common Failure Modes

Common failures in concurrency validation include:

testing average throughput but not state invariants;
reproducing the bug manually but not capturing the interleaving trace;
fixing one code path while retries or cancellation still duplicate side effects;
treating conflict retries as free load;
using random load that does not create hot-key collisions;
validating a single worker but releasing multiple workers;
missing version, request id or queue delay in telemetry;
rolling out without a rollback trigger tied to correctness.

Limitations

This project uses a simplified reservation service. Real systems may need stronger transaction semantics, distributed locks, consensus, partition handling, schema migration checks, clock-drift review, storage failover tests, or formal verification for critical invariants. The core workflow still transfers: define invariants, reproduce the race, instrument causality, mitigate the shared-state transition, retest under representative load, and release only with monitoring tied to the failure mode.
