Case study
Cache False Sharing Performance Regression Case Study
Computer engineering case study on diagnosing a multicore performance regression caused by false sharing, cache-line ownership traffic, weak speedup, padding tradeoffs, and validation evidence.
False sharing is a performance failure in which different threads update different variables that happen to occupy the same cache line. The program is logically correct, and each thread may appear to own its own counter or state object. The hardware, however, manages ownership at cache-line granularity, so independent writes can force the same line to move between cores.
This case study follows a telemetry aggregation service that slows down after a multicore scaling change. The team expected eight worker threads to improve throughput. Instead, total throughput falls below the single-thread baseline and p99.9 latency violates the service target. The root cause is not algorithmic complexity or insufficient CPU frequency. It is false sharing in a per-thread counter array.
The purpose is to connect data layout, cache-line geometry, throughput measurement, speedup efficiency, coherence traffic, and validation evidence into one engineering diagnosis.
Case Context
The service receives machine telemetry, validates messages, updates per-worker counters, and flushes summary metrics once per second. The hot path increments counters for accepted samples, rejected samples, and parser states.
A refactor replaced one shared locked counter with an array of per-worker counters:
typedef struct {
uint64_t samples;
uint64_t errors;
} WorkerCounter;
WorkerCounter counters[8];
The change removed lock contention, but performance became worse. The apparent design intent was good: each worker writes only its own counter. The missing review was physical memory layout.
| Item | Value |
|---|---|
| Worker threads | 8 |
| Processor cores used | 8 |
| Cache-line size | 64\ \text{bytes} |
| Counter object size | 16\ \text{bytes} |
| Single-thread baseline | 220\ \text{million updates/s} |
| Eight-thread unpadded throughput | 140\ \text{million updates/s} |
| Eight-thread padded throughput | 1.42\ \text{billion updates/s} |
| Required sustained throughput | 900\ \text{million updates/s} |
| Latency target | p99.9 below 1.0\ \text{ms} |
| Unpadded p99.9 latency | 2.8\ \text{ms} |
| Padded p99.9 latency | 0.42\ \text{ms} |
The incident is important because the source code change looks like a standard scalability improvement. The performance failure is visible only when the data structure is interpreted as hardware traffic.
Initial Symptom
The deployment report contains three confusing facts:
- CPU utilization rises across all cores.
- Lock wait time falls compared with the old implementation.
- Throughput falls and tail latency increases.
This combination rules out the simplest explanation that a lock is blocking the system. It suggests that the workers are running, but the cores are spending time on memory-system coordination rather than useful work.
Useful profiling evidence includes:
| Evidence | Interpretation |
|---|---|
| high cache-to-cache transfers on the counter addresses | cache-line ownership is moving between cores |
| low lock wait time | not primarily a mutex problem |
| worker CPU time high but useful throughput low | cores are busy with coherence overhead |
| performance improves when workers are pinned | scheduling contributes, but does not remove the root cause |
| performance improves when counters are padded | data layout is causal |
The last observation is decisive only if the padded version is tested under the same workload, compiler settings, thread count, and hardware.
Cache-Line Layout
The counter object uses:
The cache-line size is:
The number of counters that fit in one cache line is:
Therefore, the eight-worker array occupies two cache lines if aligned favorably:
| Cache line | Counter objects |
|---|---|
| line 0 | worker 0, worker 1, worker 2, worker 3 |
| line 1 | worker 4, worker 5, worker 6, worker 7 |
Each worker writes a different counter. At the source-code level, there is no shared variable. At the cache-coherence level, four workers repeatedly write the same cache line.
For a core to write its counter, it must obtain ownership of the cache line containing that counter. Another core writing a neighboring counter invalidates or transfers that same line. The line bounces between cores even though the variables are distinct.
Throughput and Speedup Check
The single-thread baseline is:
With eight workers, ideal linear throughput would be:
The unpadded implementation achieves:
Speedup relative to the single-thread baseline is:
The eight-thread implementation is slower than the single-thread baseline:
Parallel efficiency is:
Only about 8.0\% of the ideal eight-core scaling is being converted into useful throughput. That is not a tuning issue. It is evidence of a structural bottleneck.
Coherence Traffic Estimate
A first-pass lower-bound estimate treats each counter update as requiring ownership movement of a 64\ \text{byte} line. This is simplified, but it gives the correct engineering scale.
At:
minimum cache-line traffic associated with ownership movement is:
So the coherence system may be forced to handle at least:
of cache-line movement for a counter update that carries only an 8\ \text{byte} increment. Actual traffic can be higher because protocols, retries, evictions, scheduling, and other memory operations add overhead.
The key point is not the exact bandwidth. The key point is that the implementation turns tiny independent counter writes into high-rate inter-core ownership traffic.
Padded Counter Design
The corrective design gives each worker counter its own cache line. In C++-style notation, the intent is both size and alignment:
struct alignas(64) PaddedWorkerCounter {
uint64_t samples;
uint64_t errors;
uint8_t padding[48];
};
PaddedWorkerCounter counters[8];
The padded object size is:
The counter array size changes from:
to:
The memory overhead is:
For this service, 384\ \text{bytes} is negligible. For millions of objects, the same padding strategy could be unacceptable. Padding is a targeted fix for hot independently written state, not a universal data-layout rule.
Corrected Performance
The padded version achieves:
Speedup relative to the single-thread baseline is:
Parallel efficiency is:
So:
The implementation still does not reach ideal scaling. Remaining overhead can come from instruction costs, aggregation work, memory ordering, branch behavior, scheduler noise, and measurement overhead. But the structural regression has been removed.
The throughput margin against the requirement is:
With:
the margin is:
The padded implementation has about:
throughput margin relative to the sustained requirement.
Aggregation Cost Check
Padding reduces hot-path false sharing, but the counters still need to be read by an aggregator. Suppose the system aggregates once every:
The aggregation frequency is:
Each aggregation reads eight padded cache lines:
The aggregation traffic is:
That is about:
This is tiny compared with the multi-GB/s ownership traffic implied by the unpadded hot path. The fix is therefore technically coherent: it trades a small memory increase and small aggregation read cost for a large reduction in write-sharing traffic.
Diagnosis
The evidence supports false sharing as the root cause:
| Observation | Engineering meaning |
|---|---|
| per-worker counters are adjacent in memory | independent variables share cache lines |
| throughput degrades when more threads write counters | scaling is limited by coherence, not arithmetic |
| cache-to-cache transfer counters are high | lines are moving between cores |
| padding restores throughput | data layout change removes the dominant coupling |
| tail latency improves after padding | coherence stalls were affecting latency, not only average throughput |
The root problem is not that multicore processing failed. The root problem is that the data layout contradicted the ownership model of the cache-coherence system.
Validation After the Fix
The corrected design should be released only with evidence that the performance improvement is real and that the memory tradeoff is acceptable.
Acceptance evidence should include:
- benchmark results for single-thread, unpadded eight-thread, and padded eight-thread variants;
- p50, p95, p99, and p99.9 latency under production-like load;
- hardware performance counters showing reduced cache-to-cache transfers or coherence stalls;
- source-level documentation stating why the padding exists;
- static assertions or tests preserving the intended counter size and alignment;
- CPU affinity test showing whether thread migration changes the result;
- memory-footprint review if the pattern is copied to larger data structures;
- regression benchmark in continuous integration for the hot counter path;
- production telemetry comparing update rate, latency, CPU utilization, and dropped messages after rollout.
The release criterion should connect performance to requirement, not only to improvement:
and:
under the released workload, thread count, hardware class, and compiler settings.
Engineering Lessons
The first lesson is that data ownership is physical, not only logical. If independent variables share a cache line, independent threads can still interfere.
The second lesson is that speedup must be measured against a baseline and an ideal bound. Eight busy cores do not prove useful parallel work.
The third lesson is that padding is a tradeoff. It is valuable for small hot structures with independent writers, but harmful if applied blindly to large arrays or cache-sensitive read-mostly data.
Good performance engineering therefore connects algorithmic complexity, data layout, cache behavior, profiling counters, latency percentiles, and release requirements. The fastest fix is often not a new algorithm. It is making the existing algorithm match the hardware memory system it actually runs on.