Case study

Cache False Sharing Performance Regression Case Study

Computer engineering case study on diagnosing a multicore performance regression caused by false sharing, cache-line ownership traffic, weak speedup, padding tradeoffs, and validation evidence.

False sharing is a performance failure in which different threads update different variables that happen to occupy the same cache line. The program is logically correct, and each thread may appear to own its own counter or state object. The hardware, however, manages ownership at cache-line granularity, so independent writes can force the same line to move between cores.

This case study follows a telemetry aggregation service that slows down after a multicore scaling change. The team expected eight worker threads to improve throughput. Instead, total throughput falls below the single-thread baseline and p99.9 latency violates the service target. The root cause is not algorithmic complexity or insufficient CPU frequency. It is false sharing in a per-thread counter array.

The purpose is to connect data layout, cache-line geometry, throughput measurement, speedup efficiency, coherence traffic, and validation evidence into one engineering diagnosis.

Case Context

The service receives machine telemetry, validates messages, updates per-worker counters, and flushes summary metrics once per second. The hot path increments counters for accepted samples, rejected samples, and parser states.

A refactor replaced one shared locked counter with an array of per-worker counters:

typedef struct {
  uint64_t samples;
  uint64_t errors;
} WorkerCounter;

WorkerCounter counters[8];

The change removed lock contention, but performance became worse. The apparent design intent was good: each worker writes only its own counter. The missing review was physical memory layout.

ItemValue
Worker threads8
Processor cores used8
Cache-line size64\ \text{bytes}
Counter object size16\ \text{bytes}
Single-thread baseline220\ \text{million updates/s}
Eight-thread unpadded throughput140\ \text{million updates/s}
Eight-thread padded throughput1.42\ \text{billion updates/s}
Required sustained throughput900\ \text{million updates/s}
Latency targetp99.9 below 1.0\ \text{ms}
Unpadded p99.9 latency2.8\ \text{ms}
Padded p99.9 latency0.42\ \text{ms}

The incident is important because the source code change looks like a standard scalability improvement. The performance failure is visible only when the data structure is interpreted as hardware traffic.

Initial Symptom

The deployment report contains three confusing facts:

  1. CPU utilization rises across all cores.
  2. Lock wait time falls compared with the old implementation.
  3. Throughput falls and tail latency increases.

This combination rules out the simplest explanation that a lock is blocking the system. It suggests that the workers are running, but the cores are spending time on memory-system coordination rather than useful work.

Useful profiling evidence includes:

EvidenceInterpretation
high cache-to-cache transfers on the counter addressescache-line ownership is moving between cores
low lock wait timenot primarily a mutex problem
worker CPU time high but useful throughput lowcores are busy with coherence overhead
performance improves when workers are pinnedscheduling contributes, but does not remove the root cause
performance improves when counters are paddeddata layout is causal

The last observation is decisive only if the padded version is tested under the same workload, compiler settings, thread count, and hardware.

Cache-Line Layout

The counter object uses:

S_c=16\ \text{bytes}

The cache-line size is:

L=64\ \text{bytes}

The number of counters that fit in one cache line is:

\displaystyle N_{line}=\frac{L}{S_c}=\frac{64}{16}=4

Therefore, the eight-worker array occupies two cache lines if aligned favorably:

Cache lineCounter objects
line 0worker 0, worker 1, worker 2, worker 3
line 1worker 4, worker 5, worker 6, worker 7

Each worker writes a different counter. At the source-code level, there is no shared variable. At the cache-coherence level, four workers repeatedly write the same cache line.

For a core to write its counter, it must obtain ownership of the cache line containing that counter. Another core writing a neighboring counter invalidates or transfers that same line. The line bounces between cores even though the variables are distinct.

Throughput and Speedup Check

The single-thread baseline is:

R_1=220\times10^6\ \text{updates/s}

With eight workers, ideal linear throughput would be:

R_{ideal}=8R_1
R_{ideal}=8(220\times10^6)=1.76\times10^9\ \text{updates/s}

The unpadded implementation achieves:

R_{unpadded}=140\times10^6\ \text{updates/s}

Speedup relative to the single-thread baseline is:

\displaystyle S_{unpadded}=\frac{R_{unpadded}}{R_1}
\displaystyle S_{unpadded}=\frac{140}{220}=0.64

The eight-thread implementation is slower than the single-thread baseline:

S_{unpadded}<1

Parallel efficiency is:

\displaystyle E_{unpadded}=\frac{S_{unpadded}}{8}
\displaystyle E_{unpadded}=\frac{0.64}{8}=0.080

Only about 8.0\% of the ideal eight-core scaling is being converted into useful throughput. That is not a tuning issue. It is evidence of a structural bottleneck.

Coherence Traffic Estimate

A first-pass lower-bound estimate treats each counter update as requiring ownership movement of a 64\ \text{byte} line. This is simplified, but it gives the correct engineering scale.

At:

R_{unpadded}=140\times10^6\ \text{updates/s}

minimum cache-line traffic associated with ownership movement is:

B_{coh}=R_{unpadded}L
B_{coh}=(140\times10^6)(64)
B_{coh}=8.96\times10^9\ \text{bytes/s}

So the coherence system may be forced to handle at least:

8.96\ \text{GB/s}

of cache-line movement for a counter update that carries only an 8\ \text{byte} increment. Actual traffic can be higher because protocols, retries, evictions, scheduling, and other memory operations add overhead.

The key point is not the exact bandwidth. The key point is that the implementation turns tiny independent counter writes into high-rate inter-core ownership traffic.

Padded Counter Design

The corrective design gives each worker counter its own cache line. In C++-style notation, the intent is both size and alignment:

struct alignas(64) PaddedWorkerCounter {
  uint64_t samples;
  uint64_t errors;
  uint8_t padding[48];
};

PaddedWorkerCounter counters[8];

The padded object size is:

S_p=64\ \text{bytes}

The counter array size changes from:

M_{old}=8(16)=128\ \text{bytes}

to:

M_{new}=8(64)=512\ \text{bytes}

The memory overhead is:

\Delta M=512-128=384\ \text{bytes}

For this service, 384\ \text{bytes} is negligible. For millions of objects, the same padding strategy could be unacceptable. Padding is a targeted fix for hot independently written state, not a universal data-layout rule.

Corrected Performance

The padded version achieves:

R_{padded}=1.42\times10^9\ \text{updates/s}

Speedup relative to the single-thread baseline is:

\displaystyle S_{padded}=\frac{1.42\times10^9}{220\times10^6}
S_{padded}=6.45

Parallel efficiency is:

\displaystyle E_{padded}=\frac{6.45}{8}=0.806

So:

E_{padded}=80.6\%

The implementation still does not reach ideal scaling. Remaining overhead can come from instruction costs, aggregation work, memory ordering, branch behavior, scheduler noise, and measurement overhead. But the structural regression has been removed.

The throughput margin against the requirement is:

\displaystyle M_R=\frac{R_{padded}-R_{required}}{R_{required}}

With:

R_{required}=900\times10^6\ \text{updates/s}

the margin is:

\displaystyle M_R=\frac{1.42\times10^9-900\times10^6}{900\times10^6}
M_R=0.578

The padded implementation has about:

57.8\%

throughput margin relative to the sustained requirement.

Aggregation Cost Check

Padding reduces hot-path false sharing, but the counters still need to be read by an aggregator. Suppose the system aggregates once every:

T_a=10\ \text{ms}

The aggregation frequency is:

\displaystyle f_a=\frac{1}{T_a}=\frac{1}{0.010}=100\ \text{Hz}

Each aggregation reads eight padded cache lines:

M_a=8(64)=512\ \text{bytes}

The aggregation traffic is:

B_a=M_af_a
B_a=512(100)=51{,}200\ \text{bytes/s}

That is about:

51.2\ \text{kB/s}

This is tiny compared with the multi-GB/s ownership traffic implied by the unpadded hot path. The fix is therefore technically coherent: it trades a small memory increase and small aggregation read cost for a large reduction in write-sharing traffic.

Diagnosis

The evidence supports false sharing as the root cause:

ObservationEngineering meaning
per-worker counters are adjacent in memoryindependent variables share cache lines
throughput degrades when more threads write countersscaling is limited by coherence, not arithmetic
cache-to-cache transfer counters are highlines are moving between cores
padding restores throughputdata layout change removes the dominant coupling
tail latency improves after paddingcoherence stalls were affecting latency, not only average throughput

The root problem is not that multicore processing failed. The root problem is that the data layout contradicted the ownership model of the cache-coherence system.

Validation After the Fix

The corrected design should be released only with evidence that the performance improvement is real and that the memory tradeoff is acceptable.

Acceptance evidence should include:

  1. benchmark results for single-thread, unpadded eight-thread, and padded eight-thread variants;
  2. p50, p95, p99, and p99.9 latency under production-like load;
  3. hardware performance counters showing reduced cache-to-cache transfers or coherence stalls;
  4. source-level documentation stating why the padding exists;
  5. static assertions or tests preserving the intended counter size and alignment;
  6. CPU affinity test showing whether thread migration changes the result;
  7. memory-footprint review if the pattern is copied to larger data structures;
  8. regression benchmark in continuous integration for the hot counter path;
  9. production telemetry comparing update rate, latency, CPU utilization, and dropped messages after rollout.

The release criterion should connect performance to requirement, not only to improvement:

R_{measured}>900\times10^6\ \text{updates/s}

and:

L_{\text{p99.9}}<1.0\ \text{ms}

under the released workload, thread count, hardware class, and compiler settings.

Engineering Lessons

The first lesson is that data ownership is physical, not only logical. If independent variables share a cache line, independent threads can still interfere.

The second lesson is that speedup must be measured against a baseline and an ideal bound. Eight busy cores do not prove useful parallel work.

The third lesson is that padding is a tradeoff. It is valuable for small hot structures with independent writers, but harmful if applied blindly to large arrays or cache-sensitive read-mostly data.

Good performance engineering therefore connects algorithmic complexity, data layout, cache behavior, profiling counters, latency percentiles, and release requirements. The fastest fix is often not a new algorithm. It is making the existing algorithm match the hardware memory system it actually runs on.

REF

See also