Exercise set

Cache, Memory, and Performance Exercises

Worked computer engineering exercises for cache behavior, memory footprint, average memory access time, locality, bus bandwidth, transfer time, queue utilization, Amdahl speedup, tail latency, and performance-validation evidence.

These exercises practise cache, memory, and performance analysis as computer engineering evidence. They connect architecture, data structures, workload shape, memory hierarchy, bus limits, queueing, and validation measurements.

The goal is not only to calculate a runtime or memory value. The goal is to decide whether the system has enough margin, whether the benchmark represents the real workload, and whether a performance claim survives cache effects, memory pressure, contention, and tail latency.

Assume simplified screening models unless an exercise states otherwise. Real systems should also check compiler settings, runtime version, processor variant, cache policy, memory wait states, DMA behavior, interrupts, operating-system scheduling, thermal throttling, and measurement overhead.

How to Use These Exercises

For each exercise, define:

  1. the workload and operation being measured;
  2. the data size, data layout, and access pattern;
  3. the relevant memory level, bus, queue, or processor limit;
  4. the latency or throughput requirement;
  5. the validation evidence needed before accepting the result.

The common mistake is treating Big O, clock speed, or benchmark averages as complete evidence. Performance failures often come from memory locality, queue buildup, bus contention, rare input shapes, and high-percentile latency.

Exercise 1: Cache Lines Touched by an Array

An array contains:

n=4096

32-bit values. The cache line size is:

L=64\ \text{bytes}

Assume the array is aligned to a cache-line boundary. Estimate how many cache lines are touched by one sequential scan.

Solution

Each 32-bit value uses:

S_e=4\ \text{bytes}

Total array size:

M=nS_e=4096(4)=16384\ \text{bytes}

Number of cache lines:

\displaystyle N_{lines}=\frac{M}{L}=\frac{16384}{64}=256

Engineering Comment

A sequential scan has strong spatial locality. The estimate assumes aligned storage and no conflict effects. Real measurements should confirm cache misses, prefetch behavior, competing memory traffic, and whether the scan fits the surrounding workload.

Exercise 2: Average Memory Access Time

A processor has cache hit time:

T_{hit}=2.5\ \text{ns}

miss rate:

m=0.04

and miss penalty:

T_{miss}=70\ \text{ns}

Estimate average memory access time:

AMAT=T_{hit}+mT_{miss}

Solution

Substitute:

AMAT=2.5+0.04(70)=5.3\ \text{ns}

Engineering Comment

The 4% miss rate more than doubles effective access time. A small miss-rate change can matter more than a small clock-speed change. The result should be measured with representative data size, access order, and cache-warm state.

Exercise 3: Memory Footprint of a Dense Table

A dense table stores:

n=250000

records. Each record has:

S_e=32\ \text{bytes}

of payload and the table has:

S_{header}=128\ \text{bytes}

of fixed metadata. Estimate total memory:

M\approx nS_e+S_{header}

Solution

Compute:

M=250000(32)+128=8000128\ \text{bytes}

Using decimal megabytes:

M\approx 8.00\ \text{MB}

Engineering Comment

The payload fits easily in many systems, but the release budget should include alignment, allocator overhead, indexes, staging buffers, logs, serialization, and peak memory during update or copy operations.

Exercise 4: Pointer-Based Node Overhead

A tree stores:

n=100000

nodes. Each node contains:

S_e=24\ \text{bytes}

of payload, two 8-byte pointers, and:

S_m=16\ \text{bytes}

of allocator or metadata overhead. Estimate total memory:

M\approx n(S_e+pS_p+S_m)

with:

p=2,\quad S_p=8\ \text{bytes}

Solution

Memory per node:

S_{node}=24+2(8)+16=56\ \text{bytes}

Total:

M=100000(56)=5600000\ \text{bytes}

So:

M\approx 5.6\ \text{MB}

Engineering Comment

The pointer-based structure uses less payload than total memory. It may also have poor locality because nodes can be scattered. For scan-heavy workloads, a packed array or B-tree-like layout may outperform a theoretically elegant pointer tree.

Exercise 5: Cache-Fit Screening

A hot working set has:

M_{work}=96\ \text{kB}

The processor has:

M_{L1}=32\ \text{kB}

and:

M_{L2}=512\ \text{kB}

Does the working set fit in L1 and L2?

Solution

Compare sizes:

96\ \text{kB}>32\ \text{kB}

so the working set does not fit in L1.

Also:

96\ \text{kB}<512\ \text{kB}

so it can fit in L2 under ideal conditions.

Engineering Comment

This screen suggests L1 misses may be important while L2 can absorb much of the working set. Real behavior depends on associativity, conflicts, code footprint, other data, prefetching, interrupts, and whether the working set is reused before eviction.

Exercise 6: Row-Major Locality

A matrix has:

1024\times1024

elements, each:

8\ \text{bytes}

The language stores rows contiguously. How many bytes are in one row, and why can column-wise traversal be slower?

Solution

One row contains:

1024(8)=8192\ \text{bytes}

Therefore one row is:

8\ \text{kB}

In row-wise traversal, adjacent accesses usually move through contiguous memory. In column-wise traversal, consecutive accesses jump by one full row:

8192\ \text{bytes}

Engineering Comment

The algorithmic operation count is the same for both traversals, but memory locality is different. Column-wise traversal can waste cache lines and defeat prefetching. Performance review should include access pattern, not only complexity.

Exercise 7: Raw and Effective Bus Bandwidth

A memory bus has width:

w=64\ \text{bits}

clock rate:

f=200\ \text{MHz}

and effective efficiency:

\eta=0.70

Estimate effective bandwidth:

R_{eff}=wf\eta

Solution

Raw bandwidth:

R_{raw}=64(200\times10^6)=12.8\times10^9\ \text{bit/s}

Effective bandwidth:

R_{eff}=12.8(0.70)=8.96\ \text{Gbit/s}

In bytes per second:

\displaystyle R_{eff}=\frac{8.96}{8}=1.12\ \text{GB/s}

Engineering Comment

Efficiency accounts for overhead, wait states, arbitration, refresh, contention, and protocol limits. If measured bandwidth is lower, check burst size, alignment, cache policy, DMA setup, and competing bus masters.

Exercise 8: Transfer Time with Setup Latency

A block transfer moves:

B=6\ \text{MB}

over an effective link:

R=300\ \text{MB/s}

The transfer also has fixed setup latency:

T_0=40\ \mu\text{s}

Estimate total transfer time:

\displaystyle T_{total}=T_0+\frac{B}{R}

Solution

Payload transfer time:

\displaystyle \frac{B}{R}=\frac{6}{300}=0.020\ \text{s}

Convert setup latency:

T_0=0.000040\ \text{s}

Total:

T_{total}=0.020040\ \text{s}

Therefore:

T_{total}=20.04\ \text{ms}

Engineering Comment

For a large transfer, setup latency is almost negligible. For small messages, the same setup latency may dominate. A benchmark should include the message sizes used by the real system.

Exercise 9: Hash Table Load Factor

A hash table stores:

n=36000

entries in:

m=48000

buckets. Compute load factor:

\displaystyle \alpha=\frac{n}{m}

Solution

Substitute:

\displaystyle \alpha=\frac{36000}{48000}=0.75

Engineering Comment

A load factor of 0.75 can be reasonable, but collision behavior depends on hash quality, probing strategy, key distribution, and resizing policy. Real-time systems should avoid unbounded resize pauses or collision chains.

Exercise 10: Dynamic Array Growth Cost

A dynamic array grows from capacity:

C=100000

to:

2C=200000

when a new element is appended. Each existing element is:

S_e=16\ \text{bytes}

Estimate bytes copied during the resize.

Solution

The resize copies the existing capacity:

B_{copy}=CS_e=100000(16)=1600000\ \text{bytes}

Therefore:

B_{copy}=1.6\ \text{MB}

Engineering Comment

Amortized append cost may be constant, but this single resize can create a latency spike. Systems with hard deadlines may need reserved capacity, bounded containers, chunked storage, or incremental migration.

Exercise 11: Queue Utilization

Requests arrive at:

\lambda=180\ \text{requests/s}

There are:

c=3

workers. Each worker can serve:

\mu=80\ \text{requests/s}

Estimate utilization:

\displaystyle \rho=\frac{\lambda}{c\mu}

Solution

Total service capacity:

c\mu=3(80)=240\ \text{requests/s}

Utilization:

\displaystyle \rho=\frac{180}{240}=0.75

Engineering Comment

The average load is below capacity, but burstiness, service-time variation, priority work, locks, and retries can still create tail latency. Queue validation should include high-percentile latency and overload behavior.

Exercise 12: Amdahl Speedup

A workload has serial fraction:

s=0.12

and runs on:

p=8

processors. Estimate ideal speedup:

\displaystyle Speedup(p)=\frac{1}{s+\frac{1-s}{p}}

Solution

Substitute:

\displaystyle Speedup(8)=\frac{1}{0.12+\frac{0.88}{8}}
\displaystyle Speedup(8)=\frac{1}{0.12+0.11}
Speedup(8)=4.35

Engineering Comment

Eight processors do not give 8x speedup because serial work remains. Real speedup can be lower because of synchronization, memory bandwidth, cache coherence, load imbalance, and I/O.

Exercise 13: Tail Latency from Measurements

A benchmark records these request latencies in milliseconds:

4,\ 5,\ 5,\ 6,\ 6,\ 7,\ 8,\ 9,\ 15,\ 42

Compute the maximum observed latency and explain why the average is insufficient.

Solution

The maximum observed latency is:

T_{max}=42\ \text{ms}

The average is:

\displaystyle \bar{T}=\frac{4+5+5+6+6+7+8+9+15+42}{10}=10.7\ \text{ms}

Engineering Comment

The 10.7 ms average hides a 42 ms stall. For user-facing, real-time, or service-level systems, high-percentile latency and deadline misses may matter more than the mean.

Exercise 14: Performance Claim Validation

A cache optimization reduces average runtime from:

T_{old}=80\ \text{ms}

to:

T_{new}=52\ \text{ms}

Compute percentage improvement relative to the old runtime:

\displaystyle I=\frac{T_{old}-T_{new}}{T_{old}}

Solution

Substitute:

\displaystyle I=\frac{80-52}{80}=0.35

Therefore:

I=35\%

Engineering Comment

A 35% average improvement is valuable, but release evidence should include input distribution, cold-cache and warm-cache results, memory footprint, tail latency, processor and compiler version, thermal state, and regression results for correctness.

REF

See also