Exercise set
Cache, Memory, and Performance Exercises
Worked computer engineering exercises for cache behavior, memory footprint, average memory access time, locality, bus bandwidth, transfer time, queue utilization, Amdahl speedup, tail latency, and performance-validation evidence.
These exercises practise cache, memory, and performance analysis as computer engineering evidence. They connect architecture, data structures, workload shape, memory hierarchy, bus limits, queueing, and validation measurements.
The goal is not only to calculate a runtime or memory value. The goal is to decide whether the system has enough margin, whether the benchmark represents the real workload, and whether a performance claim survives cache effects, memory pressure, contention, and tail latency.
Assume simplified screening models unless an exercise states otherwise. Real systems should also check compiler settings, runtime version, processor variant, cache policy, memory wait states, DMA behavior, interrupts, operating-system scheduling, thermal throttling, and measurement overhead.
How to Use These Exercises
For each exercise, define:
- the workload and operation being measured;
- the data size, data layout, and access pattern;
- the relevant memory level, bus, queue, or processor limit;
- the latency or throughput requirement;
- the validation evidence needed before accepting the result.
The common mistake is treating Big O, clock speed, or benchmark averages as complete evidence. Performance failures often come from memory locality, queue buildup, bus contention, rare input shapes, and high-percentile latency.
Exercise 1: Cache Lines Touched by an Array
An array contains:
32-bit values. The cache line size is:
Assume the array is aligned to a cache-line boundary. Estimate how many cache lines are touched by one sequential scan.
Solution
Each 32-bit value uses:
Total array size:
Number of cache lines:
Engineering Comment
A sequential scan has strong spatial locality. The estimate assumes aligned storage and no conflict effects. Real measurements should confirm cache misses, prefetch behavior, competing memory traffic, and whether the scan fits the surrounding workload.
Exercise 2: Average Memory Access Time
A processor has cache hit time:
miss rate:
and miss penalty:
Estimate average memory access time:
Solution
Substitute:
Engineering Comment
The 4% miss rate more than doubles effective access time. A small miss-rate change can matter more than a small clock-speed change. The result should be measured with representative data size, access order, and cache-warm state.
Exercise 3: Memory Footprint of a Dense Table
A dense table stores:
records. Each record has:
of payload and the table has:
of fixed metadata. Estimate total memory:
Solution
Compute:
Using decimal megabytes:
Engineering Comment
The payload fits easily in many systems, but the release budget should include alignment, allocator overhead, indexes, staging buffers, logs, serialization, and peak memory during update or copy operations.
Exercise 4: Pointer-Based Node Overhead
A tree stores:
nodes. Each node contains:
of payload, two 8-byte pointers, and:
of allocator or metadata overhead. Estimate total memory:
with:
Solution
Memory per node:
Total:
So:
Engineering Comment
The pointer-based structure uses less payload than total memory. It may also have poor locality because nodes can be scattered. For scan-heavy workloads, a packed array or B-tree-like layout may outperform a theoretically elegant pointer tree.
Exercise 5: Cache-Fit Screening
A hot working set has:
The processor has:
and:
Does the working set fit in L1 and L2?
Solution
Compare sizes:
so the working set does not fit in L1.
Also:
so it can fit in L2 under ideal conditions.
Engineering Comment
This screen suggests L1 misses may be important while L2 can absorb much of the working set. Real behavior depends on associativity, conflicts, code footprint, other data, prefetching, interrupts, and whether the working set is reused before eviction.
Exercise 6: Row-Major Locality
A matrix has:
elements, each:
The language stores rows contiguously. How many bytes are in one row, and why can column-wise traversal be slower?
Solution
One row contains:
Therefore one row is:
In row-wise traversal, adjacent accesses usually move through contiguous memory. In column-wise traversal, consecutive accesses jump by one full row:
Engineering Comment
The algorithmic operation count is the same for both traversals, but memory locality is different. Column-wise traversal can waste cache lines and defeat prefetching. Performance review should include access pattern, not only complexity.
Exercise 7: Raw and Effective Bus Bandwidth
A memory bus has width:
clock rate:
and effective efficiency:
Estimate effective bandwidth:
Solution
Raw bandwidth:
Effective bandwidth:
In bytes per second:
Engineering Comment
Efficiency accounts for overhead, wait states, arbitration, refresh, contention, and protocol limits. If measured bandwidth is lower, check burst size, alignment, cache policy, DMA setup, and competing bus masters.
Exercise 8: Transfer Time with Setup Latency
A block transfer moves:
over an effective link:
The transfer also has fixed setup latency:
Estimate total transfer time:
Solution
Payload transfer time:
Convert setup latency:
Total:
Therefore:
Engineering Comment
For a large transfer, setup latency is almost negligible. For small messages, the same setup latency may dominate. A benchmark should include the message sizes used by the real system.
Exercise 9: Hash Table Load Factor
A hash table stores:
entries in:
buckets. Compute load factor:
Solution
Substitute:
Engineering Comment
A load factor of 0.75 can be reasonable, but collision behavior depends on hash quality, probing strategy, key distribution, and resizing policy. Real-time systems should avoid unbounded resize pauses or collision chains.
Exercise 10: Dynamic Array Growth Cost
A dynamic array grows from capacity:
to:
when a new element is appended. Each existing element is:
Estimate bytes copied during the resize.
Solution
The resize copies the existing capacity:
Therefore:
Engineering Comment
Amortized append cost may be constant, but this single resize can create a latency spike. Systems with hard deadlines may need reserved capacity, bounded containers, chunked storage, or incremental migration.
Exercise 11: Queue Utilization
Requests arrive at:
There are:
workers. Each worker can serve:
Estimate utilization:
Solution
Total service capacity:
Utilization:
Engineering Comment
The average load is below capacity, but burstiness, service-time variation, priority work, locks, and retries can still create tail latency. Queue validation should include high-percentile latency and overload behavior.
Exercise 12: Amdahl Speedup
A workload has serial fraction:
and runs on:
processors. Estimate ideal speedup:
Solution
Substitute:
Engineering Comment
Eight processors do not give 8x speedup because serial work remains. Real speedup can be lower because of synchronization, memory bandwidth, cache coherence, load imbalance, and I/O.
Exercise 13: Tail Latency from Measurements
A benchmark records these request latencies in milliseconds:
Compute the maximum observed latency and explain why the average is insufficient.
Solution
The maximum observed latency is:
The average is:
Engineering Comment
The 10.7 ms average hides a 42 ms stall. For user-facing, real-time, or service-level systems, high-percentile latency and deadline misses may matter more than the mean.
Exercise 14: Performance Claim Validation
A cache optimization reduces average runtime from:
to:
Compute percentage improvement relative to the old runtime:
Solution
Substitute:
Therefore:
Engineering Comment
A 35% average improvement is valuable, but release evidence should include input distribution, cold-cache and warm-cache results, memory footprint, tail latency, processor and compiler version, thermal state, and regression results for correctness.