Deep Dive into CPU Cache Memory: Solving the Memory Wall

In this blog post, you will explore the concepts of CPU caching and learn how this technology solves the "memory wall" problem for performance-critical applications.

Keywords: cache memory, memory hierarchy, L1 cache, write-back policy, write-through policy, write-allocate, non-blocking cache, blocking cache

Introduction: The Memory Wall Problem

Imagine a Formula 1 car forced to refuel through a drinking straw. That is the challenge a modern CPU faces: the processor executes instructions in nanoseconds, yet accessing main memory (DRAM) takes hundreds of cycles. This performance gap is called the "memory wall," and it threatens to stall computational progress. The solution is a sophisticated memory hierarchy built on caching, whose layers work in concert to deliver more than 95% of requested data within 1-2 cycles, masking memory latency and enabling modern computing. In this guide, we will look at how cache write policies, non-blocking caches, and other advanced techniques are used to optimize systems.

1. The Caching Pyramid: L1, L2, and L3

Caches are small, ultrafast static RAM (SRAM) units that act as temporary storage between the CPU and the slower main memory (DRAM). Their design exploits temporal locality (recently accessed data is likely to be reused) and spatial locality (data adjacent to a recent access is likely to be needed soon).
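
To see locality in action, here is a small C sketch (the array size and function names are arbitrary choices for illustration) that sums the same matrix in two different orders. The row-major loop walks memory sequentially and reuses each fetched cache line; the column-major loop jumps across lines and wastes most of each fetch.

```c
#include <stddef.h>

#define N 1024

static double matrix[N][N];   /* C stores 2-D arrays row-major */

/* Good spatial locality: consecutive iterations touch adjacent
   addresses, so each fetched cache line is fully used. */
double sum_row_major(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += matrix[i][j];
    return sum;
}

/* Poor spatial locality: each iteration jumps N * sizeof(double)
   bytes, landing on a different cache line almost every time. */
double sum_col_major(void)
{
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += matrix[i][j];
    return sum;
}
```

Both functions compute the same result; on a typical machine the row-major version runs several times faster purely because of cache behavior.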

Modern CPU Cache Structure
Cache Level    | Size        | Latency      | Location            | Associativity
L1 Instruction | 32-64 KB    | 1-3 cycles   | Per-core            | 4-8 way set-associative
L1 Data        | 32-64 KB    | 1-3 cycles   | Per-core            | 8-12 way set-associative
L2             | 256 KB-2 MB | 8-12 cycles  | Per-core/Shared     | 16-way set-associative
L3 (LLC)       | 16-128 MB   | 30-50 cycles | Shared across cores | 16-32 way set-associative

1.1 Write Policies:

  1. Write-Through (WT): Every write updates the cache and main memory simultaneously. Write latency is high because the CPU waits for the slower DRAM write to complete, and bandwidth usage rises because every write generates memory traffic. It is used in systems requiring strong consistency, e.g., financial databases and RAID controllers.
  2. Write-Back (WB): Data is initially written only to the cache; main memory is updated later, e.g., when the cache line is evicted. Modified lines are marked with a "dirty bit." Write latency is low because the CPU proceeds as soon as the cache write completes, but there is a risk of data loss on a power failure, and multi-core systems require sophisticated coherence protocols. It is employed in write-intensive workloads, e.g., video rendering and scientific simulations.
  3. Write-Around (WA): Writes go straight to main memory, bypassing the cache; the cache is updated only if the data is read later. This reduces cache pollution by keeping one-time writes out of the cache, but incurs a high read-miss penalty if the bypassed data is accessed again soon. It is used in logging systems or workloads with low read-after-write locality. A simplified simulation of the first two policies appears after this list.
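
To make the write-hit behavior concrete, here is a minimal, hypothetical C sketch of a single cache line handling a write under write-through and write-back. The types and helpers (cache_line_t, write_word, evict, the dram array) are illustrative assumptions, not a real hardware interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy_t;

typedef struct {
    uint8_t data[LINE_SIZE];  /* cached copy of one memory block      */
    bool    dirty;            /* set only by write-back on a write    */
} cache_line_t;

/* Simulated backing store: touching "DRAM" is the slow path. */
static uint8_t dram[LINE_SIZE];

void write_word(cache_line_t *line, size_t offset,
                uint8_t value, write_policy_t policy)
{
    line->data[offset] = value;      /* the cache is always updated on a hit */

    if (policy == WRITE_THROUGH) {
        dram[offset] = value;        /* pay the DRAM latency immediately */
    } else {                         /* WRITE_BACK */
        line->dirty = true;          /* defer the DRAM write             */
    }
}

/* Under write-back, the deferred cost is paid at eviction time. */
void evict(cache_line_t *line)
{
    if (line->dirty) {
        memcpy(dram, line->data, LINE_SIZE);  /* flush the whole dirty line */
        line->dirty = false;
    }
}
```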
1.1.1 Allocation Policies: Handling Write Misses
When a write targets data not in the cache, policies decide whether to fetch the block:
Allocation Policies
Policy            | Mechanism                                       | Performance Impact
Write-Allocate    | Loads the block into the cache, then updates it | Benefits read-after-write sequences; increases write latency
No-Write-Allocate | Writes directly to memory; skips the cache      | Faster for isolated writes; subsequent reads suffer misses
1.1.2 Common Pairings:
  • Write-Back + Write-Allocate: maximizes performance for repeated writes to the same lines, as in CPU L1 data caches.
  • Write-Through + No-Write-Allocate: prevents needless cache loading, as in I/O buffers (see the sketch below).
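
Extending the same hypothetical sketch, a write miss can be handled either way; the allocation policy decides whether the block is fetched into the cache first. This reuses cache_line_t, write_word, LINE_SIZE, and dram from the earlier example, and the remaining names are again illustrative.

```c
typedef enum { WRITE_ALLOCATE, NO_WRITE_ALLOCATE } alloc_policy_t;

static cache_line_t the_line;        /* a one-line "cache", for brevity */

/* Stub for the slow fill path: copy the block from DRAM into the cache. */
static cache_line_t *fetch_line(void)
{
    memcpy(the_line.data, dram, LINE_SIZE);
    the_line.dirty = false;
    return &the_line;
}

/* offset indexes within the single simulated line. */
void handle_write_miss(size_t offset, uint8_t value,
                       write_policy_t wp, alloc_policy_t ap)
{
    if (ap == WRITE_ALLOCATE) {
        /* Fetch the block first, then write into the cache.
           Typically paired with write-back (e.g., L1 data caches). */
        cache_line_t *line = fetch_line();
        write_word(line, offset, value, wp);
    } else {
        /* NO_WRITE_ALLOCATE: skip the cache entirely on the miss.
           Typically paired with write-through (e.g., I/O buffers). */
        dram[offset] = value;
    }
}
```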
1.1.3 Measuring Cache Performance:
Hit Rate and Miss Rate:
The hit rate is defined as "cache hits divided by total cache accesses." 

Hit Rate (HR) = Cache Hits / Total Cache Accesses

The miss rate is then defined as 1 - HR:

Miss Rate (MR) = 1 - HR = Cache Misses / Total Cache Accesses

Hit rates greater than 95% are typical for well-tuned L1 caches; L2/L3 might see 80-90%.
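
Given raw hit and miss counts, e.g., from a simulator or from hardware performance counters, these ratios are a one-liner to compute. A small helper might look like this (the function and parameter names are assumptions, not a real API):

```c
#include <stdio.h>

/* Compute hit and miss rates from raw event counts
   (hits + misses = total accesses). */
void report_cache_rates(unsigned long hits, unsigned long misses)
{
    unsigned long total = hits + misses;
    if (total == 0)
        return;                                   /* avoid division by zero */

    double hit_rate  = (double)hits / (double)total;  /* HR = hits / accesses */
    double miss_rate = 1.0 - hit_rate;                /* MR = 1 - HR          */

    printf("hit rate:  %.2f%%\n", hit_rate  * 100.0);
    printf("miss rate: %.2f%%\n", miss_rate * 100.0);
}
```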

Miss Penalties:

The miss penalty (MP) is the number of additional cycles required to service a miss from the next level, e.g., L2 or DRAM. If an L1 miss costs 10 extra cycles to fetch from L2, then MP = 10 cycles.

Average Memory Access Time (AMAT):

AMAT gives a single-number view of performance impact:

AMAT = L1 Hit Time + (L1 Miss Rate) × L1 Miss Penalty

Expanding this to two levels:

AMAT = T_L1 + MR_L1 × (T_L2 + MR_L2 × T_DRAM)

Where:

  • T_L1 = L1 hit latency
  • MR_L1 = L1 miss rate
  • T_L2 = L2 hit latency
  • MR_L2 = L2 miss rate
  • T_DRAM = DRAM access time
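
Plugging illustrative numbers into the two-level formula shows how strongly even a small L1 miss rate shapes average latency. The latencies and miss rates below are assumptions chosen for the example, not measurements:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed latencies (cycles) and miss rates, for illustration only. */
    double t_l1 = 2.0, t_l2 = 10.0, t_dram = 200.0;
    double mr_l1 = 0.05, mr_l2 = 0.20;

    /* AMAT = T_L1 + MR_L1 * (T_L2 + MR_L2 * T_DRAM) */
    double amat = t_l1 + mr_l1 * (t_l2 + mr_l2 * t_dram);

    printf("AMAT = %.1f cycles\n", amat);   /* 2 + 0.05*(10 + 0.2*200) = 4.5 */
    return 0;
}
```

With these numbers, AMAT works out to 2 + 0.05 × (10 + 0.2 × 200) = 4.5 cycles, more than double the raw L1 hit time, even though 95% of accesses hit in L1.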

1.2 Blocking vs. Non-Blocking Caches:

Blocking Cache:
On a cache miss, the processor pipeline stalls until the data is fetched from a lower memory level. Only a single outstanding miss is allowed; subsequent requests wait in a queue. An analogy is a toll booth where cars wait in line and only one vehicle is processed at a time.

Non-Blocking Cache:
It keeps serving cache hits while permitting several unresolved misses, using Miss Status Handling Registers (MSHRs) to track the pending requests. An analogy is a drive-thru with parallel ordering stations: cars can place new orders while earlier ones are still being prepared.
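
Conceptually, the MSHR file is a small table of in-flight misses: each entry records which block is being fetched and which instructions are waiting on it, and a miss to a block that is already in flight simply joins the existing entry. The C sketch below is a software approximation of that idea (the field names, sizes, and record_miss function are illustrative, not taken from any real design):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS        8   /* outstanding misses supported in parallel */
#define WAITERS_PER_MSHR 4   /* loads/stores waiting on the same block   */

typedef struct {
    bool     valid;                      /* entry tracks an in-flight miss */
    uint64_t block_addr;                 /* block being fetched            */
    int      waiters[WAITERS_PER_MSHR];  /* IDs of stalled instructions    */
    int      num_waiters;
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* On a miss: merge with an existing entry for the same block if possible,
   otherwise allocate a new one. Returns false when the MSHR file cannot
   accept the miss, which is when even a non-blocking cache must stall. */
bool record_miss(uint64_t block_addr, int instr_id)
{
    int free_slot = -1;
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
            if (mshrs[i].num_waiters < WAITERS_PER_MSHR) {
                mshrs[i].waiters[mshrs[i].num_waiters++] = instr_id;
                return true;             /* secondary miss: no new fetch */
            }
            return false;                /* waiter list full: stall      */
        }
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                    /* all MSHRs busy: stall        */

    mshrs[free_slot] = (mshr_t){ .valid = true, .block_addr = block_addr,
                                 .waiters = { instr_id }, .num_waiters = 1 };
    return true;                         /* primary miss: fetch begins   */
}
```

When every entry is occupied, or an entry's waiter list is full, even a non-blocking cache has to stall, which is why the number of MSHRs bounds the memory-level parallelism a core can exploit.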

1.3 Conclusion:

Mastering cache memory is essential for breaking the memory wall. Optimize your hierarchy, write policies, and advanced features such as non-blocking caches to accelerate applications from scientific simulation to gaming.
