Why Is Fast Memory Expensive?

The ideal memory would be as fast as the CPU, unlimited in capacity, and inexpensive. In reality, no memory technology satisfies all three conditions simultaneously. Faster memory costs more per bit, and cheaper memory is slower to access. This is a physical constraint. SRAM requires six transistors to store a single bit, while DRAM needs only one transistor and one capacitor. Fewer transistors mean higher density and lower cost, but correspondingly slower access.

Can this physical limitation be overcome entirely? It cannot, but it can be circumvented, and that strategy is the memory hierarchy.

The Memory Hierarchy Pyramid

The memory hierarchy is a design strategy that arranges multiple levels of memory with different speeds, capacities, and costs in a layered structure, creating the overall illusion of fast, large memory.

            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚ Registers β”‚   ~1 ns       tens to hundreds of bytes
          β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
          β”‚   L1 Cache    β”‚   ~1 ns       32-64 KB
        β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
        β”‚     L2 Cache      β”‚   ~3-10 ns    256 KB - 1 MB
      β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
      β”‚       L3 Cache        β”‚   ~10-30 ns   4-64 MB
    β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
    β”‚          DRAM             β”‚   ~50-100 ns  8-128 GB
  β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
  β”‚            SSD                β”‚   ~25-100 ΞΌs  256 GB - 4 TB
β”Œβ”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”
β”‚              HDD                  β”‚   ~3-10 ms    1-20 TB
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Registers at the top of the pyramid operate at CPU clock speed but number only a few dozen. HDDs at the bottom provide terabytes of capacity but have access times measured in milliseconds. Between registers and HDDs, there is roughly a millionfold difference in speed.

This hierarchy is effective because programs exhibit specific regularities in how they access memory.

Locality of Reference

Most programs do not use memory uniformly. At any given moment, the memory addresses being accessed tend to cluster around recently accessed addresses or their neighbors. This is called locality of reference.

Temporal locality means that recently accessed data is likely to be accessed again in the near future. Repeatedly reading and writing the same variable inside a loop is a typical example. Loop counters, accumulator variables, and condition flags are referenced continuously throughout loop execution.

Spatial locality means that data near an accessed address is also likely to be accessed soon. Sequentially traversing an array is the classic case. If arr[0] has been accessed, the prediction that arr[1] and arr[2] will follow is correct in most cases.

Each level of the memory hierarchy exploits precisely this locality. By keeping frequently used data in faster tiers, it minimizes the number of times the CPU must descend to slower memory.
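As a concrete illustration, the two traversal orders below touch exactly the same elements but differ sharply in spatial locality. The array dimensions are arbitrary, and the 64-byte line size mentioned in the comments is the typical value on modern CPUs, not something this sketch measures:

```c
#include <stddef.h>

#define ROWS 512
#define COLS 512

/* Row-major traversal: consecutive iterations touch adjacent
   addresses, so each fetched 64-byte cache line is fully used
   before the next one is needed (good spatial locality). */
long sum_row_major(int m[ROWS][COLS]) {
    long total = 0;                 /* accumulator: temporal locality */
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}

/* Column-major traversal of the same data: each step jumps
   COLS * sizeof(int) bytes ahead, so nearly every access pulls
   in a new cache line and most of each line goes unused. */
long sum_col_major(int m[ROWS][COLS]) {
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}
```

On typical hardware the row-major version runs several times faster, because each fetched line contributes sixteen consecutive 4-byte ints before being evicted.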

Cache Organization

A cache is a small, high-speed memory located between the CPU and main memory that holds copies of frequently used data. When the CPU requests data, it first checks the cache. If the desired data is present, it is returned quickly as a cache hit. If absent, it is a cache miss, and the data must be fetched from a lower tier.

Caches manage data not in individual bytes but in units called cache lines. In modern processors, the cache line size is typically 64 bytes. Even if only 1 byte is needed, the entire 64 bytes are fetched, exploiting spatial locality. Nearby data is likely to be needed soon, so fetching it together is efficient.

How is it determined where in the cache a particular memory address is stored? This is the cache mapping scheme, and there are three fundamental strategies.

In a direct-mapped cache, each memory address maps to exactly one location in the cache. A portion of the address bits serves as an index to select the cache line. The implementation is simple and lookups are fast, but when different addresses map to the same cache location, they repeatedly evict each other, causing conflict misses.

Memory address:  [   Tag   |   Index   |  Offset  ]
                     β”‚           β”‚          β”‚
                     β”‚           β”‚          └─▢ Byte position within the cache line
                     β”‚           └────────────▢ Selects the set within the cache
                     └────────────────────────▢ Identifies which address the stored line came from
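The split can be computed with a few shifts and masks. The geometry below (a 32 KB direct-mapped cache with 64-byte lines, giving a 6-bit offset and a 9-bit index) is a hypothetical example, not any specific CPU:

```c
#include <stdint.h>

#define LINE_SIZE   64u    /* bytes per cache line -> 6 offset bits */
#define NUM_LINES   512u   /* 32 KB / 64 B         -> 9 index bits  */
#define OFFSET_BITS 6u
#define INDEX_BITS  9u

/* Byte position within the 64-byte line. */
uint32_t line_offset(uint32_t addr) { return addr & (LINE_SIZE - 1); }

/* Which of the 512 lines this address maps to. */
uint32_t line_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & (NUM_LINES - 1); }

/* Remaining high bits, stored alongside the line to identify it. */
uint32_t line_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

A set-associative cache uses the same decomposition, just with fewer index bits because several ways share each set.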

In a fully-associative cache, data can be stored in any location. Conflict misses do not occur, but finding data requires simultaneously comparing all cache line tags, making the hardware cost very high. This scheme is used only for special purposes such as small TLBs (Translation Lookaside Buffers).

A set-associative cache is a compromise between the two. The cache is divided into multiple sets, and each memory address maps to one set, but within that set it can be stored in any of several positions (ways). In an 8-way set-associative cache, each set contains 8 cache lines, so up to 8 addresses mapping to the same set can coexist without conflict. Modern L1 caches typically use 4-way or 8-way set-associative structures.

Types of Cache Misses

Cache misses fall into three categories. Cold (compulsory) misses occur on the first access to data and are unavoidable. Capacity misses occur when the total cache size is smaller than the working set. Conflict misses occur when cache capacity is sufficient but mapping collisions cause evictions, and they appear only in direct-mapped or set-associative caches.

In performance-sensitive code, identifying which type of miss dominates is the starting point for optimization. Capacity misses call for compressing the data structures or shrinking the working set. Conflict misses call for changing data placement, for example padding or realigning structures so that hot addresses no longer map to the same set.
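A toy tag-only cache model makes conflict misses easy to see. The geometry here (8 direct-mapped lines of 64 bytes) is invented for the example:

```c
#include <stdint.h>

#define LINES 8u
#define LINE  64u

static uint32_t tags[LINES];   /* stored tag per line */
static int      valid[LINES];  /* line holds data?    */
static unsigned hits, misses;

void cache_reset(void) {
    for (unsigned i = 0; i < LINES; i++) valid[i] = 0;
    hits = misses = 0;
}

void cache_access(uint32_t addr) {
    uint32_t index = (addr / LINE) % LINES;      /* which line       */
    uint32_t tag   = addr / (LINE * LINES);      /* identifies owner */
    if (valid[index] && tags[index] == tag) {
        hits++;
    } else {                   /* cold, capacity, or conflict miss */
        misses++;
        valid[index] = 1;      /* evict whatever was there */
        tags[index]  = tag;
    }
}
```

Alternating between addresses 0x000 and 0x200, which are 512 bytes apart and therefore share index 0, misses on every access even though seven of the eight lines stay empty: a pure conflict-miss pattern.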

Write Policies

When the CPU modifies data, a choice must be made about how to maintain consistency between the cache and main memory.

The write-through policy writes to main memory simultaneously with every cache write. Memory always holds the latest data, simplifying implementation, but every write incurs a slow memory access, hurting performance. A write buffer is commonly used to prevent the CPU from stalling on memory writes.

The write-back policy writes only to the cache and reflects changes to main memory only when the cache line is evicted. Modified cache lines have a dirty bit set, indicating they must be written back upon eviction. This greatly reduces the number of memory writes, yielding excellent performance, but data between cache and memory may be temporarily inconsistent. Modern L1 and L2 caches almost universally use the write-back policy.
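The difference in memory traffic comes down to simple counting. This is a counting model only, assuming n stores all land in one cached line that is eventually evicted:

```c
/* Write-through: every store also goes to main memory. */
unsigned long mem_writes_write_through(unsigned long n) {
    return n;                  /* one memory write per store */
}

/* Write-back: stores only dirty the cached line; memory is
   updated once, when the dirty line is finally evicted. */
unsigned long mem_writes_write_back(unsigned long n) {
    return n > 0 ? 1 : 0;      /* a single write-back on eviction */
}
```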

The Cache Coherence Problem

In a single core, inconsistencies between cache and memory are transparently managed within the hardware. In multicore environments, however, the problem becomes complicated. If Core A modifies variable x in its cache, and Core B reads the same variable x from its own cache, Core B sees a stale value.

This is called the cache coherence problem, and it must be solved automatically by hardware. The most widely used protocol is the MESI protocol, which manages each cache line in one of four states.

State          Meaning
Modified (M)   Only this core has it; newer than memory
Exclusive (E)  Only this core has it; same as memory
Shared (S)     Multiple cores have it; same as memory
Invalid (I)    Data is not valid

When Core A wants to modify a cache line in the Shared state, it first sends an invalidate message to other cores, marking their copies as Invalid. It then changes its own cache line to the Modified state and writes the new value. When Core B subsequently reads the same address, it obtains the latest value from Core A's Modified cache line.

This process is performed automatically by hardware, but it is not free. Inter-core message exchanges take tens of clock cycles. When multiple cores in a multithreaded program frequently modify data in the same cache line, a flood of invalidation messages causes performance to degrade sharply. This is called false sharing, and it occurs when different variables happen to reside in the same cache line.
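The standard remedy is to pad or align hot per-thread variables so they occupy separate cache lines. A sketch in C11, assuming the common 64-byte line size:

```c
#include <stdalign.h>

/* Two counters updated by different threads. Packed together they
   would share one 64-byte line and ping-pong between cores; the
   alignas(64) on each member forces them onto separate lines. */
struct padded_counters {
    alignas(64) long counter_a;   /* starts its own cache line   */
    alignas(64) long counter_b;   /* pushed to the next line     */
};
```

The cost is wasted space (here 128 bytes for 16 bytes of data), which is why padding is reserved for genuinely contended variables.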

DRAM Internals

DRAM, used as main memory, is internally organized as a two-dimensional array. Each bit is stored as the presence or absence of charge in a capacitor, accessed by row and column coordinates.

DRAM access occurs in two stages. First, the row address is sent along with the RAS (Row Address Strobe) signal, and the entire row is copied into a row buffer called a sense amplifier. This process is called row activation. Then the column address is sent with the CAS (Column Address Strobe) signal, and the data at that column is output from the row buffer.

Reading different columns from an already activated row is fast, requiring only the CAS latency. Accessing a different row, however, requires closing the current row and activating a new one, which takes considerably longer. This is why DRAM's effective performance varies greatly depending on the memory access pattern. Sequential access being faster than random access is precisely due to this characteristic.
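The row-hit versus row-miss gap can be expressed with the usual DRAM timing parameters. The cycle counts below are made-up but plausible values for illustration; real parts specify CL, tRCD, and tRP in their datasheets:

```c
/* Illustrative DRAM timings in memory-clock cycles (assumptions,
   not figures from a datasheet). */
enum { T_CAS = 16,   /* CL: column read from an open row        */
       T_RCD = 16,   /* row activate: copy row into row buffer  */
       T_RP  = 16 }; /* precharge: close the currently open row */

/* Row hit: the requested row is already in the row buffer,
   so only the CAS latency applies. */
int dram_cycles_row_hit(void)  { return T_CAS; }

/* Row miss: close the open row, activate the new one, then read. */
int dram_cycles_row_miss(void) { return T_RP + T_RCD + T_CAS; }
```

With these numbers a row miss costs three times a row hit, which is why sequential access that stays within an open row is so much faster.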

Another property of DRAM is that it requires refresh. Capacitor charge naturally dissipates over time, so typically every 64 milliseconds all rows must be read and rewritten to restore their charge. During refresh, the affected bank cannot be accessed, which also impacts performance.

Memory Latency by the Numbers

Comparing access times across memory hierarchy levels makes the importance of caches unmistakable.

Operation               Approximate Time
L1 cache hit            1 ns
L2 cache hit            3-10 ns
L3 cache hit            10-30 ns
DRAM access             50-100 ns
NVMe SSD random read    25-100 ΞΌs
SATA SSD random read    100-200 ΞΌs
HDD random read         3-10 ms

Between an L1 cache hit and a DRAM access, there is roughly a 100x difference. Between DRAM and HDD, the gap is approximately 100,000x. Whether a program's working set fits in L1, spills to L3, or falls through to DRAM determines performance dramatically.

These numbers explain why the memory layout of data structures matters as much as an algorithm's time complexity. An algorithm that is theoretically more efficient but causes many cache misses can perform worse in practice than a simpler, cache-friendly algorithm. The reason arrays outperform linked lists in most cases stems directly from this difference in spatial locality.

Memory Bandwidth

Alongside access latency, memory bandwidth is an equally important metric. Bandwidth refers to the amount of data that can be transferred per unit time. Modern DDR5 memory provides a theoretical bandwidth of roughly 50-60 GB/s per channel. A dual-channel configuration doubles this figure.
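The theoretical figure is simple arithmetic: transfer rate times bus width. The sketch below assumes a 64-bit (8-byte) channel at an illustrative DDR5-6400 rate:

```c
/* Peak bandwidth in MB/s: transfers per microsecond (MT/s) times
   bytes moved per transfer. Both arguments are assumptions chosen
   for the example, not fixed properties of DDR5. */
unsigned long peak_bandwidth_mb_s(unsigned long mega_transfers_per_s,
                                  unsigned long bus_bytes) {
    return mega_transfers_per_s * bus_bytes;
}
```

peak_bandwidth_mb_s(6400, 8) gives 51,200 MB/s, about 51.2 GB/s, consistent with the 50-60 GB/s range; doubling it for a dual-channel configuration gives roughly 102 GB/s.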

In practice, however, fully utilizing this bandwidth is not straightforward. Random access patterns significantly reduce effective bandwidth due to row activation overhead. To maximize bandwidth utilization, it is necessary to maintain sequential access patterns, leverage prefetching, and issue multiple memory requests in parallel. The CPU's hardware prefetcher detects access patterns and speculatively fetches data predicted to be needed into the cache, operating most effectively with sequential or regular-stride access patterns.

In the next post, we'll look at virtual memory.