Why Clock Speeds Stopped Increasing

Until the early 2000s, the formula for processor performance improvement was straightforward: just increase the clock speed. From 100MHz to 1GHz to 3GHz, clock speeds climbed steeply year after year. Then, around 2004, the ascent suddenly stopped. More than twenty years later, desktop processor base clocks still hover around 3 to 5GHz. What created this wall?

Three walls acted simultaneously. The first is the power wall. A processor's dynamic power consumption is proportional to clock frequency and proportional to the square of voltage. Since increasing the clock requires raising voltage as well, power consumption surges to the point where the chip generates more heat than can be physically dissipated.

The second is the ILP wall (Instruction-Level Parallelism wall). There is a fundamental limit to the instruction-level parallelism that can be extracted through superscalar and out-of-order execution. Data dependencies and control dependencies in programs determine this limit, and no amount of hardware complexity can break through it.

The third is the memory wall. CPU performance improved by tens of percent annually, but memory latency improvement was only a few percent per year. No matter how fast the CPU became, the time spent waiting for data from memory increasingly became the bottleneck.

Multicore as the Solution

When single-core performance improvement hit its ceiling, processor designers changed direction. Instead of making a single core faster, they integrated multiple cores onto a single chip. Since Intel's Pentium D and AMD's Athlon 64 X2 introduced dual cores to the consumer market in 2005, core counts have steadily increased, with 16 to 24 cores now common even on desktops.

The performance benefit of multicore manifests in parallelizable workloads. Distributing independent tasks across multiple cores allows throughput to scale proportionally with the number of cores. However, as Amdahl's Law shows, the sequential portion of a program determines the upper bound of overall speedup. If 10% of a program is sequential, no amount of additional cores can achieve more than a 10x speedup.

SMP Architecture

Symmetric Multi-Processing (SMP) is a structure where all processors have equal access to the same shared memory. Regardless of which core accesses which memory address, the latency is the same. From the operating system's perspective, all cores are equivalent, so scheduling is relatively straightforward since placing a process on any core yields the same performance.

However, SMP has scalability limitations. As core counts increase, shared memory bus bandwidth becomes a bottleneck, and traffic for maintaining cache coherence also surges. SMP generally operates efficiently up to 8 to 16 cores; beyond that, a different approach is needed.

NUMA: Non-Uniform Memory Access

NUMA (Non-Uniform Memory Access) is an architecture designed to overcome the scalability limitations of SMP. Each processor (or group of processors) has its own local memory nearby, and while it can access other processors' memory, the latency is longer.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     NUMA Node 0      β”‚       β”‚     NUMA Node 1      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”   β”‚       β”‚  β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚Core 0β”‚ β”‚Core 1β”‚   β”‚       β”‚  β”‚Core 4β”‚ β”‚Core 5β”‚   β”‚
β”‚  β”‚Core 2β”‚ β”‚Core 3β”‚   β”‚       β”‚  β”‚Core 6β”‚ β”‚Core 7β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜   β”‚ Inter-β”‚  β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚  Local Memory 32GB   │◀─────▢│  Local Memory 32GB   β”‚
β”‚  (local:  ~80ns)     β”‚connectβ”‚  (local:  ~80ns)     β”‚
β”‚  (remote: ~130ns)    β”‚       β”‚  (remote: ~130ns)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In NUMA environments, placing data in the same node's memory as the cores that use it is critical for performance. In database or virtualization workloads, failing to use NUMA-aware memory allocation leads to frequent remote memory accesses and significant performance degradation. Ignoring NUMA topology on server systems is one of the most common performance mistakes.
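On Linux, the simplest way to control placement is the numactl tool. A sketch, assuming the numactl package is installed and using a hypothetical ./server binary:

```shell
# Show the machine's NUMA topology: nodes, their CPUs, memory sizes,
# and the inter-node distance matrix
numactl --hardware

# Run a process with its CPUs and memory allocations pinned to node 0,
# so all of its memory accesses stay local
numactl --cpunodebind=0 --membind=0 ./server

# Inspect per-node allocation statistics for a running process
numastat -p "$(pidof server)"
```

For finer-grained control, applications can link against libnuma and place individual allocations on specific nodes, which is what NUMA-aware databases and hypervisors do internally.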

Cache Coherence: The MESI Protocol

In multicore processors, each core has its own cache. When multiple cores hold data for the same memory address in their respective caches and one core modifies that data, the copies in other cores' caches become stale. Cache coherence protocols solve this problem.

In the widely used MESI protocol, each cache line is in one of four states:

| State | Meaning |
|-------|---------|
| Modified (M) | Only this cache holds the line, and it has been modified; it differs from memory |
| Exclusive (E) | Only this cache holds the line; it matches memory |
| Shared (S) | Multiple caches may hold the line; it matches memory |
| Invalid (I) | The cache line holds no valid data |

When core A wants to modify data in Shared state, it must first send an invalidate message to all other cores to change their cache lines to Invalid, then transition its own cache line to Modified state. This process is performed automatically by hardware and is transparent to the programmer, but propagating and acknowledging the invalidation takes time.

This is the cause of the false sharing problem in multithreaded programming. Even when different cores modify logically independent variables, if those variables reside on the same cache line, the entire cache line is repeatedly invalidated, resulting in significant performance degradation.

Memory Ordering and Memory Barriers

Modern processors can reorder memory accesses for performance. Even if program code writes A before B, another core may observe B as having changed before A. In single-threaded programs this is not an issue since the processor guarantees results as if instructions executed in program order, but in multithreaded environments this reordering can cause subtle, hard-to-reproduce bugs.

Memory barriers (or fences) are instructions that prevent such reordering. They guarantee that memory accesses before the barrier complete before any accesses after it. Synchronization mechanisms such as locks, atomic operations, and volatile variables internally use appropriate memory barriers to ensure memory visibility.

x86 adopts a relatively strong memory ordering model (TSO, Total Store Order) that behaves intuitively in most cases, but architectures like ARM use weak memory ordering models requiring more memory barriers. This is precisely why the same multithreaded code may work correctly on x86 but exhibit bugs when ported to ARM.

Simultaneous Multithreading: SMT and Hyper-Threading

Simultaneous Multithreading (SMT) is a technique where a single physical core executes multiple hardware threads concurrently. Intel markets this as Hyper-Threading.

How can a single core execute two threads simultaneously? The key insight is that a core's execution resources are not always 100% utilized. While one thread waits for memory due to a cache miss, the execution units sit idle. SMT fills this idle time by interleaving instructions from multiple threads on a single core. Each hardware thread has its own set of registers and program counter, but execution resources such as the ALU, caches, and branch predictor are shared.

SMT typically provides 10 to 30 percent performance improvement, but for compute-intensive workloads that already fully utilize execution units, the benefit may be negligible or performance may even degrade due to cache contention.

GPU Architecture Overview

GPUs (Graphics Processing Units) embody a fundamentally different design philosophy from CPUs. While CPUs maximize single-thread performance with a few complex cores, GPUs optimize for massive parallelism with thousands of simple cores.

The GPU's basic execution model is SIMT (Single Instruction, Multiple Threads), an extension of SIMD (Single Instruction, Multiple Data). A single instruction is applied simultaneously to tens or hundreds of threads. This structure is why GPUs can achieve tens of times higher throughput than CPUs for workloads that apply the same operation independently to large amounts of data, such as matrix operations, image processing, and physics simulations.

The explosive growth of deep learning has recently expanded the GPU's role far beyond graphics rendering into general-purpose parallel computing (GPGPU). Neural network training and inference are fundamentally large-scale matrix multiplications, which align perfectly with GPU architecture.

Heterogeneous Computing

One prominent trend in modern processor design is heterogeneous computing. Rather than designing all cores identically, this approach combines cores with different characteristics suited to different purposes.

ARM's big.LITTLE architecture places high-performance big cores alongside low-power little cores. Lightweight tasks like checking email are handled by the little cores to conserve battery, while demanding tasks like gaming or video editing engage the big cores. Intel's recent processors have similarly adopted a heterogeneous structure combining P-cores (Performance) and E-cores (Efficiency).

More broadly, the collaboration between CPU and GPU is itself heterogeneous computing. Apple's M-series chips integrate CPU, GPU, Neural Engine, and media engines into a single SoC, leveraging the most suitable processing unit for each workload. The CPU handles general-purpose processing, the GPU handles parallel computation, and the NPU handles machine learning inference.

The Evolution of Modern CPUs

Modern CPUs are the cumulative result of decades of architectural innovation. Branch predictors achieve over 99% prediction accuracy using advanced algorithms like TAGE (Tagged Geometric History Length). Out-of-order execution windows simultaneously track hundreds of instructions and execute them as soon as dependencies are resolved. Prefetchers learn memory access patterns and preload needed data into cache.

Speculative execution minimizes pipeline idle time by executing instructions along predicted branch paths before the branch outcome is confirmed. The Spectre and Meltdown vulnerabilities discovered in 2018 demonstrated that speculative execution can have security side effects, but the performance benefits are too significant to abandon it. Instead, modern processors have introduced hardware-level mitigations for speculative execution side channels, seeking a balance between performance and security.

Wrapping Up the Series

This series started from the basic structure of the Von Neumann architecture and progressed through CPU internals, instruction set architectures, pipelining and hazards, privilege levels and interrupts, memory hierarchy, virtual memory, I/O and DMA, and finally multicore and modern processors.

All these concepts are not independent but intimately interconnected. Page table walks in virtual memory depend on cache hierarchy performance, and TLB misses cause pipeline stalls. Multicore cache coherence protocols directly affect how the memory hierarchy behaves, and DMA combined with virtual memory necessitates new protection mechanisms like the IOMMU. Interrupts and privilege levels form the foundation of I/O handling and virtual memory page fault processing.

Understanding computer architecture is ultimately not about knowing individual components in isolation, but understanding how these components interact to function as a unified system. Software developers need this intuition about component interactions to diagnose performance problems, predict system behavior, and make sound design decisions. It is my hope that this series has served as a starting point for developing that intuition.