读取一字节内存时发生了什么

文章發布時間 2012年12月28日

作者 TommyWu

標籤

译文 · 原文： Friday Q&A 2012-12-28: What Happens When You Load a Byte of Memory · 作者 Mike Ash

原文：https://www.mikeash.com/pyblog/friday-qa-2012-12-28-what-happens-when-you-load-a-byte-of-memory.html 发布：2012-12-28　作者：Mike Ash 译者：MiMo（mimo-v2.5-pro）；代码块保留英文原样

我们应用运行所在的硬件和软件其复杂程度令人咋舌，而当系统加载内存数据时所经历的那些曲折过程，正是这种复杂性的最佳体现。当我们加载一个字节的内存数据时，究竟发生了什么？博客的读者兼朋友 Guy English 建议我专门写一篇文章来解答这个问题。

代码让我们先从加载内存字节的代码说起。在 C 语言中，代码大致是这样的：

1
    char *addr = ...;
2
    char value = *addr;

在 x86-64 平台上，该代码被编译为大致类似于：

1
    movsbl (%rdi),%eax

这指示 CPU 将存储在 % rdi 地址中的字节加载到 % eax 寄存器中。在 ARM 架构上，编译器生成的代码则是：

1
    ldrsb.w r0, [r0]

虽然指令名称不同，但其效果基本相同。该指令从 r0 存储的地址处加载一个字节，并将该值存入 r0。（编译器在此复用了 r0 寄存器，因为地址已不再需要。）

现在 CPU 已获得指令，软件部分的任务就完成了。嗯，或许如此。

指令解码与执行
我不想过于深入探讨 CPU 如何执行代码的一般机制。简而言之，CPU 从内存加载上述指令，对其进行解码以确定操作码（opcode）和操作数。一旦识别出这是一条加载指令，它就会在相应地址发出内存加载请求。

虚拟内存
在当今你可能为其编程的大多数硬件上，以及在过去数十年的所有苹果平台上，系统都使用虚拟内存（virtual memory）。简而言之，虚拟内存将你的程序所看到的内存地址与计算机实际 RAM 的物理内存地址分离开来。换句话说，当你的程序访问地址 42 时，它实际上可能访问的是物理 RAM 地址 977305。

这种映射是以页面（page）为单位进行的。每个页面是一块 4kB 大小的内存块。如果对内存中每个字节都跟踪虚拟地址映射，开销会过于巨大，因此改为按页面进行映射。页面大小足以提供合理的粒度，同时又足够小，不会给映射维护带来过大负担。

现代虚拟内存系统还具备为页面设置权限的能力。一个页面可以是可读、可写、可执行的，或是这些属性的某种组合。如果程序试图对某个页面进行其权限不允许的操作，或者尝试访问根本没有映射的页面，程序会被暂停，并向操作系统引发一个错误（fault）。随后操作系统可以采取进一步行动，例如终止程序并生成崩溃报告 —— 这就是您常见的 EXC_BAD_ACCESS 错误发生时的过程。

负责这项工作的硬件称为内存管理单元（Memory Management Unit，简称 MMU）。MMU 拦截所有内存访问，并根据当前的页面映射重新映射地址。

当 CPU 加载一个字节的内存时，首要步骤是将地址传递给内存管理单元（MMU）进行地址转换。（并非总是如此。在某些 CPU 上，在 MMU 之前还有一层缓存。然而，总体原理保持不变。）

MMU 处理地址时首先截取最低 12 位，剩下的是一个纯页地址。2 的 12 次方等于 4096，因此最低 12 位描述了该地址在其所属页内的偏移量。一旦地址的剩余部分被重新映射，最低 12 位就可以加上去生成完整的物理地址。

获得页地址后，MMU 会查询转换后备缓冲区（TLB）。TLB 是页映射的缓存。如果相关的页最近被访问过，TLB 会记住该映射，并迅速返回物理页地址。至此，MMU 的工作便完成了。

当 TLB 中没有对应页表项时，称为 TLB 未命中（TLB miss），此时必须通过搜索整个页表来查找对应项。页表是内存中的一段区域，用于描述当前进程中所有的页面映射关系。最常见的处理方式是，操作系统以特殊格式将页表布局在内存中，使内存管理单元（MMU）能直接解析。发生 TLB 未命中后，MMU 会搜索页表寻找对应表项。若找到，则将其加载至 TLB 并执行地址重映射。

在某些架构上，页表映射完全由操作系统负责处理。当发生 TLB 未命中时，CPU 将控制权转交给操作系统，由后者查找映射关系并填充 TLB。这种方式更灵活但速度慢得多，在现代硬件中已较为少见。（译注：现代 x86 / ARM 架构仍广泛采用硬件页表遍历机制）

如果在页表（page table）中未找到对应条目，则表示该地址在 RAM 中根本不存在。CPU 会通知操作系统，由操作系统决定如何处理此情况。若操作系统判定该地址无效，将终止程序并触发 EXC_BAD_ACCESS 错误。某些情况下，操作系统认为该地址有效，但相关数据未加载至 RAM。例如：数据已被换出至磁盘、属于内存映射文件（memory mapped file）的一部分，或是新分配的内存按需提供后备存储。此时操作系统会将相应数据加载至 RAM，在页表中添加条目，随后 MMU（内存管理单元）即可在后备数据就绪后完成虚拟地址（virtual address）到物理地址的转换。

Cache（缓存） 获得地址后，CPU 会查询自身内存缓存（memory cache）。过去 CPU 直接与 RAM 通信，但如今 CPU 速度增速远超内存速度，这种直接访问已不切实际。若现代 CPU 每次访问内存都直接与 RAM 交互，计算机性能将严重下降。

缓存是一种将一组内存地址映射到内存内容的硬件机制。缓存被组织成缓存行（cache line），每条缓存行的大小通常在 32-128 字节之间。缓存中的每个条目都包含一个地址以及与该地址对应的一条缓存行。当从缓存加载数据时，系统会检查请求的地址是否存在于缓存中；若存在，则从该地址的缓存行中返回相应数据。

缓存通常分为多个层级。由于硬件设计的限制，容量越大的缓存速度必然越慢。通过采用多级缓存设计，可以优先检查小型高速缓存，随后再访问较慢的大型缓存，从而避免从随机存取存储器（RAM）直接获取数据的开销。CPU 首先查询的是作为第一级缓存的 L1 缓存。这类缓存容量较小，通常约为 16-64KB。如果 L1 缓存中包含了所需数据，内存加载操作就完成了！不过这样的情况比较理想化，我们这里假设所有缓存都未命中（cache miss），即需要的数据不在缓存中。

接下来是L2 cache（二级缓存）。它的容量更大，通常从 256KB 到数兆字节不等。在某些 CPU 中，L2 缓存是最后一级缓存，因此通常具有更大的容量。其他 CPU 还配备L3 cache（三级缓存），这种情况下 L2 缓存通常较小，并由大型 L3 缓存补充 ——L3 缓存容量通常达数兆字节，部分高性能芯片的 L3 缓存甚至可达 20MB。

当所有级别的缓存都尝试过后，若均未包含所需数据，就需要访问主存（main memory）。由于缓存按完整 ** 缓存行（cache line）** 工作，即使只加载单个字节，整个缓存行也会被一次性从主存加载。这在常见访问邻近内存的场景中大幅提升了效率 —— 因为后续的邻近加载可直接从缓存获取，但代价是在内存使用分散的情况下会浪费时间。

内存现在终于开始查询 **RAM（随机存取存储器）** 了。此时 CPU 已等待了相当一段时间，而且在获取所需数据前还将继续等待更久。

加载操作被移交给 memory controller（内存控制器），这是实际知道如何与 RAM（随机存取存储器）通信的硬件部分。在许多现代硬件上，内存控制器被直接集成到 CPU（中央处理器）中，而在一些系统上，它是一个名为 northbridge（北桥）的独立芯片的一部分（译注：现代系统中，北桥可能已集成或被淘汰）。

然后，内存控制器开始从 RAM 加载数据。现代 SDRAM（同步动态随机存取存储器）一次传输 64 位数据，因此需要多次传输才能填满请求的整个 cache line（缓存行）。

内存控制器将 load address（加载地址）放置在 RAM 的 address pins（地址引脚）上，并等待数据返回。在内部，RAM 使用地址引脚上的值来激活一行 memory cells（内存单元），其内容随后暴露在 RAM 的 output pins（输出引脚）上。

RAM 不是瞬时的，在内存控制器请求地址和数据可用之间存在明显的延迟，当前硬件中大约为 10 纳秒。执行缓存行所需的后续加载需要更多时间，但加载可以 pipelined（流水线化），因此总传输时间可能增加 50%。

当内存控制器（memory controller）从 RAM 获取数据时，会将其交还给缓存（cache），缓存会存储这些数据，以防同一缓存行（cache line）中的其他数据很快被再次需要。最终，被请求的字节会传递给 CPU，CPU 会将该数据放入指令所请求的寄存器（register）中。在经历了所有这些工作之后，CPU 终于可以继续运行需要该字节数据的代码。

影响
这一切的工作方式带来了许多实际影响。特别是，相对而言，内存访问（memory access）是缓慢的。你的计算机能够每秒执行上述操作数千万次，这本身令人惊叹，但它每秒却能执行其他操作数十亿次。一切都是相对的。

假设 TLB（Translation Lookaside Buffer，地址转换后备缓冲器，是 MMU 即内存管理单元中的一种快速缓存）命中，完成上述所有步骤所需的总时间大约为几十纳秒（nanoseconds）。在一个 2GHz 的 CPU 上，这可能相当于大约 50 个时钟周期（clock cycles），在这段时间内理论上可能执行约 150 条指令。这已经很可观了。一次 TLB 未命中可能会使这个延迟时间增加一倍或两倍。

现代 CPU 采用流水线技术和并行化设计。这意味着它们很可能会提前发现内存读取需求，并提前启动加载操作，从而减轻延迟带来的影响。并行执行则意味着 CPU 在等待内存加载时很可能继续执行加载指令之后的代码，特别是那些不依赖加载值的指令。然而，这种能力存在限制，要在等待 RAM 期间找到 150 条可执行的指令是相当困难的。你几乎必然会遇到程序执行必须暂停、等待内存加载完成的情况。

顺便一提，这正是超线程（hyperthreading）技术的优势所在。超线程不让整个 CPU 核心在等待 RAM 时空闲，而是让它切换到完全不同的执行线程（thread），运行来自该线程的代码，从而在等待期间仍能完成有用的工作。

访问模式是性能的关键。关于微观优化（micro-optimization）的讨论往往集中在使用某些特定指令而非其他指令、避免除法运算等方面。相对而言，讨论内存访问模式的内容却较少。然而，如果内存的加载方式对内存系统不友好，那么无论你的单条指令优化得多么精良也无济于事。如果每加载一块新数据都需要等待数十个时钟周期，那么这里节省几个周期、那里节省几个周期就变得毫无意义。例如，正因如此，尽管以下方式更直观自然，但你绝不应该这样编写循环来访问图像数据：

1
    for(int x = 0; x < width; x++)
2
        for(int y = 0; y < height; y++)
3
            // use the pixel at x, y

图像通常按连续行排列，而此循环并未利用这一特性。它遍历列，直到加载完整个第一列后才回到第一行的下一个像素。这会导致缓存（cache）和 TLB（翻译后备缓冲器）未命中。若改为先遍历行再遍历列，此循环的速度将大幅提升：

1
    for(int y = 0; y < height; y++)
2
        for(int x = 0; x < width; x++)
3
            // use the pixel at x, y

在许多情况下，包含慢速代码的底层循环的性能会远超包含快速代码的顶层循环，原因很简单：内存访问延迟带来的代价可能极其高昂。

更糟糕的是，诸如 Instruments 中的 Time Profiler 之类的性能分析器，并不擅长揭示这些延迟。它们会告诉你哪些指令消耗了时间，但由于现代 CPU 具有流水线化和并行的特性，实际承受内存加载代价的指令，可能并不是那条加载指令本身。CPU 会遇到加载指令，将其目标寄存器（destination register）标记为数据尚未就绪，然后继续执行。当 CPU 遇到一条真正需要使用该寄存器值的指令时，它才会停下来等待。这里的线索在于，当对同一值进行连续操作的序列中，第一条指令所花的时间远超后续指令，也远超其应有的耗时。例如，如果你的代码执行了加载、加法、乘法、加法操作，而性能分析器显示第一次加法占用了绝大部分时间，那么这很可能源于内存延迟，而并非真的是一条缓慢的加法指令。

结论现代计算机在难以想象的时间尺度上运行。对人类来说，单个 CPU 周期所需的时间和执行一次硬盘寻道所需的时间都近乎瞬间完成，但它们之间却相差了多个数量级。计算机是一个极其复杂的系统，为了从内存中加载一个数据块，需要完成大量操作。了解在此过程中硬件层面发生的一切是引人入胜的，甚至能帮助我们写出更好的代码。更令人惊叹的是，当你意识到这套复杂的操作，就在你阅读这段文字所用的计算机中，每秒真实地发生着数百万次。

今天的分享就到这里。下次再见，我们将继续探索更多超乎寻常的技术领域。如果您还不知道的话，Friday Q & A 的内容由读者投稿驱动。这里的” 读者” 就是指您，所以，如果您有任何希望看到的话题，请发送给我们。

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2012-12-28-what-happens-when-you-load-a-byte-of-memory.html

The hardware and software that our apps run on is almost frighteningly complicated, and there’s no better place to see that than in the contortions that the system goes through when we load data from memory. What exactly happens when we load a byte of memory? Reader and friend of the blog Guy English suggested I dedicate an article to answering that question.

CodeLet’s start with the code that loads the byte of memory. In C, it would look something like this:

1
    char *addr = ...;
2
    char value = *addr;

On x86-64, this compiles to something like:

1
    movsbl (%rdi),%eax

This instructs the CPU to load the byte located at the address stored in %rdi into the %eax register. On ARM, the compiler produces:

1
    ldrsb.w r0, [r0]

Although the instruction name is different, the effect is basically the same. It loads the byte located at the address stored in r0, and puts the value into r0. (The compiler is reusing r0 here, since the address isn’t needed anymore.)

Now that the CPU has its instruction, the software is done. Well, maybe.

Instruction Decoding and ExecutionI don’t want to go too in depth with how the CPU actually executes code in general. In short, the CPU loads the above instruction from memory and decodes it to figure out the opcode and operands. Once it sees that it’s a load instruction, it issues the memory load at the appropriate address.

Virtual MemoryOn most hardware you’re likely to program for today, and on any Apple platform from the past couple of decades, the system uses virtual memory. In short, virtual memory disconnects the memory addresses seen by your program from the physical memory addresses of the actual RAM in your computer. In other words, when your program accesses address 42, that might actually access the physical RAM address 977305.

This mapping is done by page. Each page is a 4kB chunk of memory. The overhead of tracking virtual address mappings for every byte in memory would be far too great, so pages are mapped instead. They’re small enough to provide decent granularity, but large enough to not incur too much overhead in maintaining the mapping.

Modern virtual memory systems also have the ability to set permissions on a page. A page may be readable, writeable, or executable, or some combination thereof. If the program tries to do something with a page that it isn’t allowed, or tries to access a page that has no mapping at all, the program is suspended and a fault is raised with the operating system. The OS can then take further action, such as killing the program and generating a crash report, which is what happens when you experience the common EXC_BAD_ACCESS error.

The hardware that handles this work is called the Memory Management Unit, or MMU. The MMU intercepts all memory accesses and remaps the address according to the current page mappings.

The first thing that happens when the CPU loads a byte of memory is to hand the address to the MMU for translation. (This is not always true. On some CPUs, there is a layer of cache that comes before the MMU. However, the overall principle remains.)

The first thing the MMU does with the address is slice off the bottom 12 bits, leaving a plain page address. 212 equals 4096, so the bottom 12 bits describe the address’s location within its page. Once the rest of the address is remapped, the bottom 12 bits can be added on to generate the full physical address.

With the page address in hand, the MMU consults the Translation Lookaside Buffer, or TLB. The TLB is a cache for page mappings. If the page in question has been accessed recently, the TLB will remember the mapping, and quickly return the physical page address, at which point the MMU’s work is done.

When the TLB does not contain an entry for the given page, this is called a TLB miss, and the entry must be found by searching the entire page table. The page table is a chunk of memory that describes every page mapping in the current process. Most commonly, the page table is laid out in memory by the OS in a special format that the MMU can understand directly. Following a TLB miss, the MMU searches the page table for the appropriate entry. If it finds one, it loads it into the TLB and performs the remapping.

On some architectures, the page table mapping is left entirely up to the OS. When a TLB miss occurs, the CPU passes control to the OS, which is then responsible for looking up the mapping and filling the TLB with it. This is more flexible but much slower, and isn’t found much in modern hardware.

If no entry is found in the page table, that means the given address doesn’t exist in RAM at all. The CPU informs the OS, which then decides how to handle the situation. If the OS doesn’t think that address is valid, it terminates the program and you get an EXC_BAD_ACCESS. In some cases, the OS does think the address is valid, but just doesn’t have the data in RAM. This can happen if the data has been swapped out to disk, is part of a memory mapped file, or is freshly allocated with backing memory being provided on demand. In these cases, the OS loads the appropriate data into RAM, adds an entry to the page table, and then lets the MMU translate the virtual address into a physical address now that the backing data is available.

CacheWith the address in hand, the CPU consults its memory cache. In days of yore, the CPU would talk directly to RAM. However, CPU speeds have increased faster than memory speeds, and that’s no longer practical. If a modern CPU had to talk directly to modern RAM for every memory access, our computers would slow to a relative crawl.

The cache is a hardware map from a set of memory addresses to memory contents. Caches are organized into cache lines, which are typically in the region of 32-128 bytes each. Each entry in the cache holds an address and a single cache line corresponding to that address. When loading data from the cache, it checks to see if the requested address exists in the cache, and if so, returns the appropriate data from that address’s cache line.

There are typically several levels of cache. Due to hardware design constraints, larger caches are necessarily slower. By having multiple levels, a small, fast cache can be checked first, with slower, larger caches used later to avoid the cost of fetching from RAM. The CPU first checks with the L1 cache, which is the first level. This cache is small, typically around 16-64kB. If it contains the data in question, then the memory load is complete! Since that’s boring, we’ll assume the caches don’t contain the data being loaded here.

Next up is the L2 cache. This is bigger, generally anywhere from 256kB to several megabytes. In some CPUs, the L2 cache is the last level, and these typically have larger L2 caches. Other CPUs have an L3 cache as well, in which case the L2 is usually smaller, and it’s supplemented by a large L3 cache, usually several megabytes, with some high performance chips having up to 20MB of L3 cache.

Once all levels of cache have been tried, if none of them contain the necessary data, it’s time to try main memory. Because caches work with entire cache lines, the entire cache line is loaded from main memory at once, even though we’re only loading a single byte. This greatly increases efficiency in the common case of accessing other nearby memory, since subsequent nearby loads can come from cache, at the cost of wasting time when memory use is scattered.

MemoryIt’s finally time to start querying RAM. The CPU has been waiting quite a while by this point, and will have to wait a long time more before it gets the data it wants.

The load is handed off to the memory controller, which is the bit of hardware that actually knows how to talk to RAM. On a lot of modern hardware, the memory controller is integrated directly into the CPU, while on some systems it’s part of a separate chip called the “northbridge”.

The memory controller then starts loading data from RAM. Modern SDRAM transfers 64 bits of data at a time, so several transfers have to be done to fill the entire cache line being requested.

The memory controller places the load address on the address pins of the RAM and waits for the data to be returned. Internally, the RAM uses the values on the address pins to activate a row of memory cells, whose contents are then exposed on the RAM’s output pins.

RAM is not instantaneous, and there’s an appreciable delay between when the memory controller requests an address and when the data is available, on the order of 10 nanoseconds in current hardware. It takes more time to perform the subsequent loads needed for the cache line, but the loads can be pipelined, so total transfer time is maybe 50% more.

As the memory controller obtains data from RAM, it hands that data back to the caches, which store it in case other data from the same cache line is needed soon. Finally, the requested byte is handed to the CPU, which places the data into the register requested by the instruction. At last, after all of this work, the CPU can get on with running the code that needed that byte of data.

ConsequencesThere are a lot of practical consequences that result from how all of this stuff works. In particular, memory acccess is slow, relatively speaking. It’s amazing that your computer can do all of the above work literally tens of millions of times per second, but it can do other things literally billions of times per second. Everything is relative.

The total time required for all of this, assuming a TLB hit (the fast case for the MMU) is a couple of dozen nanoseconds. On a 2GHz CPU, that could mean something like 50 clock cycles with the potential to execute perhaps 150 instructions in that time. That’s a lot. A TLB miss may double or triple this latency number.

Modern CPUs are pipelined and parallelized. This means that they will likely see the need for the memory read ahead of time and initiate the load at that point, softening the blow. Parallel execution means that the CPU will probably be able to continue executing some code after the load instruction while waiting for the load, especially code that doesn’t depend on the loaded value. However, this stuff has limits, and finding 150 instructions that can be executed while waiting for RAM is a tall order. You’re almost certain to hit a point where program execution has to stop and wait for the memory load to complete.

Incidentally, this is where hyperthreading gains its advantage. Instead of having an entire CPU core just idle while waiting for RAM, hyperthreading lets it switch over to a completely different thread of execution and run code from that instead, so that it can still get useful work done while it waits.

Access patterns are key to performance. Discussions about micro-optimization tend to center on using some instructions rather than others, avoiding divisions, etc. Relatively few talk about memory access patterns. However, it doesn’t matter how optimized your individual instructions are if they’re operating on memory that’s loaded in a way that isn’t kind to the memory system. Saving a few cycles here and there is meaningless if you’re waiting dozens of cycles for every new piece of data to load. For example, this is why, although it’s the more natural way to express it, you should never write loops to access image data like this:

1
    for(int x = 0; x < width; x++)
2
        for(int y = 0; y < height; y++)
3
            // use the pixel at x, y

Images are typically laid out in contiguous rows, and this loop does not take advantage of that fact. It accesses columns, only coming back to the next pixel in the first row after loading the entire first column. This causes cache and TLB misses. This loop will be vastly slower than if you iterate over rows first, then columns:

1
    for(int y = 0; y < height; y++)
2
        for(int x = 0; x < width; x++)
3
            // use the pixel at x, y

In many cases, the top loop with fast code in the loop body will be massively outperformed by the bottom loop with slow code in the loop body, simply because memory access delays can be so punishing.

To make things even worse, profilers, such as Apple’s Time Profiler in Instruments, aren’t good at showing these delays. They’ll tell you what instructions took time, but because of the pipelined, parallel nature of modern CPUs, the instruction that takes the hit of the memory load may not be the actual load instruction. The CPU will hit the load instruction, mark its destination register as not having its data yet, and move on. When the CPU hits an instruction that actually needs that register’s value, then it will stop and wait. The clue here is when the first instruction in a sequence of manipulations on the same value takes far longer than the rest, and far longer than it should. For example, if you have code that does load, add, mul, add, and the profiler says that the first add takes the vast majority of the time, this is likely to be a memory delay, not actually a slow add.

ConclusionModern computers operate on time scales that are difficult to envision. To a human, the time required for a single CPU cycle and the time required to perform a hard disk seek are both indistingusihably instantaneous, yet they vary by many orders of magnitude. The computer is an incredibly complicated system that requires a huge number of things to happen in order to load a single chunk of data from memory. Knowing what goes on in the hardware when this happens is fascinating and can even help write better code. It’s even more incredible once you think that this complicated set of operations happens literally millions of times every second in the computer you’re using to read this.

That’s it for today. Check back next time for another exploration of the trans-mundane. If you somehow didn’t already know, Friday Q&A is driven by reader submissions. By “reader” I mean you, so if you have a topic that you’d like to see covered, please send it in.