Why Registers Are Fast and RAM Is Slow

Mike Ash Friday Q&A, Chinese translation: Why Registers Are Fast and RAM Is Slow

By TommyWu

Translation · Original: Friday Q&A 2013-10-11: Why Registers Are Fast and RAM Is Slow · Author: Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html · Published: 2013-10-11 · Author: Mike Ash · Translator: MiMo (mimo-v2.5-pro); code blocks kept in the original English


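In the previous article on ARM64, I mentioned that one advantage of the new architecture is that it has twice as many registers, so code needs to load data from memory (RAM) less often, and memory access is much slower. Reader Daniel Hooper asks the natural question: why is RAM so much slower than registers?

Distance
Let's start with distance. It's not necessarily the biggest factor, but it's the most fun to analyze. RAM is farther from the CPU than the registers are, which can make fetching data from it take longer.

Take a 3GHz processor as an example. Light travels roughly one foot (about 30 cm) per nanosecond. Within a single clock cycle of this processor, light can travel only about 4 inches. That means a round-trip signal can reach a component at most two inches (about 5 cm) away, and that assumes perfect hardware able to transmit information at the speed of light in a vacuum. For a desktop PC, that's pretty significant. For an iPhone it matters much less: the clock speed is much lower (the iPhone 5S runs at 1.3GHz) and the RAM sits right next to the CPU.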
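The distance figures above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than anything from the original article; the function names are made up for illustration, and it assumes the ideal case of signals moving at the vacuum speed of light.

```python
# Speed of light in vacuum, in metres per second (exact by definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def light_travel_cm_per_cycle(clock_hz: float) -> float:
    """Distance light covers in vacuum during one clock cycle, in centimetres."""
    cycle_seconds = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * cycle_seconds * 100.0


def max_round_trip_reach_cm(clock_hz: float) -> float:
    """Farthest a component can be if a signal must go there and back in one cycle."""
    return light_travel_cm_per_cycle(clock_hz) / 2.0


# The two clock speeds mentioned in the text.
for label, hz in (("3 GHz desktop", 3.0e9), ("1.3 GHz iPhone 5S", 1.3e9)):
    one_way = light_travel_cm_per_cycle(hz)
    reach = max_round_trip_reach_cm(hz)
    print(f"{label}: {one_way:.1f} cm per cycle, round-trip reach {reach:.1f} cm")
```

At 3 GHz this gives roughly 10 cm (about 4 inches) per cycle and a round-trip reach of about 5 cm, matching the text. Real signals in copper or silicon travel slower than this upper bound.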

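Cost
Much as we might wish otherwise, cost is always a consideration. In software, when trying to make a program run faster, we don't give the whole program equal attention. Instead, we identify the hotspots that are most critical to performance and give them the most attention. That makes the best use of our limited resources. Hardware is the same. Faster hardware is more expensive, and that expense is best spent where it makes the most difference.

Registers are used extremely frequently, but there aren't many of them. An A7 holds only about 6,000 bits of register data (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). An iPhone 5S has about 8 billion bits (1GB) of RAM. It's worth spending a lot to make each register bit faster. RAM has literally a million times more bits, and those 8 billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.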
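The "about 6,000 bits" and "a million times more" figures follow directly from the register counts given above. Here is a small sketch of that arithmetic; it counts only the two register files named in the text and leaves out the miscellaneous registers, whose sizes the article does not give.

```python
# Register counts and widths as stated in the text for the A7.
GENERAL_PURPOSE_REGISTERS = 32   # each 64 bits wide
FLOATING_POINT_REGISTERS = 32    # each 128 bits wide

# Miscellaneous registers are omitted because the text gives no sizes for them.
register_bits = GENERAL_PURPOSE_REGISTERS * 64 + FLOATING_POINT_REGISTERS * 128

# 1 GB of RAM, taken here as 2**30 bytes of 8 bits each.
ram_bits = 1 * 1024**3 * 8

ratio = ram_bits / register_bits

print(f"register bits: {register_bits:,}")
print(f"RAM bits:      {ram_bits:,}")
print(f"RAM has about {ratio:,.0f} times as many bits")
```

The register total comes to 6,144 bits and the ratio to roughly 1.4 million, which is where the "million times" comparison comes from.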

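Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a very short time for the register hardware to drive the read line to the appropriate state.

Reading a RAM bit, by contrast, is more involved. In the DRAM (dynamic random-access memory) found in any smartphone or PC, a bit consists of one capacitor and one transistor. As you'd expect, these capacitors are extremely small, given that eight billion of them fit in your pocket. That means they hold a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is precharged to a level halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier pushes the charge toward zero or one. Once the charge on the line has been amplified enough, the result can be returned.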
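The precharge-then-share step can be illustrated with a very simplified charge-sharing model. This is a sketch under stated assumptions, not a description of any real DRAM part: the supply voltage and both capacitance values below are hypothetical round numbers chosen only to show how small the voltage swing is.

```python
def bitline_voltage_after_sharing(cell_stores_one: bool,
                                  vdd: float = 1.2,             # hypothetical supply voltage (V)
                                  cell_cap_ff: float = 25.0,    # hypothetical cell capacitance (fF)
                                  bitline_cap_ff: float = 250.0  # hypothetical read-line capacitance (fF)
                                  ) -> float:
    """Voltage on the read line after the cell capacitor is connected to it."""
    precharge = vdd / 2.0                       # line starts halfway between 0 and 1
    cell_voltage = vdd if cell_stores_one else 0.0
    # Charge is conserved when the two capacitors are connected together.
    total_charge = bitline_cap_ff * precharge + cell_cap_ff * cell_voltage
    return total_charge / (bitline_cap_ff + cell_cap_ff)


def sense_amplifier(bitline_voltage: float, vdd: float = 1.2) -> int:
    """Push the small deviation from the precharge level to a full 0 or 1."""
    return 1 if bitline_voltage > vdd / 2.0 else 0


for stored in (True, False):
    v = bitline_voltage_after_sharing(stored)
    print(f"cell stores {int(stored)}: line moves to {v:.3f} V, read as {sense_amplifier(v)}")
```

With these made-up values the line moves only about 0.05 V away from its 0.6 V precharge level, which is why an amplifier is needed before the bit can be reported.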

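Because a RAM bit is just one transistor and one tiny capacitor, it is extremely cheap to manufacture. A register bit contains more parts and therefore costs much more.

There's also a lot more complexity involved just in figuring out which hardware to talk to for RAM, because there is so much more of it. Reading from a register looks like this:

  • Extract the relevant bits from the instruction.
  • Put those bits onto the register file's read lines.
  • Read the result.

Reading from RAM looks like this:

  • Get the pointer to the data being loaded. (That pointer is probably stored in a register, so this step already includes all of the work above!)
  • Send that pointer to the MMU (memory management unit).
  • The MMU translates the virtual address in the pointer to a physical address.
  • Send the physical address to the memory controller.
  • The memory controller figures out which bank of RAM the data is in and sends a request to the RAM.
  • The RAM figures out which particular chunk the data is in and sends a request to that chunk.
  • The previous step may repeat a couple more times, narrowing things down until a single array of cells is located.
  • Load the data from the array.
  • Send the data back to the memory controller.
  • Send the data back to the CPU.
  • Use the data!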
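The address-handling steps in the list above (virtual-to-physical translation, then narrowing down to a bank and a location inside it) can be sketched as follows. Everything concrete here is an assumption for illustration: the page size, the page table contents, and the split of the physical address into row, bank, and column bits are all hypothetical, and real MMUs and memory controllers are far more elaborate.

```python
PAGE_SIZE = 4096  # bytes per page (hypothetical)

# Hypothetical page table: virtual page number -> physical page number.
PAGE_TABLE = {0x10: 0x2A, 0x11: 0x07}


def translate(virtual_address: int) -> int:
    """MMU step: map a virtual address to a physical address."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return PAGE_TABLE[page] * PAGE_SIZE + offset


# Hypothetical DRAM layout: low 10 bits select the column,
# the next 3 bits select the bank, and the remaining bits select the row.
COLUMN_BITS = 10
BANK_BITS = 3


def decode(physical_address: int) -> tuple:
    """Memory controller step: split a physical address into (bank, row, column)."""
    column = physical_address & ((1 << COLUMN_BITS) - 1)
    bank = (physical_address >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = physical_address >> (COLUMN_BITS + BANK_BITS)
    return bank, row, column


pointer = 0x10123                      # value that would come out of a register
physical = translate(pointer)
bank, row, column = decode(physical)
print(f"virtual {pointer:#x} -> physical {physical:#x} -> bank {bank}, row {row}, column {column}")
```

Even in this toy form, a RAM read needs a table lookup and several rounds of bit slicing before any cell is touched, while a register read only needs the register number that is already encoded in the instruction.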

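Whew.

Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU cope with that slowness? A RAM load is a single CPU instruction, but it can take hundreds of CPU cycles to complete. How does the CPU handle this?

First, just how long does a CPU take to execute a single instruction? It's tempting to assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.

Back in the good old days, when men proudly wore their sheepskins and the nation was undefeated in war, this was not a hard question to answer. It wasn't one instruction per cycle, but there was at least some clear correspondence. The Intel 4004, for example, took 8 or 16 clock cycles to execute one instruction, depending on the instruction (translator's note: the Intel 4004 was an early 4-bit CPU; instruction execution in modern processors is far more complex). Nice and simple. Things gradually got more complicated, with widely varying timings for different instructions. Older CPU manuals list how long each instruction takes to execute.

Now? Not so simple.

Along with rising clock rates, there has been a long push to increase the number of instructions that can be executed per clock cycle. In the past, that number was around 0.1 instructions per clock cycle. These days, it's up to around 3 to 4 under ideal conditions. How is that achieved? When a single chip has a billion or more transistors, you can build in a lot of smarts. Although the CPU might execute 3 to 4 instructions per clock cycle, that doesn't mean each instruction takes only a quarter of a clock cycle. Each instruction still takes at least one cycle, and often more. What actually happens is that the CPU keeps multiple instructions in flight at any given time. Each instruction can be broken into stages: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. These stages can all happen on separate clock cycles.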

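On any given CPU clock cycle, the CPU is doing several things at once:

  • Fetching several instructions at once.

  • Decoding a completely different set of instructions.

  • Fetching the data for yet another set of instructions.

  • Performing computations for further instructions.

  • Storing data for further instructions.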
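The overlap described in the list above can be sketched with a toy pipeline model. This is a simplified illustration, not how any particular CPU is built: it assumes exactly five stages named after the steps in the text, exactly one instruction entering per cycle, and no stalls. A real CPU that reaches 3 to 4 instructions per cycle handles several instructions in each stage at once.

```python
# The five stages named in the text, in order.
STAGES = ["fetch", "decode", "read operands", "execute", "write back"]


def stage_occupancy(cycle: int, instruction_count: int) -> dict:
    """Map each stage to the index of the instruction in it at `cycle` (None if idle)."""
    occupancy = {}
    for stage_index, stage in enumerate(STAGES):
        # Instruction i enters fetch at cycle i and advances one stage per cycle.
        instruction = cycle - stage_index
        occupancy[stage] = instruction if 0 <= instruction < instruction_count else None
    return occupancy


def total_cycles(instruction_count: int) -> int:
    """Cycles needed to finish every instruction when one enters per cycle."""
    return instruction_count + len(STAGES) - 1 if instruction_count else 0


print(stage_occupancy(cycle=4, instruction_count=10))
print(f"10 instructions finish in {total_cycles(10)} cycles, "
      f"versus {10 * len(STAGES)} if run one at a time")
```

At cycle 4 every stage is busy with a different instruction, so although each instruction still takes five cycles from start to finish, one completes every cycle once the pipeline is full.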

But, you might ask, how could this possibly work? For example:

add x1, x1, x2
add x1, x1, x3

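These can't possibly execute in parallel like that! The first instruction has to finish before the second can start!

It's true, that can't possibly work. But this is where the smarts come in. The CPU can analyze the instruction stream, figure out which instructions depend on which, and rearrange the order of execution. For example, if an instruction after those two adds doesn't depend on them, the CPU may execute it early, even though it comes later in the instruction stream. The ideal throughput of 3-4 instructions per clock cycle can only be reached in code that contains a lot of independent instructions.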
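The dependency analysis described above can be sketched as a tiny dataflow scheduler. This is an illustrative model under generous assumptions: unlimited execution units, every instruction taking the same number of cycles, and only true read-after-write dependencies mattering (as if register renaming had removed the rest). The third instruction in the example is hypothetical and was added to show an independent instruction; it is not from the original article.

```python
def earliest_start_cycles(program, latency=1):
    """Return the earliest cycle each instruction could start executing.

    `program` is a list of (destination, sources) tuples in program order.
    An instruction can start once all of its source registers are ready.
    """
    ready_at = {}   # register name -> cycle at which its newest value is available
    starts = []
    for dest, sources in program:
        # Registers never written in this program are treated as ready at cycle 0.
        start = max((ready_at.get(src, 0) for src in sources), default=0)
        starts.append(start)
        ready_at[dest] = start + latency
    return starts


program = [
    ("x1", ("x1", "x2")),   # add x1, x1, x2
    ("x1", ("x1", "x3")),   # add x1, x1, x3  (needs the first add's result)
    ("x4", ("x5", "x6")),   # hypothetical independent instruction
]
print(earliest_start_cycles(program))
```

The second add has to wait one cycle for the first, while the independent third instruction can start at cycle 0 alongside the first add, even though it appears last in the instruction stream.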

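What happens when you hit a memory load instruction? First of all, relatively speaking, it is definitely going to take forever. If you're really lucky and the value is in L1 cache, it may take only a few clock cycles. If you're unlucky and it has to go all the way out to main RAM, it can take hundreds of clock cycles. There may be a lot of idle waiting in the meantime. (Translator's note: cache hierarchies and prediction mechanisms in modern CPUs may have improved these latencies considerably.)

The CPU won't just sit there waiting, because that's inefficient. First, it tries to anticipate: it may be able to spot the load instruction in advance, work out what it is going to load, and start the load before it actually executes the instruction. Second, while it waits, it keeps executing other instructions for as long as it can. If there are instructions after the load that don't depend on the data being loaded, they can still be executed. Finally, once it has executed everything it can, and the instructions ahead absolutely depend on data that isn't ready yet, the CPU has no choice but to stall and wait for the data to come back from RAM.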
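How much of a load's latency gets hidden depends on how much independent work is available. The following back-of-the-envelope sketch is only a rough model: the latency and instruction counts in the example are hypothetical, and it ignores real limits such as the size of the CPU's reordering window.

```python
import math


def stall_cycles(load_latency: int, independent_instructions: int, issue_width: int) -> int:
    """Cycles the CPU sits idle waiting for a load, in a very rough model.

    The CPU overlaps the load with independent instructions, executing
    `issue_width` of them per cycle, and stalls for whatever latency is left.
    """
    cycles_of_useful_work = math.ceil(independent_instructions / issue_width)
    return max(0, load_latency - cycles_of_useful_work)


# Hypothetical numbers: a 200-cycle trip to main RAM versus a 4-cycle L1 hit,
# with 30 independent instructions available and 3 executed per cycle.
print("main RAM load:", stall_cycles(200, 30, 3), "stall cycles")
print("L1 cache hit: ", stall_cycles(4, 30, 3), "stall cycles")
```

With these numbers, the independent work fully covers an L1 hit but hides only 10 of the 200 cycles of a main-RAM load, so the CPU still spends most of that time stalled.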

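Conclusion

  • RAM is slow because there is so much of it.
  • That means it has to use cheaper designs, and cheaper generally means slower.
  • Modern CPUs do all sorts of crazy things internally and will happily execute your instruction stream in an order wildly different from how it appears in the code.
  • That means the first thing a CPU does while waiting for a RAM load is run other code.
  • If all else fails, it just stops and waits, and waits, and waits, and waits.

That's all for today. As always, Friday Q&A is driven by reader suggestions, so if there's a topic you'd like to see covered or a question you'd like answered, send it in!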


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html

In the previous article on ARM64, I mentioned that one advantage of the new architecture is the fact that it has twice as many registers, allowing code to load data less often from RAM, which is much slower. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?

Distance

Let’s start with distance. It’s not necessarily a big factor, but it’s the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.

Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a roundtrip signal can only get to a component that’s two inches away or less, and that assumes that the hardware is perfect and able to transmit information at the speed of light in vacuum. For a desktop PC, that’s pretty significant. However, it’s much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.

Cost

Much as we might wish it wasn’t, cost is always a factor. In software, when trying to make a program run fast, we don’t go through the entire program and give it equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it’ll make the most difference.

Registers get used extremely frequently, and there aren’t a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It’s worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.

Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.

Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you’d expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that’s halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier is used to push the charge towards zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.

The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and thereby cost much more.

There’s also a lot more complexity involved just in figuring out what hardware to talk to with RAM because there’s so much more of it. Reading from a register looks like:

  • Extract the relevant bits from the instruction.

  • Put those bits onto the register file’s read lines.

  • Read the result.

Reading from RAM looks like:

  • Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)

  • Send that pointer off to the MMU.

  • The MMU translates the virtual address in the pointer to a physical address.

  • Send the physical address to the memory controller.

  • Memory controller figures out what bank of RAM the data is in and asks the RAM.

  • The RAM figures out which particular chunk the data is in, and asks that chunk.

  • The previous step may repeat a couple more times before narrowing it down to a single array of cells.

  • Load the data from the array.

  • Send it back to the memory controller.

  • Send it back to the CPU.

  • Use it!

Whew.

Dealing With Slow RAM

That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, but it can take potentially hundreds of CPU cycles to complete. How does the CPU deal with this?

First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.

Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn’t one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.

Now? Not so simple.

Along with increasing clock rates, there’s also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 of an instruction per clock cycle. These days, it’s up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn’t mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles.

On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:

  • Fetching potentially several instructions at once.

  • Decoding potentially a completely different set of instructions.

  • Fetching the data for potentially yet another different set of instructions.

  • Performing computations for yet more instructions.

  • Storing data for yet more instructions.

But, you say, how could this possibly work? For example:

add x1, x1, x2
add x1, x1, x3

These can’t possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!

It’s true, that can’t possibly work. That’s where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn’t depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.

What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you’re really lucky and the value is in L1 cache, it’ll only take a few cycles. If you’re unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.

The CPU will try not to twiddle its thumbs, because that’s inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it’s going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don’t depend on the data being loaded, they can still be executed. Finally, once it’s executed everything it can and it absolutely cannot proceed any further without that data it’s waiting on, it has little choice but to stall and wait for the data to come back from RAM.

Conclusion

  • RAM is slow because there’s a ton of it.

  • That means you have to use designs that are cheaper, and cheaper means slower.

  • Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that’s wildly different from how it appears in the code.

  • That means that the first thing a CPU does while waiting for a RAM load is run other code.

  • If all else fails, it’ll just stop and wait, and wait, and wait, and wait.

That wraps things up for today. As always, Friday Q&A is driven by reader suggestions, so if you have a topic you’d like to see covered or a question you’d like to see answered, send it in!