ARM64 and You | TommyWu's Lab

Published: September 27, 2013

Author: TommyWu

Translation · Original: Friday Q&A 2013-09-27: ARM64 and You · Author: Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html Published: 2013-09-27 Author: Mike Ash Translator: MiMo (mimo-v2.5-pro); code blocks are kept in the original English

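Ever since the iPhone 5S was announced a couple of weeks ago, tech journalism has been flooded with misinformation. Unfortunately, accurate information takes time, and tech journalism values speed over accuracy. Today, at the suggestion of many readers, I’m going to explain what the 64-bit ARM processor in the iPhone 5S means for you in terms of performance, capabilities, and development.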

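“64-bit”

Let’s start with the general term “64-bit” and what it means. There is a lot of confusion around it, largely because there is no single agreed-upon definition. Still, there is a general consensus about what it covers.

The “X-bit” label on a CPU usually refers to two things: the width of the integer registers and the width of pointers. Fortunately, on most modern CPUs these two widths are the same. “64-bit” therefore typically means that the CPU has 64-bit integer registers and 64-bit pointers.

It’s just as important to point out what “64-bit” does not cover, since there are plenty of misconceptions here too. In particular, “64-bit” does not include:

Physical RAM address size. The number of bits actually used to talk to RAM (and therefore the amount of RAM the hardware can support) is decoupled from the question of CPU bitness. ARM CPUs have ranged from 26 to 40 address bits, and this can be changed independently of everything else.
Data bus size. The amount of data fetched from RAM or cache is likewise decoupled. An individual CPU instruction may request a certain amount of data, but the amount actually fetched can differ, either because the fetch is split into smaller parts or because more is fetched than was asked for. The iPhone 5 already fetches data from memory in 64-bit chunks, and chunk sizes of up to 192 bits exist in the PC world.
Anything related to floating point. FPU register size and internal design are independent, and ARM CPUs had 64-bit floating-point registers well before ARM64.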

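Generic Advantages and Disadvantages
If we compare otherwise-identical 32-bit and 64-bit CPUs, the difference is not large, which is a big part of the confusion about the significance of Apple’s move to 64-bit ARM. The move is important, but mostly because of specifics of the ARM processor and the way Apple uses it.

Still, there are some differences. Perhaps the most obvious is that 64-bit integer registers make it more efficient to work with 64-bit integers. A 32-bit processor can still operate on 64-bit integers, but it typically has to handle them as two 32-bit pieces, so arithmetic takes substantially longer. A 64-bit CPU can typically perform arithmetic on 64-bit quantities just as fast as on 32-bit quantities, so code that manipulates 64-bit integers heavily will run much faster.

Although 64-bit has no direct bearing on the amount of RAM the CPU itself can use, it makes it much easier for a single program to use large amounts of RAM. A single program running on a 32-bit CPU has only 4GB of address space. Part of that address space is taken up by the operating system, standard libraries, and so on, typically leaving 1-3GB for the program. If a 32-bit system has more than 4GB of RAM, taking advantage of all of it from one program is hard. You have to resort to tricks such as asking the operating system to map chunks of memory in and out of your process as needed, or splitting your program into multiple processes. This takes a lot of extra programming effort and can hurt performance, so few programs actually do it. In practice, a 32-bit CPU limits each program to 1-3GB of RAM, and the advantage of having more RAM is the ability to run several such programs at once and to cache more data from disk. That is still useful, but there are cases where a single program needs to use more RAM.

The larger address space is also useful on systems without that much memory. Memory-mapped files are a handy construct in which the contents of a file are logically mapped into a process’s memory space, even though physical RAM is not necessarily allocated for the whole file. On a 32-bit system, a program cannot reliably memory-map large files (say, more than a few hundred megabytes). On a 64-bit system the available address space is far larger, so there is no concern about running out.

The larger pointer size comes with a substantial downside: otherwise-identical programs use more memory, perhaps a lot more, when running on a 64-bit CPU. Pointers have to be stored in memory too, and each pointer takes twice as much space. Pointers are very common in most programs, so this can make a considerable difference. Increased memory usage puts more pressure on caches, which reduces performance.

In short: 64-bit can increase performance for certain types of code and makes programming techniques such as memory-mapped files more viable. However, it can also decrease performance because of increased memory usage.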

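ARM64

The iPhone 5S’s 64-bit CPU is not merely a regular ARM processor with wider registers. The 64-bit ARM architecture includes substantial changes from the 32-bit version.

First, a note on the name: ARM’s official name is “AArch64”, but that is a silly name that pains me to type. Apple calls it ARM64, and so will I.

ARM64 doubles the number of integer registers compared with 32-bit ARM. 32-bit ARM provides 16 integer registers, of which one is a dedicated program counter, two more serve as the stack pointer and link register, and the remaining 13 are available for general use. ARM64 has 32 integer registers, including a dedicated zero register, a link register, and a frame pointer register. One further register is reserved for the platform, leaving 28 general purpose integer registers.

ARM64 also increases the number of floating-point registers available. The floating-point registers on 32-bit ARM are a bit odd, so a direct comparison is difficult. It has 32 32-bit floating-point registers, which can also be viewed as 16 overlapped 64-bit registers, plus 16 additional independent 64-bit registers. The 32 total 64-bit registers can in turn be viewed as 16 overlapped 128-bit registers. ARM64 simplifies this to 32 128-bit registers, which can also hold smaller data types, with no overlapping.

The register count can strongly influence performance. Memory is extremely slow compared to the CPU, and reading or writing memory takes far longer than processing an instruction. CPUs try to hide this latency with layers of caches, but even the fastest cache level is slow compared to the CPU’s internal registers. More registers mean more data can stay entirely inside the CPU, reducing memory accesses and increasing performance.

How much of a difference this makes depends on the specific code and on how good the compiler is at optimizing it to make the best use of the available registers. When the Intel architecture moved from 32-bit to 64-bit, the number of registers doubled from 8 to 16, which produced a substantial performance improvement. ARM already had considerably more registers than 32-bit Intel, so the impact of the additional registers is smaller, but it is still a helpful change.

Beyond the larger register count, ARM64 also brings some significant changes to the instruction set.

Most 32-bit ARM instructions can be executed conditionally based on the state of a condition register at the time of execution. This allows if statements and similar constructs to be compiled without branching. It was intended to increase performance, but it must have caused more trouble than it was worth, as ARM64 eliminates conditional execution for nearly all instructions, keeping only conditional branches and a small set of conditional select and compare instructions.

ARM64’s NEON SIMD unit provides full IEEE754 double-precision support, whereas the 32-bit version of NEON supports only single precision and leaves out some of the harder, more obscure parts of IEEE754.

ARM64 adds specialized instructions for AES encryption and for the SHA-1 and SHA-256 cryptographic hashes. This is not important in general, but it can be a big win if you happen to be doing those things.

Overall, by far the most important changes are the greatly increased number of general-purpose registers and the support for full IEEE754-compliant double-precision arithmetic in NEON. These changes could allow considerable performance gains in a lot of code.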

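32-bit Compatibility

It’s important to note that the A7 includes a full 32-bit compatibility mode that runs normal 32-bit ARM code without any changes and without emulation. This means the iPhone 5S runs old iPhone apps with no problems and with no performance penalty compared to other hardware. 32-bit code may still run somewhat slower than 64-bit code, since it gets none of the advantages of ARM64.

Apple Runtime Changes

Apple takes advantage of architecture changes like this one to make changes in its own libraries. Since binary compatibility does not have to be maintained across such a change, it is a good time to make changes that would otherwise break existing apps.

In Mac OS X 10.7, Apple introduced tagged pointers. Tagged pointers allow instances of certain classes with small amounts of per-instance data to be stored entirely within the pointer. This can eliminate memory allocations for many uses of classes like NSNumber and can provide a good performance boost. Tagged pointers were supported only on 64-bit, partly because of binary compatibility concerns, and partly because a 32-bit pointer does not leave much room for actual data once the tag bits are accounted for. Presumably for that reason, iOS never got tagged pointers. On ARM64, however, the Objective-C runtime includes tagged pointers, with all of the same benefits they brought to the Mac.

Although pointers are 64 bits wide, not all of those bits are actually used. Mac OS X on x86-64, for example, uses only 47 bits of a pointer. iOS on ARM64 uses even fewer, currently only 33 bits. As long as the extra bits are masked off before the pointer is used, they can store other data. This leads to one of the most significant internal changes to the Objective-C runtime in the language’s history.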

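Repurposed isa Pointer

Much of the information in this section comes from Greg Parker’s article on the relevant changes. Check it out for information straight from the source.

First, a quick refresher: Objective-C objects are contiguous chunks of memory. The first pointer-sized piece of that memory is the isa. Traditionally, the isa is a pointer to the object’s class. For more information on how objects are laid out in memory, see my article on the Objective-C runtime.

Using an entire pointer-sized piece of memory for the isa pointer is a bit wasteful, especially on 64-bit CPUs, which do not use all 64 bits of a pointer. ARM64 running iOS currently uses only 33 bits of a pointer, leaving 31 bits for other purposes. Class pointers are also aligned, meaning that a class pointer is guaranteed to be divisible by 8, which frees up another three bits, for a total of 34 bits of the isa available for other uses. Apple’s ARM64 runtime takes advantage of this for some great performance improvements.

Probably the most important performance improvement is the inline reference count. Nearly all Objective-C objects are reference counted (the exceptions being constant objects such as NSString literals), and the retain/release operations that modify the reference count happen extremely frequently. This is especially true under ARC, which emits even more retain/release calls than a typical human programmer would. High-performance retain and release are therefore critical.

Traditionally, the reference count is not stored in the object itself. If the isa is the only field every object shares, there is simply no room for any additional data. It would be possible to give every object a reference count field, but that would use a great deal more memory. This matters less today, but it was a pretty big deal in the earlier days of Objective-C. Because of this, the retain count is stored in an external table.

Any time an object is retained, the runtime goes through this procedure:

Fetch a global retain count hash table.
Lock the table to make the operation thread safe.
Look up the retain count of the object in the table.
Increment the count and store the new value back in the table.
Release the table lock.

This is a bit slow! The hash table implementation used for tracking retain counts is fast, for a hash table, but even the best hash tables are slow compared to direct memory access.

On ARM64, 19 bits of the isa field hold the object’s reference count inline. That means the procedure for retaining an object simplifies to:

Perform an atomic increment of the correct portion of the isa field.

And that’s it! This should be much, much faster.

There is a bit more to it than that, because some corner cases need to be handled. The real code looks more like this:

The bottom bit of the isa indicates whether all this extra data is active for this class. If it is not, fall back to the old hash table approach. This provides a compatibility mode for classes that fall outside the representable range, and for programs that incorrectly assume the isa is a pure class pointer.
If the object is currently deallocating, do nothing.
Increment the retain count, but don’t store it back into the isa just yet.
If it overflowed (unusual, but a real possibility with only 19 bits available), fall back to a hash table.
Perform an atomic store of the new isa value.

Most of this was necessary with the old approach as well, and it doesn’t add much overhead. The new approach should still be much, much faster.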

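Several other performance improvements are stuffed into the remaining free bits, and they make deallocating objects faster. There is potentially a lot of cleanup to do when an Objective-C object deallocates, and being able to skip unnecessary cleanup increases performance. The flags are:

Whether the object ever had any associated objects set with objc_setAssociatedObject. If not, associated objects don’t need to be cleaned up.
Whether the object has a C++ destructor method, which is also used as the ARC automatic dealloc method. If not, it doesn’t need to be called.
Whether the object has ever been referenced by a __weak variable. If it has, any remaining __weak references need to be zeroed. If not, this step can be skipped.

Previously, all of these flags were tracked per-class. For example, if any instance of a class ever had an associated object set on it, then from that point on every instance of that class performed associated object cleanup when deallocating. Tracking the flags for each instance independently helps ensure that only the instances that really need the cleanup pay the performance cost.

Adding it all together, it’s a pretty big win. My casual benchmarking indicates that basic object creation and destruction takes about 380ns on a 5S running in 32-bit mode, but only about 200ns in 64-bit mode. If any instance of the class has ever had a weak reference and an associated object set, the 32-bit time rises to about 480ns, while the 64-bit time stays around 200ns for any instances that were not themselves the target.

In short, the improvements to Apple’s runtime mean that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that’s a big deal.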

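Conclusion

The “64-bit” A7 is not just a marketing gimmick, but neither is it an amazing breakthrough that enables a new class of applications. The truth, as often happens, lies in between.

The simple fact of moving to 64-bit does little. It makes computations slightly faster in some cases, raises memory usage for most programs, and makes certain programming techniques more viable. Overall, it is not hugely significant.

The ARM architecture changed a number of other things in its transition to 64-bit. More registers and a revised, streamlined instruction set give 64-bit ARM a nice performance gain over 32-bit ARM.

Apple took advantage of the transition to make some changes of its own. The biggest is the inline retain count, which eliminates the costly hash table lookup for retain and release operations in the common case. Since those operations are so common in most Objective-C code, this is a big win. Per-object resource cleanup flags make object deallocation quite a bit faster in certain cases. All in all, the cost of creating and destroying an object is roughly cut in half. Tagged pointers also bring a nice performance win as well as reduced memory use.

ARM64 is a welcome addition to Apple’s hardware. We all knew it would happen eventually, but few expected it this soon. It’s here now, and it’s great.

That’s it for today. Check back next time for more adventures in the land of hardware and software. Friday Q&A is driven by reader suggestions, so if an idea pops into your head between now and then for a topic you’d like to see covered here, please send it in!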

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html

Ever since the iPhone 5S was announced a couple of weeks ago, the world of tech journalism has been filled with massive quantities of misinformation. Unfortunately, good information takes time, and the world of tech journalism is more about speed than accuracy. Today, as suggested by a variety of readers, I’m going to give the rundown of just what 64-bit ARM in the iPhone 5S means for you, in terms of performance, capabilities, and development.

“64-bit”

Let’s start by talking about the general term “64-bit” and what it means. There’s a lot of confusion around this term, and a lot of that is because there’s no single agreed-upon definition of it. However, there is generally some consensus about it, even if it’s not universal.

There are two parts of the CPU that “X-bit” usually refers to: the width of the integer registers, and the width of pointers. Thankfully, in most modern CPUs, these widths are the same. “64-bit” then typically means that the CPU has 64-bit integer registers and 64-bit pointers.

It’s also important to point out the things that “64-bit” does not refer to, as there’s a lot of confusion in this area as well. In particular, “64-bit” does not include:

Physical RAM address size. The number of bits used to actually talk to RAM (and therefore the amount of RAM the hardware can support) is decoupled from the question of CPU bitness. ARM CPUs have ranged from 26 bits to 40 bits, and this can be changed independently from the rest.
Data bus size. The amount of data fetched from RAM or cache is likewise decoupled. Individual CPU instructions may request a certain amount of data, but the amount of data actually fetched can be independent, either by splitting the fetch into smaller parts, or fetching more than is necessary. The iPhone 5 already fetches data from memory in 64-bit chunks, and chunk sizes of up to 192 bits exist in the PC world.
Anything related to floating-point. FPU register size and internal design is independent, and ARM CPUs have had 64-bit FPU registers since well before ARM64.

Generic Advantages and Disadvantages

If we compare otherwise-identical 32-bit and 64-bit CPUs, there isn’t a whole lot of difference, which is a big part of the confusion around the significance of Apple’s move to 64-bit ARM. The move is important, but largely because of specifics of the ARM processor and Apple’s use of it.

Still, there are some differences. Perhaps the most obvious is that 64-bit integer registers make it more efficient to work with 64-bit integers. You can still work with 64-bit integers on a 32-bit processor, but it typically entails working with it in two 32-bit pieces, which means that arithmetic can take substantially longer. 64-bit CPUs can typically perform arithmetic on 64-bit quantities just as fast as on 32-bit quantities, so code that does heavy manipulation of 64-bit integers will run much faster.
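To make the “two 32-bit pieces” point concrete, here is a minimal sketch in Python that simulates how a 64-bit addition has to be assembled from 32-bit operations, with a carry passed from the low half to the high half. The function names are illustrative only, and Python integers stand in for hardware registers.

```python
MASK32 = 0xFFFFFFFF  # keeps only the low 32 bits, like a 32-bit register


def add64_with_32bit_registers(a, b):
    """Add two 64-bit unsigned integers using only 32-bit-sized pieces."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    # Step 1: add the low halves; anything above bit 31 is the carry.
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    lo = lo_sum & MASK32

    # Step 2: add the high halves plus the carry, wrapping at 32 bits.
    hi = (a_hi + b_hi + carry) & MASK32

    return (hi << 32) | lo


def add64_native(a, b):
    """What a 64-bit CPU does in one instruction: a single wrapping add."""
    return (a + b) & 0xFFFFFFFFFFFFFFFF
```

The 32-bit version needs two additions and carry handling for every 64-bit add, which is why such code benefits from native 64-bit registers.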

Although 64-bit has no bearing on the amount of RAM that can be used by the CPU itself, it can make it much easier to use large amounts of RAM within a single program. A single program running on a 32-bit CPU only has 4GB of address space. Chunks of that address space are taken up by the operating system and standard libraries and such, typically leaving anywhere from 1-3GB available for use. If a 32-bit system has more than 4GB of RAM, taking advantage of all of it from a single program is tough. You have to resort to shenanigans like asking the operating system to map chunks of memory in and out of your process as you need them, or splitting your program into multiple processes.

This takes a lot of extra programming effort and can slow things down, so few programs actually do it. In practice, a 32-bit CPU limits individual programs to using 1-3GB of RAM each, and the advantage of having more RAM is the ability to run multiple such programs simultaneously, and the ability to cache more data from disk. This is still useful, but there are cases where the ability of a single program to use more RAM is needed.

The increased address space is also useful even on a system without that much RAM. Memory-mapped files are a handy construct, where the contents of a file are logically mapped into a process’s memory space, even though physical RAM is not necessarily allocated for the entire file. On a 32-bit system, a program can’t memory map large files (over, say, a few hundred megabytes) reliably. On a 64-bit system, the available address space is much larger, so there’s no concern with running out.
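As a rough illustration of memory mapping, the sketch below uses Python’s standard `mmap` module to read a few bytes from the end of a file without reading the rest of it. The helper names `read_slice_via_mmap` and `make_demo_file` are made up for this example.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` by mapping the file into memory."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this reserves address space,
        # and pages are loaded from disk only when they are touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def make_demo_file(size):
    """Create a temporary file of `size` bytes with a marker at the end."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * (size - 4) + b"TAIL")
    return path
```

The mapping consumes address space equal to the file size, which is exactly the resource that runs short in a 32-bit process and is plentiful in a 64-bit one.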

The increased pointer size comes with a substantial downside: otherwise-identical programs will use more memory, perhaps a lot more, when running on a 64-bit CPU. Pointers have to be stored in memory as well, and each pointer takes twice the amount of memory. Pointers are really common in most programs, so that can make a substantial difference. Increased memory usage can put more pressure on caches, causing reduced performance.
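A small back-of-the-envelope calculation shows how quickly this adds up for pointer-heavy data. The sketch assumes a doubly linked list node with two pointers and a 4-byte payload, padded to a multiple of the pointer size as a typical C compiler would do; the node layout is an assumption chosen for illustration.

```python
def node_size(pointer_bytes, payload_bytes=4):
    """Size of a doubly linked list node: two pointers plus a small payload.

    The total is rounded up to a multiple of the pointer size, which is
    how a typical C compiler would pad the struct (an assumption here).
    """
    raw = 2 * pointer_bytes + payload_bytes
    padding = (-raw) % pointer_bytes
    return raw + padding


def growth_factor(payload_bytes=4):
    """How much bigger the node gets when pointers go from 4 to 8 bytes."""
    return node_size(8, payload_bytes) / node_size(4, payload_bytes)
```

Under these assumptions the node grows from 12 to 24 bytes, so half as many nodes fit in the same amount of cache.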

In short: 64-bit can increase performance for certain types of code, and makes certain programming techniques, like memory mapped files, more viable. However, it can also decrease performance due to increased memory usage.

ARM64

The iPhone 5S’s 64-bit CPU is not merely a regular ARM processor with wider registers. The 64-bit ARM architecture includes substantial changes from the 32-bit version.

First, a note on the name: the official name from ARM is “AArch64”, but this is a silly name that pains me to type. Apple calls it ARM64, and that’s what I will call it too.

ARM64 doubles the number of integer registers over 32-bit ARM. 32-bit ARM provides 16 integer registers, of which one is a dedicated program counter, two more are given over to a stack pointer and link register, and the other 13 are available for general use. With ARM64, there are 32 integer registers, with a dedicated zero register, link register, and frame pointer register. One further register is reserved for the platform, leaving 28 general purpose integer registers.

ARM64 also increases the number of floating-point registers available. The floating point registers on 32-bit ARM are a bit odd, so it’s tough to compare. It has 32 32-bit floating point registers which can also be viewed as 16 overlapped 64-bit registers, and there are 16 additional independent 64-bit registers. The 32 total 64-bit registers can also be viewed as 16 overlapped 128-bit registers. ARM64 simplifies this to 32 128-bit registers, which can also be used for smaller data types, and there’s no overlapping.

The register count can strongly influence performance. Memory is extremely slow compared to CPUs, and reading from and writing to memory takes a long time compared to how long it takes the CPU to process an instruction. CPUs try to hide this with layers of caches, but even the fastest layer of cache is slow compared to internal CPU registers. More registers means more data can be kept purely CPU-internal, reducing memory accesses and increasing performance.

Just how much of a difference this makes will depend on the specific code in question, as well as how good the compiler is at optimizing it to make the best use of available registers. When the Intel architecture moved from 32-bit to 64-bit, the number of registers was doubled from 8 to 16, and this made for a substantial performance improvement. ARM already had substantially more registers than the 32-bit Intel architecture, so the impact of additional registers is smaller, but it’s still a helpful change.

ARM64 also brings some significant changes to the instruction set beyond the increased number of registers.

Most 32-bit ARM instructions can be executed conditionally based on the state of a condition register at the time of execution. This allows compiling if statements and similar without requiring branching. Intended to increase performance, it must have been causing more trouble than it was worth, as ARM64 eliminates conditional execution for nearly all instructions, keeping only conditional branches and a small set of conditional select and compare instructions.

ARM64’s NEON SIMD unit provides full double-precision IEEE754 compliance, whereas the 32-bit version of NEON only supports single-precision, and leaves out some of the harder, more obscure bits of IEEE754.

ARM64 adds specialized instructions for AES encryption and SHA-1 and SHA-256 cryptographic hashes. Not important in general, but potentially a big win if you happen to be doing those things.

Overall, by far the most important changes are the greatly increased number of general-purpose registers, and support for full IEEE754-compliant double-precision arithmetic in NEON. These changes could allow for considerable performance increases in a lot of code.

32-bit Compatibility

It’s important to note that the A7 includes a full 32-bit compatibility mode that allows running normal 32-bit ARM code without any changes and without emulation. This means that the iPhone 5S runs old iPhone apps with no problem and no performance impact compared to other hardware. 32-bit code does potentially run with somewhat reduced performance since it gets none of the advantages of ARM64.

Apple Runtime Changes

Apple takes advantage of architecture changes like this to make changes in their own libraries. Since they don’t need to worry about maintaining binary compatibility across such a change, it’s a good time to make changes that would otherwise break existing apps.

In Mac OS X 10.7, Apple introduced tagged pointers. Tagged pointers allow certain classes with small amounts of per-instance data to be stored entirely within the pointer. This can eliminate the need for memory allocations for many uses of classes like NSNumber, and can make for a good performance boost. Tagged pointers were only supported on 64-bit, partly due to binary compatibility concerns, but partly because 32-bit pointers don’t leave a lot of room left over for actual data once the tag bits are accounted for. Presumably because of that, iOS never got tagged pointers. However, on ARM64, the Objective-C runtime includes tagged pointers, with all of the same benefits they’ve brought to the Mac.
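The idea of a tagged pointer can be sketched with plain integer arithmetic. The bit layout below (a tag in the low bit, a 3-bit class index, and a 60-bit payload) is hypothetical and chosen only for illustration; it is not Apple’s actual encoding, and all names are invented for this example.

```python
TAG_BIT = 0x1          # hypothetical: low bit set means "not a real address"
KIND_SHIFT = 1         # hypothetical: 3 bits identifying the class
KIND_MASK = 0x7
PAYLOAD_SHIFT = 4      # remaining 60 bits carry the value itself
PAYLOAD_BITS = 60
KIND_SMALL_INT = 1     # hypothetical class index for small integers


def make_tagged_int(value):
    """Pack a small non-negative integer into a 64-bit pointer-sized word.

    Returns None when the value does not fit, in which case a real
    runtime would allocate an ordinary heap object instead.
    """
    if not 0 <= value < (1 << PAYLOAD_BITS):
        return None
    return (value << PAYLOAD_SHIFT) | (KIND_SMALL_INT << KIND_SHIFT) | TAG_BIT


def is_tagged(word):
    """Real object pointers are aligned, so their low bit is always 0."""
    return (word & TAG_BIT) == TAG_BIT


def tagged_int_value(word):
    """Recover the integer without touching memory at all."""
    assert is_tagged(word)
    assert ((word >> KIND_SHIFT) & KIND_MASK) == KIND_SMALL_INT
    return word >> PAYLOAD_SHIFT
```

With a 32-bit word the same scheme would leave only 28 payload bits, which hints at why the technique is far more attractive with 64-bit pointers.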

Although pointers are 64 bits, not all of those bits are really used. Mac OS X on x86-64, for example, only uses 47 bits of a pointer. iOS on ARM64 uses even less, with only 33 bits of a pointer currently being used. As long as the extra bits are masked off before the pointer is used, they can be used to store other data. This leads to one of the most significant internal changes in the Objective-C runtime in the language’s history.

Repurposed isa Pointer

Much of the information for this section comes from Greg Parker’s article on the relevant changes. Check that out for information straight from the source.

First, a quick refresher: Objective-C objects are contiguous chunks of memory. The first pointer-sized piece of that memory is the isa. Traditionally, the isa is a pointer to the object’s class. For more information on how objects are laid out in memory, see my article on the Objective-C runtime.

Using an entire pointer-sized piece of memory for the isa pointer is a bit wasteful, especially on 64-bit CPUs which don’t use all 64 bits of a pointer. ARM64 running iOS currently uses only 33 bits of a pointer, leaving 31 bits for other purposes. Class pointers are also aligned, meaning that a class pointer is guaranteed to be divisible by 8, which frees up another three bits, leaving 34 bits of the isa available for other uses. Apple’s ARM64 runtime takes advantage of this for some great performance improvements.
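The bit budget in this paragraph can be checked with a few lines of arithmetic. The sketch below also derives a mask that recovers the class address from an isa word that has other data packed around it; the constant and function names are illustrative rather than taken from the runtime.

```python
POINTER_BITS = 64
USED_ADDRESS_BITS = 33       # from the text: iOS on ARM64 uses 33 bits
ALIGNMENT = 8                # class pointers are divisible by 8

unused_high_bits = POINTER_BITS - USED_ADDRESS_BITS      # 31
unused_low_bits = ALIGNMENT.bit_length() - 1             # log2(8) = 3
spare_bits = unused_high_bits + unused_low_bits          # 34

# Mask selecting only the bits that can be nonzero in a class pointer:
# bits 3 through 32 inclusive.
CLASS_MASK = ((1 << USED_ADDRESS_BITS) - 1) & ~(ALIGNMENT - 1)


def class_pointer(isa):
    """Strip the borrowed bits to recover the real class address."""
    return isa & CLASS_MASK
```

Only 30 bits are needed to identify the class, which leaves the other 34 free for flags and counters.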

Probably the most important performance improvement is an inline reference count. Nearly all Objective-C objects are reference counted (the exceptions being constant objects like NSString literals) and retain/release operations to modify the reference count happen extremely frequently. This is especially true with ARC, which emits even more retain/release calls than a typical human programmer. As such, high performance for retain and release is critical.

Traditionally, the reference count is not stored in the object itself. If the isa is the only field every object shares, then there’s simply no room for any additional data. It would be possible to make it so that every object also contains a reference count field, but this would use up a great deal more memory. This is less important today, but it was a pretty big deal in the earlier days of Objective-C. Because of this, the retain count is stored in an external table.

Any time an object is retained, the runtime goes through this procedure:

Fetch a global retain count hash table.
Lock the table to make the operation thread safe.
Look up the retain count of the object in the table.
Increment the count and store the new value back in the table.
Release the table lock.

This is a bit slow! The hash table implementation used for tracking retain counts is fast, for a hash table, but even the best hash tables are slow compared to direct memory access.
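The five steps of the table-based approach can be sketched as follows. This is a simplified model that uses a Python dictionary and a single lock; the names `_retain_counts` and `side_table_retain` are hypothetical, and the real runtime’s table is considerably more elaborate.

```python
import threading

_table_lock = threading.Lock()
_retain_counts = {}   # hypothetical global table: object address -> retains


def side_table_retain(address):
    """Retain via the external table: lock, look up, increment, unlock."""
    with _table_lock:                             # lock the global table
        count = _retain_counts.get(address, 0)    # look up the current count
        _retain_counts[address] = count + 1       # increment and store back
    # leaving the `with` block releases the lock


def side_table_retain_count(address):
    """Read the stored count for an address (0 if it was never retained)."""
    with _table_lock:
        return _retain_counts.get(address, 0)
```

Every retain pays for a lock acquisition and a hash lookup, even though the useful work is a single increment.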

On ARM64, 19 bits of the isa field go to holding the object’s reference count inline. That means that the procedure for retaining an object simplifies to:

Perform an atomic increment of the correct portion of the isa field.

And that’s it! This should be much, much faster.

There is a bit more to it than just that, because of some corner cases that need to be handled. The real code looks more like this:

The bottom bit of the isa indicates whether all this extra data is active for this class. If it’s not active, then fall back to the old hash table approach. This allows for a compatibility mode for classes that fall outside the representable range, or programs that incorrectly assume the isa is a pure class pointer.
If the object is currently deallocating, do nothing.
Increment the retain count, but don’t store it back into the isa just yet.
If it overflowed (an unusual but real possibility with only 19 bits available) then fall back to a hash table.
Perform an atomic store of the new isa value.

Most of this was necessary with the old approach as well, and it doesn’t add too much overhead. The new approach should still be much, much faster.
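The retain procedure with its corner cases might look roughly like the following single-threaded simulation. Only the bottom-bit flag and the 19-bit count width come from the text; the other bit positions, the `Obj` class, and the `side_table` dictionary are assumptions made for this sketch, and plain assignment stands in for the atomic store a real runtime would use.

```python
INLINE_FLAG = 1 << 0            # bottom bit: extra data is active (from the text)
DEALLOCATING = 1 << 43          # hypothetical bit position
COUNT_SHIFT = 45                # hypothetical: top 19 bits hold the count
COUNT_MAX = (1 << 19) - 1

side_table = {}                 # stand-in for the old hash table


class Obj:
    """A toy object whose only field is its 64-bit isa word."""

    def __init__(self, isa):
        self.isa = isa


def retain(obj):
    old = obj.isa
    if not (old & INLINE_FLAG):
        # Compatibility mode: the isa is a plain class pointer.
        side_table[id(obj)] = side_table.get(id(obj), 0) + 1
        return
    if old & DEALLOCATING:
        return                   # object is going away; do nothing
    count = old >> COUNT_SHIFT
    if count == COUNT_MAX:
        # The 19-bit field would overflow; spill to the table instead.
        side_table[id(obj)] = side_table.get(id(obj), 0) + 1
        return
    # A real runtime would publish this with an atomic store or
    # compare-and-swap; plain assignment is a single-threaded stand-in.
    obj.isa = old + (1 << COUNT_SHIFT)
```

In the common case the function touches only the object’s own isa word, with no lock and no table lookup.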

There are several other performance improvements stuffed into the remaining free bits that make deallocating objects faster. There’s potentially a lot of cleanup that needs to be done when an Objective-C object deallocates, and being able to skip unnecessary cleanup can increase performance. These are:

Whether the object ever had any associated objects, set with objc_setAssociatedObject. If not, then associated objects don’t need to be cleaned up.
Whether the object has a C++ destructor method, which is also used as the ARC automatic dealloc method. If not, then it doesn’t need to be called.
Whether the object has ever been referenced by a __weak variable. If it has, then any remaining __weak references need to be zeroed. If not, then this step can be skipped.

Previously, all of these flags were tracked per-class. If any instance of a class ever had an associated object set on it, for example, then every instance of that class would perform associated object cleanup when deallocating from that point on. Tracking them for each instance independently helps ensure that only the instances that really need it take the performance hit.
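A sketch of how per-instance flags let deallocation skip work is shown below. The bit positions and the step descriptions are hypothetical; the point is only that each object carries its own flags, so an object that never used a feature pays nothing for it.

```python
HAS_ASSOC = 1 << 1          # hypothetical bit positions inside the isa
HAS_CXX_DTOR = 1 << 2
WEAKLY_REFERENCED = 1 << 42


def cleanup_steps(isa):
    """List the cleanup work a deallocating object actually needs."""
    steps = []
    if isa & HAS_CXX_DTOR:
        steps.append("run C++ destructor / ARC ivar release")
    if isa & HAS_ASSOC:
        steps.append("remove associated objects")
    if isa & WEAKLY_REFERENCED:
        steps.append("zero remaining __weak references")
    steps.append("free memory")
    return steps
```

An object with no flags set goes straight to freeing its memory, regardless of what other instances of the same class have done.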

Adding it all together, it’s a pretty big win. My casual benchmarking indicates that basic object creation and destruction takes about 380ns on a 5S running in 32-bit mode, while it’s only about 200ns when running in 64-bit mode. If any instance of the class has ever had a weak reference and an associated object set, the 32-bit time rises to about 480ns, while the 64-bit time remains around 200ns for any instances that were not themselves the target.

In short, the improvements to Apple’s runtime make it so that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that’s a big deal.
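The percentage range follows from the casual benchmark figures quoted earlier. A quick calculation, with a helper name invented for this example:

```python
def relative_cost(new_ns, old_ns):
    """Cost of the new path as a percentage of the old one."""
    return 100.0 * new_ns / old_ns


basic = relative_cost(200, 380)        # plain create/destroy
with_flags = relative_cost(200, 480)   # class has seen weak refs and associated objects
```

The two ratios come out at roughly 53% and 42%, so the quoted 40-50% is a rounded summary of these measurements.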

Conclusion

The “64-bit” A7 is not just a marketing gimmick, but neither is it an amazing breakthrough that enables a new class of applications. The truth, as happens often, lies in between.

The simple fact of moving to 64-bit does little. It makes for slightly faster computations in some cases, somewhat higher memory usage for most programs, and makes certain programming techniques more viable. Overall, it’s not hugely significant.

The ARM architecture changed a bunch of other things in its transition to 64-bit. An increased number of registers and a revised, streamlined instruction set make for a nice performance gain over 32-bit ARM.

Apple took advantage of the transition to make some changes of their own. The biggest change is an inline retain count, which eliminates the need to perform a costly hash table lookup for retain and release operations in the common case. Since those operations are so common in most Objective-C code, this is a big win. Per-object resource cleanup flags make object deallocation quite a bit faster in certain cases. All in all, the cost of creating and destroying an object is roughly cut in half. Tagged pointers also make for a nice performance win as well as reduced memory use.

ARM64 is a welcome addition to Apple’s hardware. We all knew it would happen eventually, but few expected it this soon. It’s here now, and it’s great.

That’s it for today. Check back next time for more adventures in the land of hardware and software. Friday Q&A is driven by reader suggestions, so if an idea pops into your head between now and then for a topic you’d like to see covered here, please send it in!