Disassembling the Assembly, Part 1

A translation of Mike Ash's Friday Q&A: Disassembling the Assembly, Part 1

By TommyWu

Original: Friday Q&A 2011-12-16: Disassembling the Assembly, Part 1 (https://www.mikeash.com/pyblog/friday-qa-2011-12-16-disassembling-the-assembly-part-1.html) · Published: 2011-12-16 · Author: Mike Ash · Translator: MiMo (mimo-v2.5-pro); code blocks are kept exactly as in the English original


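As a small change of pace, today's post is written by guest author Gwynne Raskind. My last post touched a bit on disassembling object files, and Gwynne wanted to dive deeply into just how to read the output in detail. Without further ado, I present her wonderful in-depth look at reading x86_64 assembly.

In the December 2 edition of his Friday Q&A series, Michael Ash wrote about several tools for object file analysis, based around a simple piece of sample code which he ran through each tool for examples to show.

His article is lacking in only one respect: it doesn't go into detail about what the assembly language that these tools show actually means. It's just common sense that he didn't; it's an advanced and intricate topic, deserving of an article of its own. I decided to write that article.

The Sample Code

I'll be using exactly the same code that Mike did, replicated here: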

// clang -framework Cocoa -fobjc-arc test.m
#import <Cocoa/Cocoa.h>
@interface MyClass : NSObject
{
NSString *_name;
int _number;
}
- (id)initWithName: (NSString *)name number: (int)number;
@property (strong) NSString *name;
@property int number;
@end
@implementation MyClass
@synthesize name = _name, number = _number;
- (id)initWithName: (NSString *)name number: (int)number
{
if((self = [super init]))
{
_name = name;
_number = number;
}
return self;
}
@end
NSString *MyFunction(NSString *parameter)
{
NSString *string2 = [@"Prefix" stringByAppendingString: parameter];
NSLog(@"%@", string2);
return string2;
}
int main(int argc, char **argv)
{
@autoreleasepool
{
MyClass *obj = [[MyClass alloc] initWithName: @"name" number: 42];
NSString *string = MyFunction([obj name]);
NSLog(@"%@", string);
return 0;
}
}
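
Some things to notice right away:

  • This code uses ARC.

  • Accordingly, this code is 64-bit only and requires a recent version of the Clang compiler.

  • When run, the program will print "Prefixname" twice.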

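A Crash Course in x86 Architecture

Before diving into the assembly language itself, here's a quick lesson in the basics of the x86_64 (aka AMD64) architecture. The official reference manuals can be found at the AMD developer website, and cover in extremely technical detail almost everything you'll ever need to know about the underlying workings of the CPU. Several gaps are filled in by the AMD64 Application Binary Interface Specification, which defines the Application Binary Interface (ABI) for C and C++ programs running in 64-bit mode on an x86_64 processor, whether made by AMD or Intel. The AMD64 specifications document the running of the CPU itself, while the ABI spec defines the conventions used by programs running on the CPU.

Where possible, I will speak in general terms about x86_64 architecture. Very little about how programs work at this level is specific to Mac OS X. While the functions called by the Objective-C runtime are very much OS-specific, the assembly language instructions that call those functions follow the same specifications as on any x86_64 system.

Note: If you're already familiar with such concepts as virtual memory, the stack, the heap, and CPU registers, you can skip this entire section.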

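A Model of Memory

First, we look at the memory model of the computer. The x86_64 architecture specifies a "flat, paged memory model", which in simple terms means that all of the physical memory is laid out as one enormous block, divided up evenly into equally-sized "pages" of a predefined size.

Software running on the x86_64 architecture can address a maximum of 48 bits worth of physical memory; this is less than the 64 bits one might expect because no shipping CPU actually supports that many address lines. Addresses are always 64 bits long, but the top 16 bits of a physical memory address are always zero. The x86_64 specification does not provide for any other use of those 16 bits, such as tagging; they are reserved for a time when future implementations have both the need and the capability of addressing more than 32 TB of physical RAM at a time. (Translator's note: 2^48 bytes is 256 TB rather than 32 TB, and on CPUs with 48-bit addressing the limit applies to virtual addresses, whose top 16 bits must be copies of bit 47 rather than always zero.)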
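The address-width arithmetic can be checked with a few lines of C. This is only a sketch: the helper names are invented for illustration, and the canonical-form check assumes a CPU that implements 48 address bits and requires bits 63 through 47 to match, a rule taken from the AMD64 manuals rather than from this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes reachable with an address that has `bits` significant bits. */
static uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);                  /* 1 << 64 would not fit in 64 bits */
    return (uint64_t)1 << bits;
}

/* Assumption: 48 implemented address bits. An address is "canonical"
 * when bits 63..47 are all zero or all one. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits from bit 47 upward */
    return top == 0 || top == 0x1FFFF;
}
```

Shifting `addressable_bytes(48)` right by 40 converts bytes to terabytes (2^40 bytes each), which is a quick way to sanity-check any size quoted for a given address width.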

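x86_64 also requires the implementation to provide virtual memory and protected memory. This means that the OS can set up the system in such a way that every process it runs sees its own complete 64-bit address space (virtual memory), and only its own (protected memory). The OS is responsible for ensuring that an application gets the memory it actually uses by the use of "paging". The CPU intercepts all memory accesses by userland processes ("virtual addresses") and translates them to physical addresses with the help of the OS. A more in-depth description of how virtual memory and paging work is beyond this article; for now, it's enough to understand that every application has its own individual 64-bit memory space and can't see any other process' space.

Note: This is not the same as the "virtual memory" which you may be familiar with, using space on the computer's hard drive as extra RAM, though that kind of virtual memory (a "paging file", or "swap space") is implemented in part by use of the CPU's virtual memory system.

This enormous 64 bits worth of address space is divided up into two areas: the stack and the heap. The stack is an area set aside high in the address space (typically high, anyway; in practice it can be just about anywhere) for the use of subroutine calls and local variable storage. The stack always grows downward; as the amount of information on the stack increases, the address of the top of the stack decreases. On older systems with smaller memory models, it was possible for the stack to grow too far downward and collide with other areas, but while it's still technically possible for this to happen, other things would go wrong long before a heap collision (in particular, the stack would run off the edge of its allocated memory pages and cause a protection fault). The CPU has a few instructions specifically designed for manipulating the stack, though they often go unused in favor of more efficient methods in modern code. You can think of the stack as a moderately large chunk of memory allocated by the system at the launch of your program.

The heap effectively consists of every area of memory that is not the stack; memory from the heap is allocated at runtime by the system for the process' use. The heap contains the stack, in fact, though they are usually considered conceptually separate. All of your executable code is loaded into a section of the heap, as well as copies of any libraries your executable links to. Note: These are not actually copies, as it would be ridiculously inefficient to copy every library for every loaded process, but it's easier to just think of them as copies until you have a good grasp of virtual memory. Memory allocated by your process during its execution also comes from the heap.

The CPU and Its Registers

The CPU is the chip that actually does all the work. It fetches, decodes, and executes an instruction stream; what this means in practical terms is you give it a bunch of machine code and it does what the code tells it. Machine code is a bunch of bytes generated from source code by a compiler. A human could build machine code by hand, but it would be an exceedingly arduous process, and it's rarely if ever worth the time to do with any computer made in the last thirty years or so. The intermediate step between source code and machine code is, of course, assembly language, and humans do spend a lot of time working in assembly language for various reasons, mostly having to do with either things source code can't do or things compilers can't optimize as well as humans - yet.

Note: In fact, most compilers work by compiling the source code from C or another high-level language to assembly language and then translating the assembly language to machine code (along with some other intermediate steps). However, you never see the assembly language unless you ask to.

A register is an area of storage set aside inside the CPU itself for effectively instantaneous access. Registers serve a large variety of purposes. An x86_64 CPU has a set of at least 100 registers - whew! Fortunately, an application developer, even working in assembly language, rarely has to be concerned with more than about 20 of them at most. The majority of the registers (including the control, debug, table descriptor, performance, and machine-check registers, to name a few) are accessible only to kernel code. Most of the rest, such as the mmx, xmm, and ymm registers, are only used by vector code, and the fpr registers are only used for floating-point calculations (there are exceptions, but as a rule of thumb it's a safe starting point). In addition, only parts of the rflags register are ever used by application code.

One of the quirks of the x86 architecture, ever since the original 16-bit 8086 instruction set, is that many of the same registers can be addressed by different names which determine which part of the register is being read or written. Most of the general-purpose registers can be addressed down to a single byte. For example, the rax (accumulator) register, the first 64-bit general-purpose register, can also be addressed as eax (the low 32 bits of rax), ax (the low 16 bits of rax), ah (the second lowest 8 bits of rax), and al (the low 8 bits of rax). This capability is useful for handling smaller data types - for example, handling a signed 32-bit addition requires only a single add instruction based on 32-bit register names rather than several instructions designed to emulate 32-bit sign extension and integer overflow on 64-bit registers.

A register named r*x is 64 bits; e*x is 32 bits, *x is 16 bits, and *h or *l are 8 bits. For the r8-r15 registers, these names are instead rN (64 bits), rNd (32 bits), rNw (16 bits), and rNb (8 bits). rip and rflags can only be accessed as 64-bit registers, and the 8-bit versions of rsi, rdi, rsp, and rbp are named sil, dil, spl, and bpl.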
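The way one register answers to several names can be mimicked in C with shifts and masks. The sketch below is not how a CPU is built; the function names are hypothetical, and the behavior of write_eax (a 32-bit write clearing the upper half in 64-bit mode) is a detail from the architecture manuals that the text above does not cover.

```c
#include <assert.h>
#include <stdint.h>

/* Read-only "views" of a 64-bit rax value. */
static uint32_t view_eax(uint64_t rax) { return (uint32_t)rax; }        /* low 32 bits */
static uint16_t view_ax(uint64_t rax)  { return (uint16_t)rax; }        /* low 16 bits */
static uint8_t  view_ah(uint64_t rax)  { return (uint8_t)(rax >> 8); }  /* bits 8..15  */
static uint8_t  view_al(uint64_t rax)  { return (uint8_t)rax; }         /* low 8 bits  */

/* Writing al replaces only the low byte and leaves the rest of rax alone. */
static uint64_t write_al(uint64_t rax, uint8_t value)
{
    return (rax & ~(uint64_t)0xFF) | value;
}

/* Writing eax in 64-bit mode zeroes the upper 32 bits of rax. */
static uint64_t write_eax(uint64_t rax, uint32_t value)
{
    (void)rax;                          /* the old contents are discarded */
    return (uint64_t)value;
}

/* A small usage example. */
static void demo_register_views(void)
{
    uint64_t rax = 0x1122334455667788ULL;
    assert(view_ah(rax) == 0x77);
    assert(view_al(write_al(rax, 0x00)) == 0x00);
}
```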

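The registers that a userland process is concerned with on a regular basis are:

  • rax, rbx, rcx, rdx, r8-r15 - These are general-purpose registers, used for just about anything at any given moment, though the ABI locks these down to considerably more specific purposes. The first four can also be called the accumulator (rax), the base register (rbx), the count register (rcx), and the data register (rdx).

  • rsi, rdi - These are technically index registers ("source index" and "destination index"), but in modern code they are typically used as general-purpose registers, within the specification of the ABI.

  • rbp, rsp - The "base pointer" and "stack pointer" registers. These are used for accessing the stack; the CPU's stack instructions will always assume that rsp holds the address of the top of the stack.

  • rflags - The flags register, holding a long list of flags indicating the results of calculations done by instructions. The flags register cannot be directly addressed. Operations affected by CPU flags are generally part of the instructions themselves; for instance, conditional jump instructions work differently depending on the current flags, and arithmetic operations change the flags. Certain instructions affect the flags directly, such as stc and clc, which respectively set and clear the Carry Flag. It is also possible to read the flags register directly by pushing it to the stack and write to it directly by popping from the stack into it. The flags a userland process can affect are:

      • CF - Carry Flag. CF is set when the result of an addition is a carry or the result of a subtraction is a borrow. It is also affected by arithmetic bit shifting instructions and bit test instructions, cleared by bitwise logic instructions, and manipulated directly by the stc, clc, and cmc instructions.

      • PF - Parity Flag. PF is set when there are an even number of 1 bits in the low byte of the last result of some operations. It can be used for parity checks.

      • AF - Auxiliary Carry Flag. AF is set when an arithmetic or BCD (binary-coded decimal) operation generates a carry or borrow from bit 3 of the result. Its use is limited to doing decimal math directly on the CPU, and it sees little use.

      • ZF - Zero Flag. ZF is set when the last arithmetic operation had a result of zero. Compare and test instructions also set or clear ZF appropriately. It is often used for equality testing, as it is set when comparing two equal operands.

      • SF - Sign Flag. SF is set if the last arithmetic operation had a negative result. More exactly, after an arithmetic operation, SF is set to the value of the most significant bit of the result.

      • DF - Direction Flag. DF is used to control whether the string instructions increment or decrement rsi and rdi during their operation, and can be manipulated by the std and cld instructions. This flag is rarely used in modern code, since the string instructions themselves are no longer common.

      • OF - Overflow Flag. OF is set when the sign of the result of the last signed arithmetic operation differs from the signs of both source operands. This means the result was too large or too small to fit in the destination.

  • rip - The instruction pointer register. It holds the memory address of the instruction the CPU is currently executing. On x86_64, rip can be addressed directly, but only as a memory offset. To write to rip, you must execute one of the many control transfer instructions. As instructions execute, rip increases by the size of each instruction (instruction sizes vary widely on x86), except for control transfer instructions, which work by changing rip according to the transfer target.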
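To make the flag definitions concrete, here is a rough C model of the flags a 32-bit addition would leave behind. It is a sketch under stated assumptions: the struct and function names are made up, AF and DF are left out, and the model covers addition only, so it should not be read as a description of every flag-setting instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct add_flags { bool cf, pf, zf, sf, of; };

/* Model of CF, PF, ZF, SF, and OF after adding two 32-bit values. */
static struct add_flags add32_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                 /* wraps modulo 2^32, as the CPU does */
    struct add_flags f;

    f.cf = r < a;                       /* unsigned carry out of bit 31 */
    f.zf = r == 0;
    f.sf = (r >> 31) & 1;               /* copy of the result's top bit */
    /* Signed overflow: both inputs share a sign that the result lacks. */
    f.of = ((~(a ^ b) & (a ^ r)) >> 31) & 1;

    /* PF looks only at the low byte: set for an even number of 1 bits. */
    uint8_t low = (uint8_t)r;
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;                    /* bit 0 is now the XOR of all 8 bits */
    f.pf = !(low & 1);
    return f;
}

/* Usage: 0x7FFFFFFF + 1 overflows as a signed add but produces no carry. */
static void demo_add_flags(void)
{
    struct add_flags f = add32_flags(0x7FFFFFFFu, 1u);
    assert(f.of && f.sf && !f.cf && !f.zf);
}
```

Comparing the two classic edge cases, 0xFFFFFFFF + 1 (carry without overflow) and 0x7FFFFFFF + 1 (overflow without carry), is a handy way to keep CF and OF apart.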

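Calling Conventions

An architecture's calling conventions, which are usually what people mean when they say "the ABI", specify the basic ways in which functions receive parameters, return values, manage the stack, and so on; none of this is part of the CPU architecture itself. The calling conventions for x86_64 are quite complex, so I will give a simplified version here that is sufficient to understand all of the sample code.

Fortunately, no function in the sample code takes non-integer parameters or a large number of parameters. One might immediately protest that char **, NSString *, and id are certainly not integers! For the purposes of parameter passing, however, an integer is any value that fits within the bit width of the architecture, which for x86_64 is 64 bits. Pointers are exactly that size, and int is smaller (x86_64 is an LP64 architecture, meaning long is 64 bits while int is 32 bits).

Integer parameters to functions are passed in a sequence of registers. The first parameter goes in rdi, the second in rsi, the third in rdx, and then rcx, r8, and r9 in that order. If there are more integer parameters than registers, the remaining ones are pushed onto the stack in right-to-left (reverse) order.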
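The register sequence can be written down as a small lookup table. The sketch below simply encodes the order given in the text; the identifiers are hypothetical, and it ignores floating-point, vector, and aggregate arguments, which follow other rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Integer/pointer argument registers, in the order described above. */
static const char *const kIntArgRegs[] = { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };

/* Where the index-th (0-based) integer argument travels: a register
 * name, or "stack" once all six registers are taken. */
static const char *int_arg_location(size_t index)
{
    size_t count = sizeof kIntArgRegs / sizeof kIntArgRegs[0];
    return index < count ? kIntArgRegs[index] : "stack";
}

/* Convenience check used in the example below. */
static int int_arg_is_in(size_t index, const char *where)
{
    assert(where != NULL);
    return strcmp(int_arg_location(index), where) == 0;
}

/* For [obj initWithName:@"name" number:42] the hidden self and _cmd
 * arguments come first, so name is argument 2 and number is argument 3. */
static void demo_init_call(void)
{
    assert(int_arg_is_in(0, "rdi"));    /* self   */
    assert(int_arg_is_in(1, "rsi"));    /* _cmd   */
    assert(int_arg_is_in(2, "rdx"));    /* name   */
    assert(int_arg_is_in(3, "rcx"));    /* number */
}
```

This matches what the disassembly of main does before calling objc_msgSend: the receiver lands in rdi, the selector in rsi, @"name" in rdx, and 42 in ecx, the low half of rcx.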

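Beyond that, the only calling-convention quirk we need to care about for this code is the parameter-passing sequence for variadic functions. A variadic function is one that uses the stdarg interface to accept a variable number of parameters; in our sample, NSLog is such a function. At the assembly level, there is only one unusual thing about how a variadic function receives its parameters: the byte value of the al register (the low 8 bits of rax) specifies the number of vector registers used to pass arguments to the function. Since the sample code uses no vector registers, this value is always zero.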
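For readers who have not used the stdarg interface, a minimal variadic function in C looks like the following. sum_ints is a hypothetical example, not something from the sample program; compiling a call to it on x86_64 should show the same kind of al setup before the call instruction that NSLog gets, though the exact instruction the compiler picks may differ.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments passed after the fixed parameter. */
static int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    assert(count >= 0);
    va_start(args, count);              /* begin walking the variable arguments */
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);     /* fetch the next argument as an int */
    va_end(args);
    return total;
}
```

Because every argument here is an int, no vector registers carry arguments, so the caller would report zero in al, just as main does for NSLog.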

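Finally, functions return simple integer values (again, for this discussion pointers count as integer values) in the rax register, and in a few cases also in rdx.

The full calling conventions are far more complicated; if you're interested, have a look at the AMD64 Application Binary Interface Specification.

The Assembly Language

The full disassembly of the program is 645 lines long. For clarity, I won't paste all of it here; instead I'll explain it while following the flow of the program. You can produce it yourself by running /usr/bin/clang -S test.m -o test.s -fobjc-arc in the directory where you compiled the sample code and then reading test.s. This is the compiler's own generated assembly, and it is better annotated than anything obtained any other way, since the compiler doesn't have to guess at any names or locations.

Looking at the disassembly, one might notice that it contains eight functions. Eight? The sample code has only three! Where did the other five come from? Four of them are methods synthesized by the compiler from the @synthesize directive, and the [MyClass .cxx_destruct] method was created by the compiler to perform C++ and ARC cleanup.

main

The code for the main function is as follows: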

int main(int argc, char **argv)
{
@autoreleasepool
{
MyClass *obj = [[MyClass alloc] initWithName: @"name" number: 42];
NSString *string = MyFunction([obj name]);
NSLog(@"%@", string);
return 0;
}
}

The compiler's assembly-language output, stripped of several confusing directives for brevity:

_main:
pushq %rbp
movq %rsp, %rbp
subq $96, %rsp
leaq L__unnamed_cfstring_23(%rip), %rax
leaq L__unnamed_cfstring_26(%rip), %rcx
movl $42, %edx
leaq l_objc_msgSend_fixup_alloc(%rip), %r8
movl $0, -4(%rbp)
movl %edi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rax, -48(%rbp) ## 8-byte Spill
movq %rcx, -56(%rbp) ## 8-byte Spill
movq %r8, -64(%rbp) ## 8-byte Spill
movl %edx, -68(%rbp) ## 4-byte Spill
callq _objc_autoreleasePoolPush
movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
movq %rcx, %rdi
movq -64(%rbp), %rsi ## 8-byte Reload
movq %rax, -80(%rbp) ## 8-byte Spill
callq *l_objc_msgSend_fixup_alloc(%rip)
movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
movq %rax, %rdi
movq -56(%rbp), %rdx ## 8-byte Reload
movl -68(%rbp), %ecx ## 4-byte Reload
callq _objc_msgSend
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
movq %rax, %rdi
callq _objc_msgSend
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %rdi
movq %rax, -88(%rbp) ## 8-byte Spill
callq _MyFunction
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, -32(%rbp)
movq -88(%rbp), %rax ## 8-byte Reload
movq %rax, %rdi
callq _objc_release
movq -32(%rbp), %rsi
movq -48(%rbp), %rdi ## 8-byte Reload
movb $0, %al
callq _NSLog
movl $0, -4(%rbp)
movl $1, -36(%rbp)
movq -32(%rbp), %rdx
movq %rdx, %rdi
callq _objc_release
movq -24(%rbp), %rdx
movq %rdx, %rdi
callq _objc_release
movq -80(%rbp), %rdi ## 8-byte Reload
callq _objc_autoreleasePoolPop
movl -4(%rbp), %eax
addq $96, %rsp
popq %rbp
ret

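Whew! That's a pretty long main function in assembly, isn't it? There are a few important things to recognize here:

  • Per the ABI, rdi, the first register for integer/pointer parameters, contains the value of argc.
  • Likewise, rsi contains the value of argv.
  • By the same reasoning, rdx holds the value of envp. This is true even though envp isn't declared as a parameter of main!
  • Finally, rcx holds the value of a more mysterious "exec_path" parameter, whose existence I only discovered by peeking at the disassembly of the C runtime's start function.
  • And, per x86 convention, rsp points to the top of the stack. Since main is a subroutine of start, the 8 bytes that rsp points to are main's return address, that is, the next instruction in start.

Let's go through it instruction by instruction.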

  • pushq %rbp - A simple start. Saves the base pointer on the stack so that it can be restored later. The ABI says rbp must be preserved across function calls, so since it is about to change, it gets saved first.

  • movq %rsp, %rbp - Copies rsp to rbp. This is part of the standard C function prologue, which sets up stack space for any local variables that for whatever reason do not fit in registers.

  • subq $96, %rsp - In assembly language, a $ followed by a number is a literal decimal number used as an instruction operand. This instruction therefore subtracts 96 from rsp, growing the stack by 96 bytes. That is the amount of stack space the compiler has calculated the rest of the function needs.

  • leaq L__unnamed_cfstring_23(%rip), %rax - Loads the address of L__unnamed_cfstring_23 into rax, using rip as the base address. rip-relative addressing is commonly used to load things like constant strings and selector names, and also for fast branches. This particular load fetches the string @"%@" from where it is stored in the executable. The string will be used later as an argument.

  • leaq L__unnamed_cfstring_26(%rip), %rcx - Same as above, but loads @"name" into rcx.

  • movl $42, %edx - Loads the 32-bit value 42 into edx (the low 32 bits of rdx). This value will also be used later.

  • leaq l_objc_msgSend_fixup_alloc(%rip), %r8 - Fetches the address of the l_objc_msgSend_fixup_alloc symbol from the executable's Objective-C segment and saves it in r8. This address, too, will be used later.

  • movl $0, -4(%rbp) - Stores a 32-bit zero at the bottom of the stack. It helps here to remember that the stack grows downward; since %rbp points at the base of this function's stack frame (its highest address), this instruction sets the last four bytes of the stack to zero. So what does it actually accomplish? As it turns out, for all practical purposes, nothing at all! It is an artifact of the compiler making sure that later instructions don't use a garbage value, as the next instruction shows, even though this value is never read again.

  • movl %edi, -8(%rbp) - Saves edi (the low 32 bits of rdi) on the stack. Since edi is the first integer argument register, this is in effect the value of argc. The previous instruction, which zeroed the last 32 bits of the stack, now makes a little more sense; the same effect could be had with something like *rbp = ((int64_t)argc & 0x00000000FFFFFFFF);, except that sign-extending and ANDing the value of argc would take several more operations. Unfortunately for the unoptimizing compiler's record, this instruction also turns out to be useless, since the value of argc is never actually used.

  • movq %rsi, -16(%rbp) - Saves rsi (currently also known as argv) on the stack. This is the third useless instruction in a row, since argv isn't used either.

  • movq %rax, -48(%rbp) ## 8-byte Spill
    movq %rcx, -56(%rbp) ## 8-byte Spill
    movq %r8, -64(%rbp) ## 8-byte Spill
    movl %edx, -68(%rbp) ## 4-byte Spill

    Saves rax (the string @"%@"), rcx (the string @"name"), r8 (the pointer to l_objc_msgSend_fixup_alloc), and edx (the number 42) on the stack as "spill" values.

    What's a spill, you might ask? A register spill happens when the compiler needs a register to hold a value, usually because a function-call argument has to go in one particular register, and no register is free. The value currently in the needed register is saved to the stack ("spilled") so that it can be restored ("reloaded") later. With optimization turned off, the compiler does no data-flow analysis and so cannot tell that every one of these spills is unnecessary; everything sitting in a useful register gets spilled.

  • callq _objc_autoreleasePoolPush - Makes a subroutine call to objc_autoreleasePoolPush(). A subroutine call consists of two operations that are atomic with respect to other instructions (the call cannot be preempted halfway through): pushing the address of the next instruction to be executed onto the stack, and branching to the address of the first instruction of the called function. Since objc_autoreleasePoolPush() takes no parameters, the contents of most registers don't matter. When it returns, however, rax holds its void * return value, a pointer that serves as an opaque handle to the position of the new autorelease pool on the pool stack. This value is invisible to Objective-C code, which sees only the @autoreleasepool statement.

  • movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
    movq %rcx, %rdi
    movq -64(%rbp), %rsi ## 8-byte Reload
    movq %rax, -80(%rbp) ## 8-byte Spill
    callq *l_objc_msgSend_fixup_alloc(%rip)

    Loads the value located at rip + L_OBJC_CLASSLIST_REFERENCES_$_ into rcx, copies rcx to rdi, reloads the address of l_objc_msgSend_fixup_alloc from the stack into rsi, spills rax (the autorelease pool handle) to the stack, and finally makes a subroutine call through l_objc_msgSend_fixup_alloc.

    L_OBJC_CLASSLIST_REFERENCES_$_ is the symbol for the MyClass class object. Loading it into rcx and then immediately copying it to rdi is the lack of data-flow analysis showing again; the compiler blindly picks the first available register to load the value and only then stores it into the first integer argument register.

    By what rule is rcx the first available register? rax is still in use as a return value until a few instructions later, and rbx is passed over because its value is preserved across function calls, which makes it a very unattractive scratch register.

    At this point the MyClass class object is parameter 1. The reload from the stack pulls the pointer to l_objc_msgSend_fixup_alloc into parameter 2. The spill of rax saves the autorelease pool handle, since rax is about to be clobbered by the subroutine's return value. And l_objc_msgSend_fixup_alloc is a vtable call; the address of the real alloc method will be "fixed up" at runtime as an optimization.

    This sequence therefore amounts to an optimized Objective-C message send. Recall that every Objective-C method has two hidden parameters, self and _cmd. In this case self is [MyClass class], and _cmd is alloc (or more exactly, a vtable pointer to the generic alloc method shared by all classes). A very similar sequence follows.

  • movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
    movq %rax, %rdi
    movq -56(%rbp), %rdx ## 8-byte Reload
    movl -68(%rbp), %ecx ## 4-byte Reload
    callq _objc_msgSend

    Loads the value at rip + L_OBJC_SELECTOR_REFERENCES_27 into rsi, copies rax to rdi, reloads @"name" into rdx, reloads 42 into ecx, and makes a subroutine call to objc_msgSend.

    L_OBJC_SELECTOR_REFERENCES_27 is the selector for [MyClass initWithName:number:], and it goes in rsi (parameter 2). rax holds the return value of alloc, the new MyClass object, which is copied into parameter 1. Parameter 3, rdx, is loaded with the constant NSString @"name", and parameter 4 with the number 42. Finally, objc_msgSend() is called. This is the calling sequence for sending initWithName:@"name" number:42 to the new object. The init method will return the value of self in rax.

  • movq %rax, -24(%rbp)
    movq -24(%rbp), %rax

    Yes, taken together these two instructions are redundant. Saving the value is fine, since -24(%rbp) is used later; unfortunately, reloading it into rax immediately afterward makes no sense.

  • movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
    movq %rax, %rdi
    callq _objc_msgSend

    Hopefully you've got the hang of this by now; this does objc_msgSend(rax, @selector(name));. The return value is, as always, in rax.

  • movq %rax, %rdi
    callq _objc_retainAutoreleasedReturnValue

    This should be clear by now. The call objc_retainAutoreleasedReturnValue(obj); was inserted by ARC to keep the return value of the name method alive, because the temporary variable that the Objective-C compiler implicitly creates to hold that value is implicitly declared __strong.

  • movq %rax, %rdi
    movq %rax, -88(%rbp) ## 8-byte Spill
    callq _MyFunction
    movq %rax, %rdi
    callq _objc_retainAutoreleasedReturnValue

    Saves the return value of name, copies it into the first parameter for MyFunction(), calls MyFunction(), and then calls objc_retainAutoreleasedReturnValue() on its return value.

  • movq %rax, -32(%rbp)
    movq -88(%rbp), %rax ## 8-byte Reload
    movq %rax, %rdi
    callq _objc_release

    Saves the return value of MyFunction(). Then reloads the result of [obj name] and calls objc_release() on it, since ARC has noticed that it is no longer used.

  • movq -32(%rbp), %rsi
    movq -48(%rbp), %rdi ## 8-byte Reload
    movb $0, %al
    callq _NSLog

    A simple call to NSLog(), whose only odd feature is setting al to zero. Because NSLog() is a variadic function, the calling convention says al holds the number of vector registers used in the call. No vector registers are in use, so it is simply set to zero.

  • movl $0, -4(%rbp)
    movl $1, -36(%rbp)

    I have to admit that I can't see at all why the compiler stores the values zero and one into what look like random locations on the stack, either here or anywhere else in main(). Nothing resembling these values is used anywhere in the optimized version of the code. The stored zero is at least used further down, but storing the 1 appears to serve no purpose whatsoever.

  • movq -32(%rbp), %rdx
    movq %rdx, %rdi
    callq _objc_release

    Releases the return value of MyFunction() - you have been keeping a sheet of paper noting which values went to which stack offsets and which registers, right? If not, it's no surprise if you're a bit lost by now.
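For anyone who did not keep that sheet of paper, the bookkeeping can be written down as a table in C. The offsets and sizes below are read off the listing of main shown earlier; the struct, the names, and the consistency check are illustrative additions rather than anything the compiler emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One stack slot in main's frame: byte offset from %rbp, size, contents. */
struct frame_slot { int offset; int size; const char *holds; };

static const struct frame_slot kMainFrame[] = {
    {  -4, 4, "return value of main" },
    {  -8, 4, "argc" },
    { -16, 8, "argv" },
    { -24, 8, "obj" },
    { -32, 8, "string (result of MyFunction)" },
    { -36, 4, "the unexplained 1" },
    { -48, 8, "spill: @\"%@\"" },
    { -56, 8, "spill: @\"name\"" },
    { -64, 8, "spill: address of l_objc_msgSend_fixup_alloc" },
    { -68, 4, "spill: 42" },
    { -80, 8, "spill: autorelease pool handle" },
    { -88, 8, "spill: result of [obj name]" },
};
static const size_t kMainFrameCount = sizeof kMainFrame / sizeof kMainFrame[0];

/* True when every slot lies inside the frame and no two slots overlap. */
static bool frame_is_consistent(const struct frame_slot *s, size_t n, int frame_size)
{
    assert(frame_size > 0);
    for (size_t i = 0; i < n; i++) {
        if (s[i].offset < -frame_size || s[i].offset + s[i].size > 0)
            return false;
        for (size_t j = i + 1; j < n; j++) {
            bool apart = s[i].offset + s[i].size <= s[j].offset ||
                         s[j].offset + s[j].size <= s[i].offset;
            if (!apart)
                return false;
        }
    }
    return true;
}
```

Calling frame_is_consistent(kMainFrame, kMainFrameCount, 96) confirms that all twelve slots fit in the 96 bytes reserved by subq $96, %rsp without overlapping.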

  • movq -24(%rbp), %rdx
    movq %rdx, %rdi
    callq _objc_release

    Now releases obj, the MyClass object we allocated earlier.

  • movq -80(%rbp), %rdi ## 8-byte Reload
    callq _objc_autoreleasePoolPop

    Reloads the autorelease pool handle and pops the pool by calling objc_autoreleasePoolPop(). This is the code inserted for the closing brace } of the @autoreleasepool statement.

  • movl -4(%rbp), %eax - Loads the zero on the stack into eax as the return value of main.

  • addq $96, %rsp - Gives back the 96 bytes of stack space reserved in the prologue.

  • popq %rbp - Pops the original value of rbp off the stack and back into rbp, which also leaves the stack pointer where it was when main was called.

  • ret - Pops the address of the next instruction off the stack and into rip, also known as returning from a subroutine call.
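The way the prologue and epilogue mirror each other is easier to see in a toy model. The C below simulates only rsp, rbp, and a small block of stack memory; the toy_cpu type and its helpers are invented for this sketch, and offsets into an array stand in for real addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct toy_cpu {
    uint8_t  stack[256];
    uint64_t rsp, rbp;      /* offsets into stack[], standing in for addresses */
};

static void push64(struct toy_cpu *c, uint64_t v)
{
    assert(c->rsp >= 8);                /* a real CPU would fault instead */
    c->rsp -= 8;                        /* the stack grows downward */
    memcpy(&c->stack[c->rsp], &v, 8);
}

static uint64_t pop64(struct toy_cpu *c)
{
    uint64_t v;
    assert(c->rsp + 8 <= sizeof c->stack);
    memcpy(&v, &c->stack[c->rsp], 8);
    c->rsp += 8;
    return v;
}

static void prologue(struct toy_cpu *c, uint64_t locals)
{
    push64(c, c->rbp);      /* pushq %rbp       */
    c->rbp = c->rsp;        /* movq  %rsp, %rbp */
    assert(c->rsp >= locals);
    c->rsp -= locals;       /* subq  $96, %rsp  */
}

static void epilogue(struct toy_cpu *c, uint64_t locals)
{
    c->rsp += locals;       /* addq  $96, %rsp  */
    c->rbp = pop64(c);      /* popq  %rbp       */
}
```

Running prologue and then epilogue with the same size leaves both registers exactly as they started, which is the property that lets ret find the return address where callq left it.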

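And that's main()! What a long, convoluted process.

I have to admit here that in one key respect I deliberately made this code harder to follow: I have been using the unoptimized version of the compiler's output. Surprisingly, the code built with -Os is considerably easier to follow, as a great deal of redundant work is eliminated entirely and registers are used far more efficiently. There is also hardly any stack manipulation, since a compiler in optimizing mode is free to use a much larger pool of scratch registers.

I did it this way because there is no point in reading optimized code until you can follow the control flow of an unoptimized routine. Starting with optimized code is a bit like learning to swim in water too shallow to get your face wet, except when the compiler has done something wonderful and tricky for the sake of speed or size, at which point it suddenly becomes a bit like diving headfirst into the deep end of an Olympic pool.

Conclusion

That's all for this article, but it is only the first part of a series. I hope you found it worthwhile to read this far; in part two, I will look at the remaining methods in the sample code, as well as the optimized version of the code and the C runtime's start function.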


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2011-12-16-disassembling-the-assembly-part-1.html

As a small change of pace, today’s post is written by guest author Gwynne Raskind. My last post touched a bit on disassembling object files, and Gwynne wanted to dive deeply into just how to read the output in detail. Without further ado, I present her wonderful in-depth look at reading x86_64 assembly.

In the December 2 edition of his Friday Q&A series, Michael Ash wrote about several tools for object file analysis, based around a simple piece of sample code which he ran through each tool for examples to show.

His article is lacking in only one respect: it doesn’t go into detail about what the assembly language that these tools show actually means. It’s just common sense that he didn’t; it’s an advanced and intricate topic, deserving of an article of its own. I decided to write that article.

The Sample Code

I’ll be using exactly the same code that Mike did, replicated here:

// clang -framework Cocoa -fobjc-arc test.m
#import <Cocoa/Cocoa.h>
@interface MyClass : NSObject
{
NSString *_name;
int _number;
}
- (id)initWithName: (NSString *)name number: (int)number;
@property (strong) NSString *name;
@property int number;
@end
@implementation MyClass
@synthesize name = _name, number = _number;
- (id)initWithName: (NSString *)name number: (int)number
{
if((self = [super init]))
{
_name = name;
_number = number;
}
return self;
}
@end
NSString *MyFunction(NSString *parameter)
{
NSString *string2 = [@"Prefix" stringByAppendingString: parameter];
NSLog(@"%@", string2);
return string2;
}
int main(int argc, char **argv)
{
@autoreleasepool
{
MyClass *obj = [[MyClass alloc] initWithName: @"name" number: 42];
NSString *string = MyFunction([obj name]);
NSLog(@"%@", string);
return 0;
}
}

Some things to notice right away:

  • This code uses ARC.

  • Accordingly, this code is 64-bit only and requires a recent version of the Clang compiler.

  • When run, the program will print “Prefixname” twice.

A Crash Course in x86 Architecture

Before diving into the assembly language itself, here’s a quick lesson in the basics of the x86_64 (aka AMD64) architecture. The official reference manuals can be found at the AMD developer website, and cover in extremely technical detail almost everything you’ll ever need to know about the underlying workings of the CPU. Several gaps are filled in by the AMD64 Application Binary Interface Specification, which defines the Application Binary Interface (ABI) for C and C++ programs running in 64-bit mode on an Intel processor. The AMD64 specifications document the running of the CPU itself, while the ABI spec defines the conventions used by programs running on the CPU.

Where possible, I will speak in general terms about x86_64 architecture. Very little about how programs work at this level is specific to Mac OS X. While the functions called by the Objective-C runtime are very much OS-specific, the assembly language instructions that call those functions follow the same specifications as any x86_64 system.

Note: If you’re already familiar with such concepts as virtual memory, the stack, the heap, and CPU registers, you can skip this entire section.

A Model of Memory

First, we look at the memory model of the computer. The x86_64 architecture specifies a “flat, paged memory model”, which in simple terms means that all of the physical memory is laid out as one enormous block, divided up evenly into equally-sized “pages” of a predefined size. Software running on the x86_64 architecture can address a maximum of 48 bits worth of physical memory; this is less than the 64 bits one might expect due to the fact that no shipping CPU actually supports that many address lines. Addresses are always 64 bits long, but the top 16 bits of a physical memory address are always zero. The x86_64 specification does not provide for any other use of those 16 bits, such as tagging; they are reserved for a time when future implementations have both the need and the capability of addressing more than 32 TB of physical RAM at a time.

x86_64 also requires the implementation to provide virtual memory and protected memory. This means that the OS can set up the system in such a way that every process it runs sees its own complete 64-bit address space (virtual memory), and only its own (protected memory). The OS is responsible for ensuring that an application gets the memory it actually uses by the use of “paging”. The CPU intercepts all memory accesses by userland processes (“virtual addresses”) and translates them to physical addresses with the help of the OS. A more in-depth description of how virtual memory and paging work is beyond this article; for now, it’s enough to understand that every application has its own individual 64-bit memory space and can’t see any other process’ space.

Note: This is not the same as the “virtual memory” which you may be familiar with, using space on the computer’s hard drive as extra RAM, though that kind of virtual memory (a “paging file”, or “swap space”) is implemented in part by use of the CPU’s virtual memory system.

This enormous 64 bits worth of address space is divided up into two areas: The stack and the heap. The stack is an area set aside high in the address space (typically high, anyway; in practice it can be just about anywhere) for the use of subroutine calls and local variable storage. The stack always grows downward; as the amount of information on the stack increases, the address of the top of the stack decreases. On older systems with smaller memory models, it was possible for the stack to grow too far downward and collide with other areas, but while it’s still technically possible for this to happen, other things would go wrong long before a heap collision (in particular, the stack would run off the edge of its allocated memory pages and cause a protection fault). The CPU has a few instructions specifically designed for manipulating the stack, though they often go unused in favor of more efficient methods in modern code. You can think of the stack as a moderately large chunk of memory allocated by the system at the launch of your program.

The heap effectively consists of every area of memory that is not the stack; memory from the heap is allocated at runtime by the system for the process’ use. The heap contains the stack, in fact, though they are usually considered conceptually separate. All of your executable code is loaded into a section of the heap, as well as copies of any libraries your executable links to. Note: These are not actually copies, as it would be ridiculously inefficient to copy every library for every loaded process, but it’s easier to just think of them as copies until you have a good grasp of virtual memory. Memory allocated by your process during its execution also comes from the heap.

The CPU and Its Registers

The CPU is the chip that actually does all the work. It fetches, decodes, and executes an instruction stream; what this means in practical terms is you give it a bunch of machine code and it does what the code tells it. Machine code is a bunch of bytes generated from source code by a compiler. A human could build machine code by hand, but it would be an exceedingly arduous process, and it’s rarely if ever worth the time to do with any computer made in the last thirty years or so. The intermediate step between source code and machine code is, of course, assembly language, and humans do spend a lot of time working in assembly language for various reasons, mostly having to do with either things source code can’t do or things compilers can’t optimize as well as humans - yet.

Note: In fact, most compilers work by compiling the source code from C or another high-level language to assembly language and then translating the assembly language to machine code (along with some other intermediate steps). However, you never see the assembly language unless you ask to.

A register is an area of storage set aside inside the CPU itself for effectively instantaneous access. Registers serve a large variety of purposes. An x86_64 CPU has a set of at least 100 registers - whew! Fortunately, an application developer, even working in assembly language, rarely has to be concerned with more than about 20 of them at most. The majority of the registers (including the control, debug, table descriptor, performance, and machine-check registers, to name a few) are accessible only to kernel code. Most of the rest, such as the mmx, xmm, and ymm registers, are only used by vector code, and the fpr registers are only used for floating-point calculations (there are exceptions, but as a rule of thumb it’s a safe starting point). In addition, only parts of the rflags register are ever used by application code.

One of the quirks of the x86 architecture, ever since the original 16-bit 8086 instruction set, is that many of the same registers can be addressed by different names which determine which part of the register is being read or written. Most of the general-purpose registers can be addressed down to a single byte. For example, the rax (accumulator) register, the first 64-bit general-purpose register, can also be addressed as eax (the low 32 bits of rax), ax (the low 16 bits of rax), ah (the second lowest 8 bits of rax), and al (the low 8 bits of rax). This capability is useful for handling smaller data types - for example, to handle a signed 32-bit addition requires only a single add instruction based on 32-bit register names rather than several instructions designed to emulate 32-bit sign extension and integer overflow on 64-bit registers.

A register named r*x is 64 bits; e*x is 32 bits, *x is 16 bits, and *h or *l are 8 bits. For the r8-r15 registers, these names are instead rN (64 bits), rNd (32 bits), rNw (16 bits), and rNb (8 bits). rip and rflags can only be accessed as 64-bit registers, and the 8-bit versions of rsi, rdi, rsp, and rbp are named sil, dil, spl, and bpl.

The registers that a userland process is concerned with on a regular basis are:

  • rax, rbx, rcx, rdx, r8-r15 - These are general-purpose registers, used for just about anything at any given moment, though the ABI locks these down to considerably more specific purposes. These registers can also be called the accumulator (rax), the base register (rbx), the count register (rcx), and the data register (rdx).

  • rsi, rdi - These are technically index registers (‘source index’ and ‘destination index’), but in modern code they are typically used as general purpose registers, within the specification of the ABI.

  • rbp, rsp - The “base pointer” and “stack pointer” registers. These are used for accessing the stack; the CPU’s stack instructions will always assume that rsp holds the address of the top of the stack.

  • rflags - The flags register, holding a long list of flags indicating the results of calculations done by instructions. The flags register can not be directly addressed. Operations affected by CPU flags are generally part of the instructions themselves; for instance, conditional jump instructions work differently depending on the current flags, and arithmetic operations change the flags. Certain instructions affect the flags directly, such as stc and clc, which respectively set and clear the Carry Flag. It is also possible to read the flags register directly by pushing it to the stack and write to it directly by popping from the stack into it. The flags a userland process can affect are:

  • CF - Carry Flag. CF is set when an addition produces a carry out of the most significant bit or a subtraction requires a borrow. It is also affected by arithmetic bit shifting instructions and bit test instructions, cleared by bitwise logic instructions, and manipulated directly by the stc, clc, and cmc instructions.

  • PF - Parity Flag. PF is set when there are an even number of 1 bits in the low byte of the last result of some operations. It can be used for parity checks.

  • AF - Auxiliary Carry Flag. AF is set when an arithmetic or BCD operation generates a carry or borrow from bit 3 of the result. Its use is limited to doing decimal math directly on the CPU, and it sees little use.

  • ZF - Zero Flag. ZF is set when the last arithmetic operation had a result of zero. Compare and test instructions also set or clear ZF appropriately. It is often used for equality testing, as it is set when comparing two equal operands.

  • SF - Sign Flag. SF is set if the last arithmetic operation had a negative result. More exactly, after an arithmetic operation, SF is set to the value of the most significant bit of the result.

  • DF - Direction Flag. DF is used to control whether the string instructions increment or decrement rsi and rdi during their operation, and can be manipulated by the std and cld instructions. This flag is rarely used in modern code, as the string instructions see little use.

  • OF - Overflow Flag. OF is set when the sign of the result of the last signed arithmetic operation is different from the signs of both source operands. This means that the result was too big or too small to hold in the destination.

  • rip - The instruction pointer register. This holds the memory address of the next instruction the CPU will execute. rip can be addressed directly in x86_64, but only for use as a base for memory offsets (rip-relative addressing). To write to rip, one must execute one of the many control transfer instructions. As instructions are executed, rip increases by the size of each one (instruction sizes vary widely in the x86 architectures), with the exception of control transfer instructions, which work by changing the value of rip according to the transfer target.
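
To make the flag descriptions above concrete, here is a rough C model of how CF, PF, ZF, SF, and OF come out after a 32-bit addition. This is a sketch for illustration only: the struct and function names are invented here, AF and DF are left out, and the real CPU computes all of this in hardware as a side effect of the add instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the flags this sketch models. */
struct add_flags {
    bool cf, pf, zf, sf, of;
};

/* Models the flags left behind by a 32-bit add of a and b. */
struct add_flags flags_after_add32(uint32_t a, uint32_t b)
{
    uint32_t result = a + b;            /* wraps modulo 2^32, like the CPU */
    uint8_t low = (uint8_t)result;      /* PF only looks at the low byte */
    struct add_flags f;

    /* CF: the unsigned sum wrapped around, i.e. a carry left bit 31. */
    f.cf = result < a;

    /* PF: fold the low byte down to one bit; 0 means an even number of 1s. */
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = (low & 1) == 0;

    /* ZF: the result is zero. */
    f.zf = result == 0;

    /* SF: a copy of the most significant bit of the result. */
    f.sf = (result >> 31) != 0;

    /* OF: both operands share a sign and the result has the other sign. */
    f.of = ((~(a ^ b) & (a ^ result)) >> 31) != 0;

    return f;
}
```

Adding 1 to 0xFFFFFFFF sets CF and ZF but not OF, since the result is a correct signed sum (-1 + 1 = 0). Adding 1 to 0x7FFFFFFF sets OF and SF but not CF, since the signed result overflowed while the unsigned one did not.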

Calling Conventions
The calling conventions of an architecture, which are typically what people mean when they say ABI, specify the ways that functions receive parameters, return values, manage the stack, and other fundamentals not already part of the CPU architecture. x86_64’s calling conventions are somewhat complicated, so I’ll include an abbreviated version here which will get you through all of the sample code.

Conveniently, none of the functions in the sample code take non-integer parameters, or any large number of parameters. At this point, one might immediately protest that char **, NSString *, and id certainly are not integers! However, for the purpose of function parameter passing, an integer is a value that fits within the bit width of the architecture, i.e. 64 bits for x86_64. Pointers are exactly that size, while int is smaller (x86_64 is an LP64 architecture, which means that long is 64 bits, but int is 32).
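
The size claims above are easy to check from C. This sketch assumes an LP64 platform such as x86_64 Mac OS X or Linux; the function names are made up for the illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Each helper reports the width of a type in bits on the build platform. */
size_t bits_in_int(void) { return sizeof(int) * CHAR_BIT; }
size_t bits_in_long(void) { return sizeof(long) * CHAR_BIT; }
size_t bits_in_pointer(void) { return sizeof(void *) * CHAR_BIT; }
```

On an LP64 system these report 32, 64, and 64 respectively, which is why pointers such as char ** and NSString * count as "integers" for parameter passing.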

Integer parameters to functions are passed via a series of registers. The first parameter goes in rdi. The second goes in rsi, the third in rdx, then rcx, r8, and r9, in that order. If there are more integer arguments than that, the remainder are pushed onto the stack in right-to-left (reverse) order.
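
The register order can be written down as a small lookup. This is only a sketch of the rule just described, with an invented function name; it covers integer and pointer arguments and ignores everything else in the ABI.

```c
/* Returns the name of the register that carries the integer argument at the
   given zero-based position, or "stack" once the six registers are used up. */
const char *integer_arg_location(unsigned index)
{
    static const char *const registers[] = {
        "rdi", "rsi", "rdx", "rcx", "r8", "r9"
    };
    const unsigned count = sizeof registers / sizeof registers[0];

    return index < count ? registers[index] : "stack";
}
```

Applied to an Objective-C message send such as initWithName:number:, position 0 (self) lands in rdi, position 1 (_cmd) in rsi, position 2 (the name) in rdx, and position 3 (the number) in rcx.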

Apart from that, the only quirk of the calling conventions we need to be concerned with for this code is the sequence for variadic functions. A variadic function is one which uses the stdarg interface to take a variable number of parameters. In this case, NSLog is the culprit. There’s only one oddity in how variadic functions take parameters, at least at the assembly language level: The byte value of al (the low 8 bits of rax) is used to specify the number of vector registers used to pass arguments to the function. Since no vector registers are used by our sample code, this number is always zero.
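
For readers who have not written one, a variadic function built on the stdarg interface looks roughly like the following. sum_ints is a made-up example, not part of the sample program. Since every argument it takes is an integer, a caller on x86_64 would pass no arguments in vector registers and would set al to zero before the call, just as the sample code does for NSLog; that detail is handled by the compiler and is not visible in the C source.

```c
#include <stdarg.h>

/* Hypothetical example: adds up `count` int arguments passed after `count`. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);   /* fetch the next int argument */
    va_end(args);

    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) returns 6.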

Finally, functions return simple integer values (again, remember that for these purposes, a pointer is an integer value) in rax, or in some rarer cases, rdx.

The complete calling conventions are considerably more complicated; if you’re curious, have a look at the AMD64 Application Binary Interface Specification.

The Assembly Language
The full disassembly of the program is 645 lines long. For sanity’s sake, I won’t be pasting it here. I’ll instead be following along with the code as I explain it. You can disassemble it yourself by running /usr/bin/clang -S test.m -o test.s -fobjc-arc in the directory where you compiled the sample code and viewing the test.s file. This is the compiler’s generated assembly code, which is better annotated than anything else, as the compiler doesn’t have to guess at anything’s name or location.

Looking at the disassembly, one might notice that the code contains eight functions. Eight? The sample code only has three! Where did those other five come from? Four methods are synthesized by the compiler per the @synthesize directive, and the [MyClass .cxx_destruct] method is created by the compiler to do C++- and ARC-related cleanup.

main
The code for main is:

int main(int argc, char **argv)
{
@autoreleasepool
{
MyClass *obj = [[MyClass alloc] initWithName: @"name" number: 42];
NSString *string = MyFunction([obj name]);
NSLog(@"%@", string);
return 0;
}
}

And the compiler’s assembly language output, stripped of several confusing directives for brevity’s sake:

_main:
pushq %rbp
movq %rsp, %rbp
subq $96, %rsp
leaq L__unnamed_cfstring_23(%rip), %rax
leaq L__unnamed_cfstring_26(%rip), %rcx
movl $42, %edx
leaq l_objc_msgSend_fixup_alloc(%rip), %r8
movl $0, -4(%rbp)
movl %edi, -8(%rbp)
movq %rsi, -16(%rbp)
movq %rax, -48(%rbp) ## 8-byte Spill
movq %rcx, -56(%rbp) ## 8-byte Spill
movq %r8, -64(%rbp) ## 8-byte Spill
movl %edx, -68(%rbp) ## 4-byte Spill
callq _objc_autoreleasePoolPush
movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
movq %rcx, %rdi
movq -64(%rbp), %rsi ## 8-byte Reload
movq %rax, -80(%rbp) ## 8-byte Spill
callq *l_objc_msgSend_fixup_alloc(%rip)
movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
movq %rax, %rdi
movq -56(%rbp), %rdx ## 8-byte Reload
movl -68(%rbp), %ecx ## 4-byte Reload
callq _objc_msgSend
movq %rax, -24(%rbp)
movq -24(%rbp), %rax
movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
movq %rax, %rdi
callq _objc_msgSend
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, %rdi
movq %rax, -88(%rbp) ## 8-byte Spill
callq _MyFunction
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue
movq %rax, -32(%rbp)
movq -88(%rbp), %rax ## 8-byte Reload
movq %rax, %rdi
callq _objc_release
movq -32(%rbp), %rsi
movq -48(%rbp), %rdi ## 8-byte Reload
movb $0, %al
callq _NSLog
movl $0, -4(%rbp)
movl $1, -36(%rbp)
movq -32(%rbp), %rdx
movq %rdx, %rdi
callq _objc_release
movq -24(%rbp), %rdx
movq %rdx, %rdi
callq _objc_release
movq -80(%rbp), %rdi ## 8-byte Reload
callq _objc_autoreleasePoolPop
movl -4(%rbp), %eax
addq $96, %rsp
popq %rbp
ret

Whew! main’s pretty long in assembly, huh? There are some important things to recognize here:

  • Per the ABI, rdi is the first argument register for integer/pointer arguments, and contains the value of argc.

  • Likewise, rsi contains the value of argv.

  • Also likewise, rdx has the value of envp. This holds true even though envp is not declared as a parameter to main!

  • Finally, rcx holds the value of a more mysterious “exec_path” parameter, whose presence I only discovered when I peeked at the disassembly of the start function, part of the C runtime.

  • And, per x86 convention, rsp points to the top of the stack. Because main is a subroutine of start, the 8 bytes pointed to by rsp are the return address for main, the next instruction in start.

Let’s take it one instruction at a time.

  • pushq %rbp - Starting off pretty simple. Save the base pointer on the stack so we can restore it later. The ABI specifies that rbp must be preserved across function calls, so since it’s about to change, it gets saved.

  • movq %rsp,%rbp - Copy rsp to rbp. This is part of a standard C function’s prologue, setting up the stack to hold any local variables that aren’t put in registers for whatever reason.

  • subq $96, %rsp - A number preceded by $ in this assembly syntax is a literal (immediate) value used as an operand to an instruction, so this line subtracts 96 from rsp, growing the stack by 96 bytes. This is how much stack space the compiler has determined it will need for the rest of the function.

  • leaq L__unnamed_cfstring_23(%rip), %rax - Load the address of L__unnamed_cfstring_23 into rax, using rip as the base. rip-relative addressing is typically used for loading such things as constant strings and selector names, as well as for fast branches. This particular load grabs the string @"%@" from the place it was stored in the executable. This string will later be used as the first parameter to NSLog().

  • leaq L__unnamed_cfstring_26(%rip), %rcx - Same as above, but loading @"name" into rcx.

  • movl $42,%edx - Load the 32-bit value 42 into edx (the low 32 bits of rdx). This value is also used later.

  • leaq l_objc_msgSend_fixup_alloc(%rip),%r8 - Grab the address of the l_objc_msgSend_fixup_alloc symbol from the Objective-C segment of the executable, and save that address in r8. Once again, this is used later.

  • movl $0, -4(%rbp) - Store a 32-bit zero at the bottom of the stack frame. This serves as a useful reminder that the stack grows downwards; given that %rbp now points to the base of main’s frame, i.e. the highest address in the frame, this line is actually setting the four bytes just below that address to zero. So what does this actually do? As it turns out, for all intents and purposes, it does absolutely nothing! It’s the result of the compiler’s determination to make sure no garbage value gets used later, as seen in the next instruction; the same location is overwritten with another zero before it is finally read back at the end of the function.

  • movl %edi, -8(%rbp) - Save edi, the low 32 bits of rdi, on the stack. As edi is the first integer argument register, this is actually the value of argc. The previous instruction, setting the last 32 bits of the stack to zero, now makes a bit more sense; the same effect could have been achieved by code something like *rbp = ((int64_t)argc & 0x00000000FFFFFFFF);, except that sign-extending and ANDing the value of argc would have been several more operations. Unfortunately for the unoptimizing compiler’s track record, this instruction also turns out to be useless, as the value of argc is never actually used.

  • movq %rsi, -16(%rbp) - Save rsi, also known as argv at the moment, on the stack. A third useless instruction in a row, since argv isn’t used either.

  • movq %rax, -48(%rbp) ## 8-byte Spill
    movq %rcx, -56(%rbp) ## 8-byte Spill
    movq %r8, -64(%rbp) ## 8-byte Spill
    movl %edx, -68(%rbp) ## 4-byte Spill

    Save rax (the string @"%@"), rcx (the string @"name"), r8 (a pointer to l_objc_msgSend_fixup_alloc) and edx (the number 42) on the stack as “spill” values.

    What in the world is a spill value, you might ask? A register spill takes place when the compiler needs a register to store a value in, typically as a parameter to a function call since parameters go in specific registers, and none are available. The value in the needed register is saved on the stack (“spilled”) so it can be restored (“reloaded”) later. In this case, where optimization is shut off, the compiler doesn’t have any of the data-flow analysis it would need to realize that all this spilling is unnecessary, and everything in useful registers gets spilled.

  • callq _objc_autoreleasePoolPush - Make a subroutine call to objc_autoreleasePoolPush(). A subroutine call consists of two operations, performed atomically with respect to other instructions (i.e. they can not be preempted halfway through): Push the address of the next instruction to be executed to the stack, and execute a branch to the address of the first instruction of the called function. Since objc_autoreleasePoolPush() doesn’t take any parameters, what’s in most of the registers doesn’t matter. When it returns, however, rax contains its void * return value, a pointer which acts as an opaque handle to the position of the new autorelease pool on the pool stack. This value is invisible to the Objective-C code, which sees only the @autoreleasepool statement.

  • movq L_OBJC_CLASSLIST_REFERENCES_$_(%rip), %rcx
    movq %rcx, %rdi
    movq -64(%rbp), %rsi ## 8-byte Reload
    movq %rax, -80(%rbp) ## 8-byte Spill
    callq *l_objc_msgSend_fixup_alloc(%rip)

    Load the value at rip + L_OBJC_CLASSLIST_REFERENCES_$_ into rcx, copy rcx into rdi, reload the address of l_objc_msgSend_fixup_alloc from the stack into rsi, spill rax (the autorelease pool handle) to the stack, and finally, make a subroutine call to l_objc_msgSend_fixup_alloc.

    L_OBJC_CLASSLIST_REFERENCES_$_ is the symbol for the MyClass class object. The load into rcx and then the immediate copy to rdi is once again a problem of lack of data-flow analysis; the compiler blindly picks the first available register to load the value into, then stores it in the first integer parameter register from there.

    What rules cause it to consider rcx the first available register? rax is still in use as a return value until the next couple of instructions, and rbx isn’t considered because its value is preserved across function calls, making it a very un-preferred register for use.

    So far, the MyClass class object is parameter 1. The reload from the stack pulls the pointer to l_objc_msgSend_fixup_alloc into argument 2. The spill of rax saves the autorelease pool handle, since rax will be clobbered by the subroutine return. And l_objc_msgSend_fixup_alloc is a vtable call; the address of the real alloc method will be “fixed up” at runtime for optimization purposes.

    This sequence therefore amounts to an optimized Objective-C message send. Recall that every Objective-C method takes two hidden arguments, self and _cmd. In this case, self is [MyClass class], and _cmd is alloc (or more exactly, a vtable pointer to a common alloc method for all classes). A very similar sequence follows.

  • movq L_OBJC_SELECTOR_REFERENCES_27(%rip), %rsi
    movq %rax, %rdi
    movq -56(%rbp), %rdx ## 8-byte Reload
    movl -68(%rbp), %ecx ## 4-byte Reload
    callq _objc_msgSend

    Load the value at rip + L_OBJC_SELECTOR_REFERENCES_27 into rsi, copy rax to rdi, reload @"name" into rdx, reload 42 into ecx and subroutine-call to objc_msgSend.

    L_OBJC_SELECTOR_REFERENCES_27 is the selector for [MyClass initWithName:number:], placed into rsi, or argument 2. rax holds the return value of alloc, which is the new MyClass object, and it’s copied into argument 1. The third parameter, rdx, is loaded with the constant NSString @"name", and the fourth parameter with the number 42. Finally, objc_msgSend() is called. This is the call sequence for sending initWithName:@"name" number:42 to the object returned by alloc. The init method will return the value of self in rax.

  • movq %rax, -24(%rbp) and movq -24(%rbp), %rax - Yes, that’s right, these two instructions are entirely redundant. Because -24(%rbp) is used later, it’s good for the value to be saved. Unfortunately, the immediate reload back into rax is not justified.

  • movq L_OBJC_SELECTOR_REFERENCES_28(%rip), %rsi
    movq %rax, %rdi
    callq _objc_msgSend

    Hopefully, you’ve got the hang of this by now; this is objc_msgSend(rax, @selector(name));. Return value in rax as usual.

  • movq %rax, %rdi and callq _objc_retainAutoreleasedReturnValue should be obvious now. objc_retainAutoreleasedReturnValue(obj); is inserted by ARC to keep the return value of the name method alive, since the temporary variable created invisibly by the Objective-C compiler to hold the value is implicitly declared __strong.

  • movq %rax, %rdi
    movq %rax, -88(%rbp) ## 8-byte Spill
    callq _MyFunction
    movq %rax, %rdi
    callq _objc_retainAutoreleasedReturnValue

    Copy the return value of name into rdi as the first parameter to MyFunction(), spill it to the stack for later, call MyFunction(), and call objc_retainAutoreleasedReturnValue() on the value it returns.

  • movq %rax, -32(%rbp)
    movq -88(%rbp), %rax ## 8-byte Reload
    movq %rax, %rdi
    callq _objc_release

    Save the return value of MyFunction(). Then, reload the result of [MyClass name], and call objc_release() on it, as ARC has noticed that it’s no longer used.

  • movq -32(%rbp), %rsi
    movq -48(%rbp), %rdi ## 8-byte Reload
    movb $0, %al
    callq _NSLog

    A simple call to NSLog(), with the only odd feature being the setting of al to zero. Because NSLog() is a variadic function, the calling convention specifies that al holds the number of vector registers used when calling it. No vector registers are used, so it’s just set to zero.

  • movl $0, -4(%rbp) and movl $1, -36(%rbp) - I have to admit, I see no reason whatsoever for the compiler to toss a zero and a one onto what look rather like random parts of the stack, here or anywhere else in main(). Nothing like these values is used anywhere in the optimized version of the code. The store of zero at least gets used further down, but the store of a 1 seems entirely meaningless.

  • movq -32(%rbp), %rdx
    movq %rdx, %rdi
    callq _objc_release

    Release the return value of MyFunction() - you have set aside a sheet of paper to keep track of which values went in which offsets on the stack and in which registers, haven’t you? If not, it’d be little wonder if you were a little lost by now.

  • movq -24(%rbp), %rdx
    movq %rdx, %rdi
    callq _objc_release

    Now release obj, the object of class MyClass that we allocated before.

  • movq -80(%rbp), %rdi ## 8-byte Reload
    callq _objc_autoreleasePoolPop

    Reload the autorelease pool handle and pop it by calling objc_autoreleasePoolPop(). This is the code inserted by the closing brace } of the @autoreleasepool statement.

  • movl -4(%rbp), %eax
    addq $96, %rsp
    popq %rbp
    ret

    Load the zero on the stack into eax as main’s return value.

    Restore the stack pointer to its original position when main was called.

    Pop the original value of rbp off the stack and back into rbp.

    Pop the address of the next instruction off the stack into rip, also known as returning from a subroutine call.
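
The way callq, the prologue, and the epilogue move rsp and rbp around can be followed with a toy model in C. This is a sketch under heavy simplifying assumptions: struct machine and its functions are invented for the illustration, the "stack" is a small array rather than real memory, and only the three registers involved are modelled.

```c
#include <stdint.h>

#define STACK_WORDS 64

/* Toy machine state: just enough to follow rsp, rbp, and rip. */
struct machine {
    uint64_t rsp, rbp, rip;
    uint64_t stack[STACK_WORDS];   /* stack[i] models the 8 bytes at address i * 8 */
};

/* Start with an empty stack: rsp sits just past the highest address. */
void machine_init(struct machine *m)
{
    m->rsp = STACK_WORDS * 8;
    m->rbp = 0;
    m->rip = 0;
}

/* pushq: the stack grows downwards, so rsp decreases before the store. */
void push64(struct machine *m, uint64_t value)
{
    m->rsp -= 8;
    m->stack[m->rsp / 8] = value;
}

/* popq: load from the top of the stack, then shrink it. */
uint64_t pop64(struct machine *m)
{
    uint64_t value = m->stack[m->rsp / 8];
    m->rsp += 8;
    return value;
}

/* callq: push the address of the next instruction, then branch. */
void call(struct machine *m, uint64_t return_address, uint64_t target)
{
    push64(m, return_address);
    m->rip = target;
}

/* pushq %rbp; movq %rsp, %rbp; subq $frame_size, %rsp */
void prologue(struct machine *m, uint64_t frame_size)
{
    push64(m, m->rbp);
    m->rbp = m->rsp;
    m->rsp -= frame_size;
}

/* addq $frame_size, %rsp; popq %rbp; ret */
void epilogue_and_ret(struct machine *m, uint64_t frame_size)
{
    m->rsp += frame_size;
    m->rbp = pop64(m);
    m->rip = pop64(m);
}
```

Running call, then prologue with a frame size of 96, then epilogue_and_ret with the same size leaves rsp and rbp exactly where they started and rip at the return address, which is the whole point of the matching prologue and epilogue in main.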

And that’s main()! What a long-winded mess.
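
For anyone who did not keep that sheet of paper, the stack slots used by main can be collected from the listing into a table. The sketch below records them as a C array and checks that they all fit in the 96-byte frame without overlapping. The struct and function names are invented for the illustration, and the descriptions are my reading of the walkthrough above rather than anything the compiler emits.

```c
#include <stddef.h>

/* One stack slot: its offset from rbp, its size in bytes, and its contents. */
struct slot {
    int offset;
    int size;
    const char *contents;
};

static const struct slot main_frame[] = {
    {  -4, 4, "main's return value (zero)" },
    {  -8, 4, "argc" },
    { -16, 8, "argv" },
    { -24, 8, "obj" },
    { -32, 8, "string, the result of MyFunction()" },
    { -36, 4, "the unexplained constant 1" },
    { -48, 8, "spill: the format string passed to NSLog" },
    { -56, 8, "spill: the constant string \"name\"" },
    { -64, 8, "spill: address of l_objc_msgSend_fixup_alloc" },
    { -68, 4, "spill: the number 42" },
    { -80, 8, "spill: autorelease pool handle" },
    { -88, 8, "spill: result of the name method" },
};

size_t main_frame_count(void)
{
    return sizeof main_frame / sizeof main_frame[0];
}

/* Returns 1 if every slot lies within frame_size bytes below rbp and no two
   slots share a byte, 0 otherwise. */
int frame_is_consistent(int frame_size)
{
    size_t n = main_frame_count();

    for (size_t i = 0; i < n; i++) {
        int lo_i = main_frame[i].offset;
        int hi_i = lo_i + main_frame[i].size;

        if (lo_i < -frame_size || hi_i > 0)
            return 0;                       /* slot falls outside the frame */

        for (size_t j = i + 1; j < n; j++) {
            int lo_j = main_frame[j].offset;
            int hi_j = lo_j + main_frame[j].size;

            if (lo_i < hi_j && lo_j < hi_i)
                return 0;                   /* two slots overlap */
        }
    }
    return 1;
}
```

With the 96 bytes reserved by subq $96, %rsp the check passes; the lowest slot sits at -88(%rbp), so a frame of only 80 bytes would not be enough.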

I must admit at this point that I went out of my way to make this function difficult to understand in one critical respect: I’ve been working from the unoptimized version of the code generated by the compiler. The code built with -Os is, surprisingly, much easier to understand, with a lot of redundant work completely eliminated and the registers managed much more efficiently. There’s also almost no work done on the stack, since the compiler in optimizing mode is free to make use of a larger pool of scratch registers.

I did this because until you can understand the control flow of an unoptimized routine, there’s no point in reading optimized code. Starting with the optimized code is a bit like learning to swim in water so shallow you can’t even put your face under, except for those times when the compiler does something fantastically tricky to get a speed or size bonus, when it suddenly becomes rather like diving into the deep end of an Olympic pool.

Conclusion
That’s the end of this article, but it’s only part 1 in a series. Hopefully, you’ve enjoyed it so far; in part 2 I’ll explore the rest of the methods in the sample code, as well as the optimized version of the code and the C runtime’s start function.