Disassembling the Assembly, Part 3: ARM Edition

Chinese translation of Mike Ash's Friday Q&A 2011-12-30: Disassembling the Assembly, Part 3: ARM edition

Posted by TommyWu

Original: https://www.mikeash.com/pyblog/friday-qa-2011-12-30-disassembling-the-assembly-part-3-arm-edition.html · Published: 2011-12-30 · Author: Mike Ash · Translator: MiMo (mimo-v2.5-pro); code blocks are kept as in the English original


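Gwynne finishes off her series on analyzing assembly code with a look at ARM assembly, for all of your iOS needs. Gwynne will also be contributing the occasional article in the future as a guest author, without my introductions. Watch the Author field at the top of the post to see who's writing what. Without further ado, let's take a look at ARM.

Since I wrote part 1 of this series on reading x86_64 assembly language, I've gotten several requests for a version which talks about reading ARM assembly language, as ARM is the architecture used by devices running iOS. Unfortunately, at the time of those requests, I didn't know much of anything about ARM! So rather than disappoint, I embarked on a crash course in the instruction set.

Fortunately, it turned out that ARM isn't all that complicated; it's more like learning a new dialect of a language you already know than learning an entirely new language. For sanity's sake, in this article I will assume you've already read both part 1 and part 2, and explain only the differences that appear in ARM rather than reiterating all the basic concepts.

I have most likely made at least one mistake in my understanding of ARM, and I gladly invite any explanations and corrections that people may find.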

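CPU and ABI

Unfortunately, the ARM specifications are a little harder to get your hands on than x86_64's; you have to go to their website and register for an account before you can download the PDFs. On the bright side, it's not a paywall and the registration process is fairly simple. If you go to the trouble, look for the documents "ARMv7-AR Architecture Reference Manual (Issue C)" under "ARM Architecture", and "Base Platform ABI for the ARM Architecture" and "Procedure Call Standard for the ARM Architecture" under "ARM software development tools". Apple's documentation on the ARM ABI used for iOS is public, and gives nearly all information one would typically need regarding the ABI.

I found it a little bemusing that the primary documentation for ARM is split across several dozen documents, while that for x86_64 is in six PDFs total. There are plenty of documents under the x86_64 umbrella as a whole, but you need only two (maximum three if you do SIMD work) of them to do applications programming, and those are very clearly labeled. At least three documents are needed just to get the same amount of data for ARM, and they're considerably harder to find unless you already know a number of details about the platform and high-level language you'll be working with.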

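The many flavors of ARM

Before I can talk about the particulars of how ARM does what it does, I have to explain something about the architecture: there are at least a dozen flavors of it. Unlike x86, ARM was designed from its very beginning for use in a variety of environments, all with different requirements for speed, power consumption, and efficiency. It has also gone through a number of major revisions, and has a long list of optional features. In this article I will focus on the particular flavor of ARM used by the modern iPhone, iPad, and iPod Touch: the ARMv7 architecture, Application profile, with Thumb 2 and NEON SIMD.

What does this mean? It means I'll be working from version 7 of the ARM instruction set (as of this writing there are 8 revisions, but ARMv8 is not yet implemented in any shipping processors). I'll focus on the Application profile, which is intended for typical operating systems, rather than the Real-time profile (intended for small - you guessed it! - realtime systems) or the Microcontroller profile (intended for embedded systems). ARMv6, used in first- and second-generation iDevices save the iPad, is largely similar, but not identical. It also means I'll use the Thumb instruction set, as it's strongly recommended by Apple for use on iOS.

Thumb

The ARM architecture specifies a secondary instruction set called Thumb. The purpose of Thumb is fairly simple: do all the common tasks with smaller instructions. While ARM instructions have a fixed size of 32 bits each, Thumb instructions (with a very few exceptions) are 16 bits each, making for much smaller code. Apple has recommended building iOS code with Thumb for most of iOS' lifetime for exactly this reason, though their toolchain is famous for creating crashing code when building ARMv6 Thumb with floating-point support.

In ARMv6, reading Thumb versus ARM assembler was annoying at best, as they didn't use the same language. With ARMv7, however, ARM developed a "Unified Assembly Language" (UAL) which covers all the operations of both ARM and Thumb instructions in a single (you guessed it!) unified set of mnemonics. This is another reason I'm sticking to ARMv7.

Note: It was also not until ARMv7 that Thumb (more specifically, Thumb 2) got proper support for floating-point extensions, which is part of the reason Apple's compilers had so much trouble earlier on. (Translator's note: ARMv8-A later added the 64-bit AArch64 state with its A64 instruction set, which has no ARM/Thumb split; the distinction survives only in 32-bit AArch32 code.)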

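A particular quirk of Thumb is how you tell the CPU which instruction set you're running at any given moment. This is solved by a set of "interworking" branch instructions; when a branch is taken by one of these instructions to an address stored in a register, it interprets the least significant (last) bit of the address as a flag telling the CPU which mode to switch to. When a branch is taken to a hardcoded address, however, the instruction set is simply flipped. Therefore, a blx to a hardcoded label (bx itself only accepts a register operand) will branch to Thumb if the current ISA is ARM, and to ARM if the current ISA is Thumb, whereas a bx r5 instruction will set the current ISA based on the low bit of r5 (a 1 bit branches to Thumb, a 0 bit to ARM).

Keep in mind, however, that all this fancy interworking stuff only happens if the program asks it to. The normal b and bl instructions don't do any of this. Whether or not you change instruction sets during any branch is up to you!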
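The low-bit rule for register branches is easy to model. The following is a minimal sketch in Python rather than ARM code; the helper name interworking_target is hypothetical, and it only covers the register form (bx/blx with a register operand), not branches to hardcoded addresses.

```python
# Hypothetical helper: models how an interworking branch to a register
# (bx rN / blx rN) chooses the instruction set from bit 0 of the target.
def interworking_target(reg_value):
    """Return (instruction_set, branch_address) for a branch to reg_value."""
    if reg_value & 1:
        # Low bit set: execute the target as Thumb code.
        # The flag bit is not part of the address, so it is cleared.
        return ("Thumb", reg_value & ~1)
    # Low bit clear: execute the target as ARM code.
    return ("ARM", reg_value)


print(interworking_target(0x3001))  # ('Thumb', 12288)
print(interworking_target(0x3000))  # ('ARM', 12288)
```

This is also why function pointers to Thumb code carry an odd address: the set low bit is the mode flag, not part of the location.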

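Don't worry if this doesn't make too much sense; thanks to UAL, the meaning of the instructions is the same between Thumb and ARM in almost all cases, so knowing how to read the assembly language is most of what you need. It is important to remember only that most Thumb instructions can only access a limited set of registers, and take much smaller ranges of immediate values.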

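Registers
ARM specifies sixteen general-purpose registers and one status register for application use. NEON further defines sixteen 128-bit vector registers which overlap with the set of thirty-two double-precision floating-point registers used by the earlier VFP specification (these in turn overlap with the set of thirty-two single-precision registers). Registers with multiple names can be referred to by either name with the same result, though by convention a name (such as lr) is preferred over a number (such as r14) except in specific situations. Of the sixteen general-purpose registers, several are reserved for specific purposes.

  • r0-r3 - The first four GPRs are used for argument passing and return values, but are freely available for use as scratch storage within a function.

  • r4-r6, r8, r10, r11 - The next three GPRs, as well as r8, r10, and r11, are documented as being preserved across function calls, but are otherwise available for use.

  • r7 - On iOS, r7 is the frame pointer, much as rbp is in x86_64. The ARM architecture in general does not specify a use for r7; this is specific to iOS. To be exact, ARM specifies r11 as fp, but Apple chose not to use that register on iOS, and avoided the fp name so as not to make their assembler incompatible with other ARM implementations.

  • r9 - On iOS 2.x, r9 is a special register used by the OS for unspecified purposes and must not be modified. On iOS 3.0 and above, r9 is free for use and does not need to be preserved across function calls.

  • r12, ip - r12 is the "intra-procedure scratch register". Between calls, it has the same semantics as r9 and does not have to be preserved. It is also called ip (not to be confused with x86_64's rip register - they are not similar), and is used as such for computing destination addresses for long branches.

  • r13, sp - r13, also called sp, is the stack pointer. This serves the same purpose as rsp on x86_64 and works in much the same way.

  • r14, lr - The link register is loaded by the bl and blx instructions, which make subroutine calls (much as call does on x86_64). ARM dedicates a register to storing the return address for a call rather than relying on the stack to hold it. This means that a subroutine can, at least in theory, operate without ever touching the stack at all, an important win in real-time and embedded systems.

  • r15, pc - The program counter holds the address of the next instruction to be executed. It is the counterpart to x86_64's rip register. Some non-branch instructions can access pc directly, but it is very strongly discouraged except in very specific situations (for example, as we'll see below, it is a common technique to push lr to the stack in a prologue and pop the value into pc in an epilogue, which results in a return from subroutine).

  • q0-q15 - A set of sixteen 128-bit vector registers which are used for parameter passing, result values, and scratch computation. Of these, q4-q7 are preserved across function calls.

  • d0-d31 - A set of thirty-two 64-bit double-precision floating-point registers. These are mapped onto the corresponding vector registers such that d0 and d1 are the low and high 64 bits of q0, d2 and d3 are the low and high 64 bits of q1, and so on.

  • s0-s31 - A set of thirty-two 32-bit single-precision floating-point registers. These are mapped onto the corresponding double-precision registers such that s0 and s1 are the low and high 32 bits of d0, and so forth (s0-s31 therefore cover d0-d15). This also implies that s0-s3 are the four 32-bit components of q0.

  • cpsr - The CPU's status register, partially equivalent to rflags in x86_64. It has the following bits, most of which are not preserved across function calls:

  • N - Negative flag, i.e. the sign bit of the result of a computation.

  • Z - Zero flag, i.e. whether a result is equal to zero.

  • C - Carry flag, i.e. whether an addition carried or a subtraction borrowed.

  • V - Overflow flag, i.e. whether a computation result overflowed its destination.

  • Q - Saturation flag, i.e. whether a computation resulted in saturation; this is used by instructions that saturate on overflow (i.e. clamp to the largest or smallest representable value rather than wrapping around the integer range).

  • GE - The Greater than or Equal flags, used by parallel arithmetic instructions to represent the results from several additions or subtractions at once.

  • E - Endianness. The ARM architecture supports switching the endianness mode of the CPU at runtime, but iOS specifies that this bit must always remain zero for little-endian mode.

  • T - Thumb. This is the status flag which determines whether Thumb or ARM code is being executed. It cannot be modified directly and the preservation of its value is context-dependent. (There is also a J flag for Jazelle mode, but iOS does not implement or use it.)

  • All other bits are system-level and can only be accessed by privileged code.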
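To make the N, Z, C and V descriptions concrete, here is a minimal sketch in Python (not ARM code) of how a flag-setting 32-bit addition would derive them. The name add_with_flags is hypothetical, and the model ignores every other cpsr bit.

```python
MASK32 = 0xFFFFFFFF


def add_with_flags(a, b):
    """Add two 32-bit values and return (result, flags) with N, Z, C, V."""
    a &= MASK32
    b &= MASK32
    full = a + b                      # unbounded sum, may need 33 bits
    result = full & MASK32            # what actually lands in the register
    n = result >> 31                  # N: sign bit of the result
    z = int(result == 0)              # Z: result is zero
    c = int(full > MASK32)            # C: unsigned carry out of bit 31
    # V: signed overflow, i.e. both operands have a sign the result lacks
    v = int(((a ^ result) & (b ^ result) & 0x80000000) != 0)
    return result, {"N": n, "Z": z, "C": c, "V": v}


# 0x7FFFFFFF + 1 overflows the signed range but produces no carry.
print(add_with_flags(0x7FFFFFFF, 1))
# 0xFFFFFFFF + 1 wraps to zero with a carry but no signed overflow.
print(add_with_flags(0xFFFFFFFF, 1))
```

The two calls show why C and V are separate flags: the first is a signed overflow only, the second an unsigned carry only.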

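Calling conventions
Actually, I don't have to say much here; it's all in the registers commentary. r0-r3 are used for integer argument passing and returns, s/d/q0-qN are used for floating-point argument passing and returns, and anything that doesn't fit in the registers is spilled to the stack. As usual, this is a bit simplified, but it works!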
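The register-then-stack rule for integer arguments can be sketched as follows. This is an illustrative Python model, not an ABI implementation: the function name assign_integer_args is hypothetical, and it assumes every argument is a 32-bit integer or pointer, which sidesteps the rules for 64-bit values, structs, and floating point.

```python
def assign_integer_args(args):
    """Place 32-bit integer/pointer arguments: first four in r0-r3, rest on the stack."""
    registers = {f"r{i}": value for i, value in enumerate(args[:4])}
    # Remaining arguments are spilled; the list is ordered from the
    # lowest stack address upward (an assumption of this sketch).
    stack = list(args[4:])
    return registers, stack


# Six word-sized arguments: four travel in registers, two on the stack.
regs, stack = assign_integer_args([10, 20, 30, 42, 50, 60])
print(regs)   # {'r0': 10, 'r1': 20, 'r2': 30, 'r3': 42}
print(stack)  # [50, 60]
```

An Objective-C message send with two explicit arguments fits exactly: receiver in r0, selector in r1, and the two arguments in r2 and r3.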

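Oh, and by the way, the memory model for ARM is, for all intents and purposes, the same as that for x86_64, save for the smaller memory space.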

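Disassembling, at last
Enough architecture talk, let's look at some assembler. I'm using the same code again as in parts 1 and 2, and the following command for compilation: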

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/clang -S test.m -fobjc-arc -arch armv7 -miphoneos-version-min=5.0 -isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.0.sdk -mthumb -Os

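That is: use the iOS SDK's Clang, show assembler output, compile with ARC, compile for ARMv7, require iOS 5 (needed to link properly), build against the iOS SDK (needed for ARM compilation), use Thumb, and optimize for size. And here's the result for the main function: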

.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!
blx _objc_autoreleasePoolPush
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
mov r8, r0
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movs r3, #42
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
blx _objc_msgSend
mov r5, r0
movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0
mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0
mov r0, r4
blx _objc_release
movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
mov r1, r6
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
blx _NSLog
mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0
movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
vcvt.f64.f32 d16, s0
LPC8_6:
add r0, pc
vmov r1, r2, d16
blx _NSLog
mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop
movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}

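Whew! For a function built with an instruction set specifically designed to be small, this is pretty long: seventy-two lines. In truth, the compiled machine code is quite small; only the assembler that produces that code is verbose, which is typical of RISC architectures.

A few quick notes about ARM assembler syntax:

  • The operand order in ARM instructions is the reverse of the "GAS" (GNU assembler) syntax used for x86_64. The typical instruction format is "mnemonic destination, operand1, operand2". Ironically, this is closer to the original Intel assembler syntax.

  • Immediate operands are marked with # rather than $.

  • Register names need no % prefix.

Let's see what we can read out of main: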

.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!

.thumb_func is an assembler directive that tells the assembler to use the Thumb ISA for the named function, _main. Next comes the function's symbol. ARM's push instruction takes a whole list of registers and saves them to the stack in one go, so main saves r4 through r6 (to use as scratch), r7 (the old frame pointer), and lr (the return address into the calling function). lr has to be saved because any subroutine call main makes will overwrite it; it is not preserved across function calls. Beyond that, iOS (like any other implementation that uses a frame pointer) requires the return address to sit on the stack as part of the function's stack frame. The iOS ABI states that failing to set up a stack frame "may prevent debugging and performance tools from generating valid backtraces".

Next, 12 is added to the stack pointer and the result is stored in r7, creating the frame pointer. Note that this does not modify sp itself! After the push, sp + 12 is the slot holding the saved r7, with the saved lr directly above it.

The last instruction is a clever one: it subtracts 4 from sp, writes the new value back to sp, and stores r8 at the address sp now points to. The net effect is a push of r8. r8 cannot ride along in the compact push above, because the 16-bit Thumb push encoding can only name r0-r7 and lr. Saving a high register such as r8 takes a 32-bit Thumb-2 instruction either way, and this str form is in fact how a single-register push of r8 is encoded.
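The effect of the three prologue instructions on the stack can be traced with a small model. This is a sketch in Python under simplifying assumptions: memory is a dictionary keyed by address, registers are a dictionary, and simulate_prologue is a hypothetical name. It relies on push storing the lowest-numbered register at the lowest address.

```python
def simulate_prologue(sp, regs):
    """Mimic: push {r4,r5,r6,r7,lr}; add r7, sp, #12; str r8, [sp, #-4]!"""
    memory = {}
    pushed = ["r4", "r5", "r6", "r7", "lr"]
    sp -= 4 * len(pushed)                 # push grows the stack downward
    for i, name in enumerate(pushed):     # lowest register at lowest address
        memory[sp + 4 * i] = regs[name]
    regs = dict(regs, r7=sp + 12)         # add r7, sp, #12 (sp itself unchanged)
    sp -= 4                               # pre-indexed store with writeback
    memory[sp] = regs["r8"]               # str r8, [sp, #-4]!
    return sp, regs, memory


start = {"r4": 4, "r5": 5, "r6": 6, "r7": 0x2000, "r8": 8, "lr": 0xBEEF}
sp, regs, memory = simulate_prologue(0x1000, start)
print(hex(sp), hex(regs["r7"]))           # 0xfe8 0xff8
print(hex(memory[regs["r7"]]))            # 0x2000 (the caller's frame pointer)
print(hex(memory[regs["r7"] + 4]))        # 0xbeef (the return address)
```

The new r7 points at the saved old r7 with the return address one word above it, which is the linked-list shape a backtrace walker follows from frame to frame.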

blx _objc_autoreleasePoolPush
mov r8, r0

Branch with link and exchange (interworking) to objc_autoreleasePoolPush. This is just a subroutine call with the option of switching to the ARM ISA. The function takes no parameters and returns its result in r0, which is then saved in r8. Yes, I know I've taken these instructions out of listing order, but the compiler separated them for reasons of its own, and presenting them in conceptual rather than assembly order makes no difference to the code flow.

movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend

Welcome to calling Objective-C selectors on ARM! What this says is: take the address of the label L_OBJC_SELECTOR_REFERENCES_25, subtract the address of the label LPC8_0 plus 4 (the +4 accounts for pc reading four bytes ahead of the executing instruction in Thumb state), then take the low 16 bits of the result and write them to the low 16 bits of r1. Next, take the high 16 bits of the same value and write them to the high 16 bits of r1. This is done in two steps because the Thumb encoding has no instruction for loading a 32-bit immediate into a register. Then pc is added to r1, forming a pc-relative address. This usage should remind you of x86_64's rip-relative addressing. The same is done for L_OBJC_CLASSLIST_REFERENCES_$_, which is loaded into r0. The two ldr instructions read the values at the addresses in the two registers into the registers themselves, the equivalent of the C code a = *a;. Finally, the code makes a branch with link and exchange to objc_msgSend, completing the method call. Congratulations, you've just learned how ARM calls [MyClass alloc]!

The vtable trick used on x86_64 does not exist on ARM, because ARM's addressing and branching modes don't make direct dispatch any more efficient than a subroutine call.
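The movw/movt/add sequence is just arithmetic on 32-bit values, so it can be checked with a short model. The sketch below is Python, not assembler; load_pc_relative is a hypothetical name, and the addresses in the example are made up for illustration. It assumes that in Thumb state pc reads as the address of the add instruction plus 4.

```python
MASK32 = 0xFFFFFFFF


def load_pc_relative(target, add_pc_address):
    """Rebuild `target` the way movw / movt / add rN, pc does in Thumb code."""
    pc_value = add_pc_address + 4             # pc reads 4 bytes ahead in Thumb
    offset = (target - pc_value) & MASK32     # the assembler-time expression
    lower16 = offset & 0xFFFF                 # :lower16: half, loaded by movw
    upper16 = offset >> 16                    # :upper16: half, loaded by movt
    reg = lower16                             # movw: low half set, top cleared
    reg = (upper16 << 16) | (reg & 0xFFFF)    # movt: top half replaced
    return (reg + pc_value) & MASK32          # add rN, pc


# Forward and backward references both come out right, because the
# offset wraps around modulo 2**32.
print(hex(load_pc_relative(0x0004A2C0, 0x00002F10)))  # 0x4a2c0
print(hex(load_pc_relative(0x00001000, 0x00002000)))  # 0x1000
```

Because only the difference between two labels is encoded, the code keeps working wherever the image is loaded, which is the point of pc-relative addressing.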

movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
movs r3, #42
blx _objc_msgSend
mov r5, r0

I've once again rearranged the instruction stream slightly to make it easier to follow. This call is really no different from the first. The result of alloc is already in r0. The selector -initWithName:number: is loaded into r1, the string @"name" into r2, and the immediate value 42 into r3. The result of the method call is saved in r5.

movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend

This is the method call [obj name]. Nothing special here.

mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0

The ARC-aware compiler inserts what is effectively a no-op (moving a register onto itself) into the instruction stream, so that certain code in the Objective-C runtime can detect the call to objc_retainAutoreleasedReturnValue, and then makes the call itself. Since the return value (in r0) matches the expected first parameter (also in r0), no registers need to be set up. The result of the function, the retained [obj name], is saved in r4.

mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0

The value from r4 is needlessly loaded back into r0! The compiler's optimizer seems to have slipped here, and I'm not sure why. In any case, the code then calls MyFunction and runs through another objc_retainAutoreleasedReturnValue sequence, saving the final result (the string) in r6.

mov r0, r4
blx _objc_release

[obj name] is no longer needed, so it is released.

movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
mov r1, r6
blx _NSLog

With the instructions once again slightly reordered, this loads @"%@" into r0, loads the string into r1, and calls NSLog. Variadic functions don't appear to have any special semantics on ARM.

mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0

The 32-bit pattern of the float 1.0, written as a decimal integer (1065353216, which is 0x3F800000), is loaded into r0, an integer register. Why? Because it fits, and integer registers are more efficient than vector or floating-point registers whenever they can be used. MyFPFunction is called, and its result (again in r0!) is moved into the floating-point register s0. Note: the .w suffix on mov means that the 32-bit Thumb encoding is used for the instruction even where a 16-bit encoding would otherwise have been chosen. It is probably unnecessary for this particular instruction, but it is considered good form to use the annotation even when the 32-bit encoding would be selected automatically.
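The odd-looking immediate is simply the IEEE 754 single-precision encoding of 1.0 read as an unsigned integer. A quick check in Python, using only the standard struct module (the helper name float_bits is hypothetical):

```python
import struct


def float_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


print(float_bits(1.0))       # 1065353216
print(hex(float_bits(1.0)))  # 0x3f800000
```

Reading float constants in a disassembly usually means running this conversion in reverse, since the assembler prints them as plain integers.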

movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
LPC8_6:
add r0, pc
vcvt.f64.f32 d16, s0
vmov r1, r2, d16
blx _NSLog

Load @"%f" into r0. Convert the single-precision value in s0 to a double-precision value in d16, following C's type promotion rules for variadic argument lists (float is promoted to double). Move the double in d16 into the two integer registers r1 and r2 (low word in r1, high word in r2), then call NSLog. When integer registers are available, passing floating-point arguments in them is generally preferred.
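The promotion and the register split can be reproduced bit for bit. This is a sketch in Python using the standard struct module; promote_and_split is a hypothetical name, and the low-word-first order reflects the little-endian layout that the two-register vmov uses.

```python
import struct


def promote_and_split(single_bits):
    """Model vcvt.f64.f32 followed by vmov r1, r2, dN.

    Takes the bit pattern of a float, promotes it to a double, and
    returns (r1, r2) = (low 32 bits, high 32 bits) of that double.
    """
    value = struct.unpack("<f", struct.pack("<I", single_bits))[0]
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return low, high


# 1.5 as a float is 0x3FC00000; as a double it is 0x3FF8000000000000.
r1, r2 = promote_and_split(0x3FC00000)
print(hex(r1), hex(r2))  # 0x0 0x3ff80000
```

NSLog then reassembles the double from the two argument words when it processes the %f format specifier.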

mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop

Release the string, release obj, pop the autorelease pool.

movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}

Load 0 as main's return value, pop r8 off the stack (remember that the compact Thumb pop cannot address r8), and restore all the saved registers. Notice that although lr was saved at the start of main, the value is now popped into pc. This effectively performs an interworking return from the subroutine (a pop into pc is explicitly documented as an interworking branch).

And that's all of main.

MyFPFunction

Let's take a quick look at floating-point operations:

.thumb_func _MyFPFunction
_MyFPFunction:
vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0
vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0
bx lr
LCPI7_0:
.long 3197737370

.thumb_func _MyFPFunction
_MyFPFunction:

Declare a Thumb function, MyFPFunction.

vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0

Load the immediate single-precision value 0.5 into s2. Load the single-precision value stored at LCPI7_0 (-0.3) into s0. Move the single-precision value held in r0 into s4.

vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0

Using single-precision math, first add d1 and d2 into d16 (remember that s0 is the low half of d0, s2 the low half of d1, and s4 the low half of d2). Then add d0 and d16 into d0. Finally, move the single-precision result back into r0.
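The whole function can be traced numerically. The sketch below is a Python model, not a disassembly of anything new: the helper names are hypothetical, only the low 32-bit lane of each d register is tracked (the instructions also add the high lanes, whose contents are ignored here), and Python doubles are rounded to single precision after each step to imitate .f32 arithmetic.

```python
import struct


def bits_to_f32(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bits(value):
    """Round value to single precision and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def round_f32(value):
    """Round a Python float (a double) to single precision."""
    return bits_to_f32(f32_to_bits(value))


def d_register_of(s_index):
    """s(2n) is the low half of d(n), s(2n+1) the high half."""
    return s_index // 2, ("low" if s_index % 2 == 0 else "high")


def my_fp_function(r0_bits):
    s2 = 0.5                          # vmov.f32 s2, #5.000000e-01
    s0 = bits_to_f32(3197737370)      # vldr.32 s0, LCPI7_0 (about -0.3)
    s4 = bits_to_f32(r0_bits)         # vmov s4, r0
    d16_low = round_f32(s4 + s2)      # vadd.f32 d16, d2, d1 (low lane)
    s0 = round_f32(d16_low + s0)      # vadd.f32 d0, d16, d0 (low lane)
    return f32_to_bits(s0)            # vmov r0, s0


print(d_register_of(4))                              # (2, 'low')
print(bits_to_f32(my_fp_function(1065353216)))       # close to 1.2
```

Called with the bit pattern of 1.0, as main does, the model yields a value within single-precision rounding of 1.2, i.e. the input plus 0.5 plus the -0.3 from the literal pool.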

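bx lr

Branch with exchange (interworking) to the link register. This function is so small and simple that, when built with optimization, it has no prologue, no epilogue, and no stack frame. That is a significant optimization for ARM code, and one that x86_64 would not benefit from in the same way. Since lr holds the return address into main, this returns us there.

And that's the whole function! Simple, wasn't it?

Conclusion
That was a whirlwind tour of ARM assembly, and the end of this series on assembly language. There is a good deal in both architectures that I haven't covered, such as conditional instructions, loops, and optional flag updates on instructions, but this is only an introduction after all. I hope you've enjoyed the series. Thanks for reading!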


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2011-12-30-disassembling-the-assembly-part-3-arm-edition.html

Gwynne finishes off her series on analyzing assembly code with a look at ARM assembly, for all of your iOS needs. Gwynne will also be contributing the occasional article in the future as a guest author, without my introductions. Watch the Author field at the top of the post to see who’s writing what. Without further ado, let’s take a look at ARM.

Since I wrote part 1 of this series on reading x86_64 assembly language, I’ve gotten several requests for a version which talks about reading ARM assembly language, as ARM is the architecture used by devices running iOS. Unfortunately, at the time of those requests, I didn’t know much of anything about ARM! So rather than disappoint, I embarked on a crash course in the instruction set.

Fortunately, it turned out that ARM isn’t all that complicated; it’s more like learning a new dialect of a language you already know than learning an entirely new language. For sanity’s sake, in this article I will assume you’ve already read both part 1 and part 2, and explain only the differences that appear in ARM rather than reiterating all the basic concepts.

I have most likely made at least one mistake in my understanding of ARM, and I gladly invite any explanations and corrections that people may find.

CPU and ABI

Unfortunately, the ARM specifications are a little harder to get your hands on than x86_64’s; you have to go to their website and register for an account before you can download the PDFs. On the bright side, it’s not a paywall and the registration process is fairly simple. If you go to the trouble, look for the documents “ARMv7-AR Architecture Reference Manual (Issue C)” under “ARM Architecture”, and “Base Platform ABI for the ARM Architecture” and “Procedure Call Standard for the ARM Architecture” under “ARM software development tools”. Apple’s documentation on the ARM ABI used for iOS is public, and gives nearly all information one would typically need regarding the ABI.

I found it a little bemusing that the primary documentation for ARM is split across several dozen documents, while that for x86_64 is in six PDFs total. There are an equal number of documents under the x86_64 umbrella as a whole, but you need only two (maximum three if you do SIMD work) of them to do applications programming, and those are very clearly labeled. At least three documents are needed just to get the same amount of data for ARM, and they’re considerably harder to find unless you already know a number of details about the platform and high-level language you’ll be working with.

The many flavors of ARM

Before I can talk about the particulars of how ARM does what it does, I have to explain something about the architecture: There are at least a dozen flavors of it. Unlike x86, ARM was designed from its very beginning for use in a variety of environments, all with different requirements for speed, power consumption, and efficiency. It has also gone through a number of major revisions, and has a long list of optional features. In this article I will focus on the particular flavor of ARM used by the modern iPhone, iPad, and iPod Touch: The ARMv7 architecture, Application profile, with Thumb 2 and NEON SIMD.

What does this mean? It means I’ll be working from version 7 of the ARM instruction set (there are currently 8 revisions, but ARMv8 is not yet implemented in any shipping processors). I’ll focus on the Application profile, which is intended for typical operating systems, rather than the Real-time profile (intended for small - you guessed it! - realtime systems) or the Microcontroller profile (intended for embedded systems). ARMv6, used in first- and second-generation iDevices save the iPad, is largely similar, but not identical. It also means I’ll use the Thumb instruction set, as it’s strongly recommended by Apple for use on iOS.

Thumb

The ARM architecture specifies a secondary instruction set called Thumb. The purpose of Thumb is fairly simple: Do all the common tasks with smaller instructions. While ARM instructions have a fixed size of 32 bits each, Thumb instructions (with a very few exceptions) are 16 bits each, making for much smaller code. Apple has recommended building iOS code with Thumb for most of iOS’ lifetime for exactly this reason, though their toolchain is famous for creating crashing code when building ARMv6 Thumb with floating-point support.

In ARMv6, reading Thumb versus ARM assembler was annoying at best, as they didn’t use the same language. With ARMv7, however, ARM developed a “Unified Assembly Language” (UAL) which covers all the operations of both ARM and Thumb instructions in a single (you guessed it!) unified set of mnemonics. This is another reason I’m sticking to ARMv7.

Note: It was also not until ARMv7 that Thumb (more specifically, Thumb 2) got proper support for floating-point extensions, which is part of the reason Apple’s compilers had so much trouble earlier on.

A particular quirk of Thumb is how you tell the CPU which instruction set you’re running at any given moment. This is solved by a set of “interworking” branch instructions; when a branch is taken by one of these instructions to an address stored in a register, it interprets the least significant (last) bit of the address as a flag telling the CPU which mode to switch to. When a branch is taken to a hardcoded address, however, the instruction set is simply flipped. Therefore, a blx to a hardcoded label (bx itself only accepts a register operand) will branch to Thumb if the current ISA is ARM, and to ARM if the current ISA is Thumb, whereas a bx r5 instruction will set the current ISA based on the low bit of r5 (a 1 bit branches to Thumb, a 0 bit to ARM).

Keep in mind, however, that all this fancy interworking stuff only happens if the program asks it to. The normal b and bl instructions don’t do any of this. Whether or not you change instruction sets during any branch is up to you!

Don’t worry if this doesn’t make too much sense; thanks to UAL, the meaning of the instructions is the same between Thumb and ARM in almost all cases, so knowing how to read the assembly language is most of what you need. It is important to remember only that most Thumb instructions can only access a limited set of registers, and take much smaller ranges of immediate values.

Registers

ARM specifies sixteen general-purpose registers and one status register for application use. NEON further defines sixteen 128-bit vector registers which overlap with the set of thirty-two double precision floating-point registers used by the earlier VFP specification (these in turn overlap with the set of thirty-two single precision registers). Registers with multiple names can be referred to by either name with the same result, though by convention it is always preferred to use a name (such as lr) over a number (such as r14) except in specific situations. Of the sixteen general-purpose registers, several are reserved for specific purposes.

  • r0-r3 - The first four GPRs are used for argument passing and return values, but are freely available for use as scratch storage within a function.

  • r4-r6, r8, r10, r11 - The next three GPRs, as well as r8, r10, and r11 are documented as being preserved across function calls, but are also otherwise available for use.

  • r7 - On iOS, r7 is the frame pointer, much as rbp is in x86_64. The ARM architecture in general does not specify a use for r7; this is specific to iOS. To be exact, ARM specifies r11 as fp, but Apple chose not to use that register on iOS, and avoided the fp name so as not to make their assembler incompatible with other ARM implementations.

  • r9 - On iOS 2.x, r9 is a special register used by the OS for unspecified purposes and must not be modified. On iOS 3.0 and above, r9 is free for use and does not need to be preserved across function calls.

  • r12, ip - r12 is the “intra-procedure scratch register”. Between calls, it has the same semantics as r9 and does not have to be preserved. It is also called ip (not to be confused with x86_64’s rip register - they are not similar), and is used as such for computing destination addresses for long branches.

  • r13, sp - r13, also called sp, is the stack pointer. This serves the same purpose as rsp on x86_64 and works in much the same way.

  • r14, lr - The link register is loaded by the bl and blx instructions, which make subroutine calls (much as call does on x86_64). ARM dedicates a register to storing the return address for a call rather than relying on the stack to hold it. This means that a subroutine can, at least in theory, operate without ever touching the stack at all, an important win in real-time and embedded systems.

  • r15, pc - The program counter holds the address of the next instruction to be executed. It is the counterpart to x86_64’s rip register. Some non-branch instructions can access pc directly, but it is very strongly discouraged except in very specific situations (for example, as we’ll see below, it is a common technique to push lr to the stack in a prologue and pop the value into pc in an epilogue, which results in a return from subroutine).

  • q0-q15 - A set of sixteen 128-bit vector registers which are used for parameter passing, result values, and scratch computation. Of these, q4-q7 are preserved across function calls.

  • d0-d31 - A set of thirty-two 64-bit double-precision floating-point registers. These are mapped onto the corresponding vector registers such that d0 and d1 are the low and high 64 bits of q0, d2 and d3 are the low and high 64 bits of q1, and so on.

  • s0-s31 - A set of thirty-two 32-bit single-precision floating-point registers. These are mapped onto the corresponding double-precision registers such that s0 and s1 are the low and high bits of d0, and so forth. This also implies that s0-s3 are the four 32-bit components of q0.

  • cpsr - The CPU’s status register, partially equivalent to rflags in x86_64. It has the following bits, most of which are not preserved across function calls:

  • N - Negative flag, i.e. the sign bit of the result of a computation.

  • Z - Zero flag, i.e. whether a result is equal to zero.

  • C - Carry flag, i.e. whether an addition carried or a subtraction borrowed.

  • V - Overflow flag, i.e. whether a computation result overflowed its destination.

  • Q - Saturation flag, i.e. whether a computation resulted in saturation; this is used by instructions that saturate on overflow (i.e. clamp to the largest or smallest representable value rather than wrapping around the integer range).

  • GE - The Greater than or Equal flags, used by parallel arithmetic instructions to represent the results from several additions or subtractions at once.

  • E - Endianness. The ARM architecture supports switching the endianness mode of the CPU at runtime, but iOS specifies that this bit must always remain zero for little-endian mode.

  • T - Thumb. This is the status flag which determines whether Thumb or ARM code is being executed. It cannot be modified directly and the preservation of its value is context-dependent. (There is also a J flag for Jazelle mode, but iOS does not implement or use it.)

  • All other bits are system-level and can only be accessed by privileged code.

Calling conventions

Actually, I don’t have to say much here; it’s all in the registers commentary. r0-r3 are used for integer argument passing and returns, s/d/q0-qN are used for floating-point argument passing and returns, and anything that doesn’t fit in the registers is spilled to the stack. As usual, this is a bit simplified, but it works!

Oh, and by the way, the memory model for ARM is, for all intents and purposes, the same as that for x86_64, save for the smaller memory space.

Disassembling, at last

Enough architecture talk, let’s look at some assembler. I’m using the same code again as in parts 1 and 2, and the following command for compilation:

/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/clang -S test.m -fobjc-arc -arch armv7 -miphoneos-version-min=5.0 -isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.0.sdk -mthumb -Os

That’s use the iOS SDK’s Clang, show assembler output, compile with ARC, compile for ARMv7, require iOS 5 (required to link properly), build against the iOS SDK (required for ARM compilation), use Thumb, and optimize for size. And here’s the result for the main function:

.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!
blx _objc_autoreleasePoolPush
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
mov r8, r0
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movs r3, #42
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
blx _objc_msgSend
mov r5, r0
movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0
mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0
mov r0, r4
blx _objc_release
movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
mov r1, r6
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
blx _NSLog
mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0
movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
vcvt.f64.f32 d16, s0
LPC8_6:
add r0, pc
vmov r1, r2, d16
blx _NSLog
mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop
movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}

Whew! For a function built with an instruction set specifically designed to be small, this is pretty long: seventy-two lines. In truth, the compiled machine code is quite small; it is only the assembly listing that is verbose, which is typical of RISC architectures.

Some quick notes about ARM assembler syntax:

  • The order of operands in an ARM instruction is reversed from the “GAS” (GNU ASsembler) syntax used by the x86_64 assembler. A typical instruction is “mnemonic destination, operand1, operand2”. This is ironically closer to the original Intel assembler syntax.

  • Immediate operands are delimited with # rather than $.

  • Register names are not prefixed by %.

Let’s see what we can make of main:

.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!

.thumb_func is an assembler directive that tells clang to use the Thumb ISA for the function named. Then we have the symbol for the function. ARM’s push instruction can take an entire list of registers at once to save to the stack, so main is saving r4 through r6 so it can use them as scratch, r7 (the old frame pointer), and lr (the return address of the calling function). lr must be saved because any subroutine calls main makes will overwrite the value - it is not preserved across function calls. Not only that, but iOS (and any other implementation with a frame pointer) requires the return address to be on the stack as part of the stack frame for the function. The iOS ABI says that failure to set up a stack frame “can prevent debugging and performance tools from generating valid backtraces.” Next, we add 12 to the stack pointer and store the result in r7 to create the frame pointer. Notice that this does not modify sp itself! The last instruction is rather tricky: it subtracts 4 from sp, saves the new value of sp, and writes the value in r8 to memory at sp. The net result of this is to push r8 to the stack. The compiler doesn’t fold r8 into the first push because the 16-bit Thumb encoding of push can only name r0 through r7 and lr; saving r8 needs a 32-bit instruction of its own.
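The prologue can be traced with a small model of the registers and the stack. This is only a sketch: the starting register values are invented, memory is a dictionary keyed by address, and run_prologue is a hypothetical helper, not anything from the toolchain.

```python
def run_prologue(regs):
    """Simulate push {r4-r7, lr}; add r7, sp, #12; str r8, [sp, #-4]!"""
    regs = dict(regs)
    memory = {}
    # push drops sp by 4 bytes per register, then stores the registers in
    # ascending order with the lowest-numbered one at the lowest address.
    pushed = ["r4", "r5", "r6", "r7", "lr"]
    regs["sp"] -= 4 * len(pushed)
    for slot, name in enumerate(pushed):
        memory[regs["sp"] + 4 * slot] = regs[name]
    # add r7, sp, #12: r7 now points at the saved r7; sp is unchanged.
    regs["r7"] = regs["sp"] + 12
    # str r8, [sp, #-4]!: pre-indexed store with writeback to sp.
    regs["sp"] -= 4
    memory[regs["sp"]] = regs["r8"]
    return regs, memory

start = {"r4": 4, "r5": 5, "r6": 6, "r7": 0x2000, "r8": 8,
         "lr": 0x3001, "sp": 0x1000}
regs, memory = run_prologue(start)
print(hex(regs["sp"]), hex(regs["r7"]))
print(hex(memory[regs["r7"]]), hex(memory[regs["r7"] + 4]))
```

In this model the new r7 points at the saved old r7, and the saved lr sits in the word just above it, which is the frame-pointer and return-address pair that backtrace tools walk.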

blx _objc_autoreleasePoolPush
mov r8, r0

Branch and link with interworking to objc_autoreleasePoolPush. This is just a subroutine call, with the option to switch to the ARM ISA. The function takes no parameters, and will return its result in r0. The result is saved in r8. Yes, I know I’ve read the instructions out of order here, but the compiler separated them for some reason and it makes no difference to the code flow to show them in conceptual order rather than assembler order.

movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend

Welcome to calling Objective-C selectors on ARM! This reads: Take the address of label L_OBJC_SELECTOR_REFERENCES_25, subtract the address of label LPC8_0+4, and select the low sixteen bits of the result, writing them into the bottom 16 bits of r1. Then take the upper 16 bits of the same value and write them into the upper 16 bits of r1. This is done in two steps because there is no Thumb encoding for a 32-bit immediate register load. pc is then added to r1, forming a pc-relative address. This idiom should look familiar from x86_64’s rip-relative addressing. The same is done again for L_OBJC_CLASSLIST_REFERENCES_$_, loading it into r0. The two ldr instructions read values from the addresses in the two registers into the registers themselves, equivalent to the C code a = *a;. Finally, the code does a branch and link with interworking instruction to objc_msgSend, completing the method call. Congratulations, you’ve just learned how ARM calls [MyClass alloc]! The vtable tricks done in x86_64 don’t exist for ARM, as ARM’s addressing and branching modes don’t make direct dispatch more efficient than a subroutine call.
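The reason for the +4 in the label expression is that, in Thumb code, reading pc yields the address of the current instruction plus 4. A minimal sketch of the arithmetic follows; the addresses are invented and load_pc_relative is a hypothetical helper.

```python
def load_pc_relative(target, add_address):
    """Model movw/movt of (target - (label + 4)) followed by add rX, pc.

    add_address is the address of the add instruction, i.e. the LPC label.
    """
    offset = (target - (add_address + 4)) & 0xFFFFFFFF
    reg = offset & 0xFFFF                          # movw: low half, upper half zeroed
    reg = (reg & 0xFFFF) | ((offset >> 16) << 16)  # movt: replace the upper half
    pc = add_address + 4                           # Thumb: pc reads 4 bytes ahead
    return (reg + pc) & 0xFFFFFFFF

# The offset and the pc bias cancel, leaving the absolute target address
# regardless of where the code was loaded.
print(hex(load_pc_relative(0x34568, 0x2F1C)))
```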

movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
movs r3, #42
blx _objc_msgSend
mov r5, r0

Once again I’ve slightly rearranged the instruction stream to make more sense. This call is really no different from the first. The result from alloc is already in r0. The -initWithName:number: selector is loaded into r1, the string @"name" into r2, and the immediate value 42 into r3. The result of the method call is saved in r5.

movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend

This is the [obj name] method call. Nothing special here.

mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0

The ARC-aware compiler inserts an effective nop (move a register to itself) into the instruction stream so certain code in the Objective-C runtime can detect the call to objc_retainAutoreleasedReturnValue, then actually makes the call. No registers need to be set up, as the return value in r0 matches the intended first parameter also in r0. The result of the function, i.e. the retained [obj name], is saved in r4.
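For the curious, the marker has a fixed 16-bit encoding that is easy to compute by hand. The sketch below encodes the 16-bit Thumb register-to-register mov; encode_thumb_mov is a hypothetical helper, and the idea that a scanner would compare against this exact halfword is my assumption rather than something the text above states.

```python
def encode_thumb_mov(rd, rm):
    """Encode the 16-bit Thumb `mov rd, rm` (register-to-register) form.

    Layout: 0100 0110 D mmmm ddd, where D is the top bit of rd.
    """
    assert 0 <= rd <= 15 and 0 <= rm <= 15
    return 0x4600 | ((rd >> 3) << 7) | (rm << 3) | (rd & 0x7)

print(hex(encode_thumb_mov(7, 7)))  # the `mov r7, r7` marker
print(hex(encode_thumb_mov(8, 0)))  # `mov r8, r0` from earlier in main
```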

mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0

The value from r4 is reloaded into r0, unnecessarily! The compiler seems to have stumbled in optimizing here, and I’m not sure why. In any case, the code then calls MyFunction and runs another objc_retainAutoreleasedReturnValue stanza, saving the final result (i.e. string) in r6.

mov r0, r4
blx _objc_release

[obj name] is no longer needed, so release it.

movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
mov r1, r6
blx _NSLog

Again having rearranged the instructions slightly, this is a load of @"%@" into r0 and string into r1, and a call to NSLog. Variadic functions appear to have no special semantics on ARM.

mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0

The bit pattern of the single-precision value 1.0, written as a 32-bit decimal integer (0x3F800000 in hex), is loaded into r0, an integer register. Why? Because it fits and integer registers are more efficient than vector or floating-point registers when possible. MyFPFunction is called, and the result (again in r0!) loaded into the s0 floating-point register. Note: The .w suffix on the mov instruction means to use the 32-bit Thumb encoding for the instruction even if the 16-bit encoding would otherwise be selected. For this particular instruction, it is probably unnecessary, but it’s considered good form to use the annotation even if the 32-bit encoding would be chosen automatically.
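The constant in the mov.w is easy to check: reinterpret the bytes of a single-precision 1.0 as an unsigned 32-bit integer. This is a small sketch using the standard struct module; float_bits is a name chosen for illustration.

```python
import struct

def float_bits(value):
    """Return the IEEE 754 single-precision bit pattern of value as an int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

print(float_bits(1.0), hex(float_bits(1.0)))
```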

movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
LPC8_6:
add r0, pc
vcvt.f64.f32 d16, s0
vmov r1, r2, d16
blx _NSLog

Load @"%f" into r0. Convert the single-precision floating-point value in s0 to a double-precision floating-point value in d16, per the C type promotion rules for variadic function parameter lists (float is converted to double). Load the double-precision value in d16 into the two integer registers r1 and r2, and call NSLog. It is always preferred to pass floating-point arguments in integer registers if they are available.
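Here is a sketch of what the vcvt and the two-register vmov do to the bits: widen the float to a double, then split the 64-bit pattern into a low word (which lands in r1) and a high word (r2). The helper name is invented, and the input is passed as a raw 32-bit pattern to mirror what sits in s0.

```python
import struct

def promote_and_split(single_bits):
    """Widen a float bit pattern to double and split it into two words."""
    # Reinterpret the 32-bit pattern as a single-precision value (vmov s0, r0).
    value = struct.unpack("<f", struct.pack("<I", single_bits))[0]
    # Pack as a double (vcvt.f64.f32), then read back the low and high
    # 32-bit halves in little-endian order (vmov r1, r2, d16).
    low, high = struct.unpack("<II", struct.pack("<d", value))
    return low, high

r1, r2 = promote_and_split(0x3F800000)  # 1.0 as a float
print(hex(r1), hex(r2))
```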

mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop

Release string, release obj, pop the autorelease pool.

movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}

Load 0 as the return value of main, pop r8 off the stack (remember, r8 cannot be efficiently addressed by a pop instruction in Thumb), and restore all the saved registers. Notice that while we saved lr at the beginning of main, we now pop that value into pc. This effectively executes an interworking return from subroutine. (pop into pc is explicitly documented as an interworking branch.)
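An interworking branch uses the lowest bit of the destination address to choose the instruction set, which is why a return address saved by Thumb code has that bit set. A minimal model, with a made-up function name and addresses:

```python
def interworking_branch(address):
    """Model an interworking branch: bit 0 of the address selects Thumb.

    Returns (address to continue at, whether the CPU ends up in Thumb state).
    An ARM-state target that is not 4-byte aligned is not modeled here.
    """
    thumb = bool(address & 1)
    target = address & ~1 & 0xFFFFFFFF  # bit 0 is a mode flag, not an address bit
    return target, thumb

print(interworking_branch(0x3001))  # return into Thumb code at 0x3000
print(interworking_branch(0x4000))  # return into ARM code at 0x4000
```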

And that’s main.

MyFPFunction

Let’s take a quick look at the floating-point operations:

.thumb_func _MyFPFunction
_MyFPFunction:
vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0
vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0
bx lr
LCPI7_0:
.long 3197737370

The .thumb_func directive and the _MyFPFunction label declare a Thumb function named MyFPFunction.

vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0

Load the immediate single-precision floating-point value 0.5 into s2. Load the single-precision floating-point value at LCPI7_0 (-0.3) into s0. Load the single-precision floating-point value stored in r0 into s4.

vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0

Using single-precision math, first add d1 and d2 into d16 (remember that s0 is the low half of d0, s2 the low half of d1, and s4 the low half of d2). Then add d0 and d16 into d0. Finally, store the single-precision result back into r0.
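Putting the pieces together, the function computes its argument plus 0.5 plus -0.3. The sketch below decodes the constant stored at LCPI7_0 and repeats the two additions, rounding to single precision after each step. The helper names are invented, and only the low lane of each d register is modeled, since that is the only lane whose result is returned.

```python
import struct

def to_f32(value):
    """Round a Python float (a double) to single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def bits_to_f32(bits):
    """Reinterpret a 32-bit integer as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def my_fp_function(x):
    s2 = 0.5                      # vmov.f32 s2, #5.000000e-01
    s0 = bits_to_f32(3197737370)  # vldr.32 s0, LCPI7_0
    s4 = to_f32(x)                # vmov s4, r0
    d16 = to_f32(s4 + s2)         # vadd.f32 d16, d2, d1
    return to_f32(d16 + s0)       # vadd.f32 d0, d16, d0

print(bits_to_f32(3197737370))  # the constant, approximately -0.3
print(my_fp_function(1.0))      # approximately 1.2
```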

bx lr

Branch, with interworking, to the link register. This function is so small and simple that it has no prologue, no epilogue, and no stack frame, when built in optimizing mode. This is a noticeable win for ARM code, where it wouldn’t have been for x86_64. Since lr contains the return address for main, this returns us there.

That’s the whole function! Wasn’t that simple?

Conclusion

This concludes the whirlwind tour of ARM assembly, and this series of articles on assembly in general. There is quite a bit I haven’t covered in both architectures, such as conditional instructions, looping, and optional flag updates on instructions, but this was meant only as an introduction after all. I hope you’ve enjoyed it. Thanks for reading!