dyld:OS X 上的动态链接

Mike Ash Friday Q&A 中文译文:dyld:OS X 上的动态链接

作者 TommyWu

译文 · 原文: Friday Q&A 2012-11-09: dyld: Dynamic Linking On OS X · 作者 Mike Ash

原文:https://www.mikeash.com/pyblog/friday-qa-2012-11-09-dyld-dynamic-linking-on-os-x.html 发布:2012-11-09 作者:Mike Ash 译者:MiMo(mimo-v2.5-pro);代码块保留英文原样


在最近一次求职面试中,我有机会研究了 OS X 动态链接器 dyld 的一些内部机制。我发现系统的这个特定领域相当有趣,并且看到很多人在链接问题上遇到困难,因此决定撰写一篇关于动态链接基础知识的文章。其中一些深层逻辑对我而言也是新知识,若有不准确之处还请见谅。

警告
由于 dyld 工作方式的精确细节非常复杂且频繁变化,加之我本人尚未完全掌握所有细节,本文中对 dyld 的多数分析都是简化的,部分内容纯属概念性阐述。若您对具体细节感兴趣,强烈建议查阅 dyld 的源代码 —— 其公开地址为http://opensource.apple.com。

静态链接
那么,让我们先从静态链接(通常简称为“链接”)说起。这是编译之后通常发生的步骤,将编译器从源代码生成的机器语言(即目标文件)“链接”成一个单独的二进制文件。

静态链接为何与动态链接相关?因为静态链接器 ld(和 ld64)负责将源代码中的符号引用(symbol references)转换为间接符号查找(indirect symbol lookups),供 dyld 后续使用。这里有一个非常简单的例子:

// This is the actual full declaration of main() on OS X. The "apple"
// parameter is the path to the executable, i.e. _NSGetProgname().
int main(int argc, char **argv, char **envp, char **apple)
{
puts("Hello, world!\n");
return 0;
}

此代码经 clang -S test.c -o test.s -Os 编译后生成的(优化)汇编指令如下(已去除了一些调试信息):

.section __TEXT,__text,regular,pure_instructions
.globl _main
_main: ## @main
pushq %rbp
movq %rsp, %rbp
leaq L_str(%rip), %rdi
callq _puts
xorl %eax, %eax
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_str: ## @str
.asciz "Hello, world!"

看起来足够直接了。让我们将它编译成目标文件并反汇编完全编译后的版本(clang -c test.c -o test.o -Os, otool -tv test.o):

_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 leaq 0x00000000(%rip),%rdi
000000000000000b callq 0x00000010
0000000000000010 xorl %eax,%eax
0000000000000012 popq %rbp
0000000000000013 ret

哎呀,我们的符号名称消失了!编译器已经将它们替换成了一串零字节。对于 leaq 指令,结果是以当前 rip 值为地址进行加载。callq 指令是“有符号偏移量”跳转,这意味着零偏移量会调用代码中的下一条指令(此处为地址 0x10)。别担心,编译器已经生成了重定位条目(relocation entries),这些条目会告诉链接器在哪里更新所有这些零值(otool -r test.o):

Relocation information (__TEXT,__text) 2 entries
address pcrel length extern type scattered symbolnum/value
0000000c 1 2 1 2 0 4
00000007 1 2 1 1 0 0

第一条记录显示:“在 __TEXT,__text 节的偏移量 0xc 处,存在一个长度为‘long word’(长字)的、非分散的、外部的、PC 相对(PC-relative)的 X86_64_RELOC_BRANCH 引用,指向符号表中索引为 4 的符号。”通过查看符号表(使用命令 nm -ap),我们得到:

0000000000000014 s L_str
0000000000000048 s EH_frame0
0000000000000000 T _main
0000000000000060 S _main.eh
U _puts

索引为 4 的符号(第五个条目)是 _puts。类似地,索引为 0 的符号是 L_str,它将在目标文件偏移量 0x7 处被重定位(即 leaq 指令起始后的第 3 个字节处)。最后,让我们看看将这个目标文件链接成可执行文件的结果(命令:clang test.c -o test -Os,使用 otool -tv test 查看):

_main:
0000000100000f36 pushq %rbp
0000000100000f37 movq %rsp,%rbp
0000000100000f3a leaq 0x00000029(%rip),%rdi
0000000100000f41 callq 0x100000f4a
0000000100000f46 xorl %eax,%eax
0000000100000f48 popq %rbp
0000000100000f49 ret

链接器 ld 完成了以下操作:

  • 将 __TEXT 段定位到 x86_64 标准的可执行文件加载地址 0x0000000100000000,并将 __TEXT,__text 节放置在该地址之后偏移 0xf36 处。__TEXT 段的前 0xf35 字节(实际上是 0xa0f 字节,因为较大的那个偏移量没有把文件的 Mach-O 头计算在内)被清零。这使得 __TEXT 段紧贴着 __DATA 段对齐。我不完全确定为何这样做,但我推测这与缓存效率(cache efficiency)有关。

  • 将 0 替换为从 leaq 指令到 L_str 符号的实际偏移量,在本例中为 0x29。rip 相对寻址以下一条指令的地址为基准,因此结果地址是 0x100000f41 + 0x29 = 0x100000f6a,通过查看加载命令(使用 otool -l test)可知,这正是 __TEXT,__cstring 节的起始位置。

  • 将 0 替换为 puts() 函数的符号存根(symbol stub)地址,该地址紧跟在 main 函数之后。再次查看加载命令,可知它位于 __TEXT,__stubs 节,我们稍后将详细研究该节。

静态链接(static linking)会合并目标文件(object files),解析指向外部库的符号引用(symbol references),为这些符号应用重定位(relocations),并构建完整的可执行文件。显然,这里做了极大简化,并且仅适用于可执行文件。动态库(dynamic libraries)的链接过程与此相似但并非完全相同,为简洁起见,此处不展开讨论。

dyld 究竟做了什么?
总体而言,dyld 承担了相当多的工作。它大致按以下顺序执行:

  • 基于内核为进程设置的极简原始栈(raw stack)进行自举(bootstraps itself)。

  • 递归且带缓存地将可执行文件所链接的全部依赖动态库加载至进程内存空间,包括根据环境变量和可执行文件的“运行路径”(runpaths)遍历必要的搜索路径。

  • 通过立即绑定非惰性符号(non-lazy symbols)并为惰性绑定(lazy binding)设置必要表格,将这些库链接到可执行文件。

  • 运行可执行文件的静态初始化器(static initializers)。

  • 设置可执行文件 main 函数的参数并调用该函数。

  • 在进程执行过程中,通过绑定符号来处理惰性绑定符号桩(lazily-bound symbol stubs)的调用,提供运行时动态加载服务(通过 dl*() API),并为 gdb 等调试器提供获取关键信息的钩子(hooks)。

  • main 函数返回后,运行静态终止例程(static terminator routines)。

  • 在某些场景下,当 main 函数返回后,负责调用 libSystem 的 _exit 例程。

我将大致按顺序审视每个步骤。

引导(Bootstrap)

dyld 是新进程中运行的第一段代码。具体来说,一个描述性非常强的符号 __dyld_start 会被调用。这得益于内核中的一点 “魔法”:它会注意到主可执行文件中的 LC_LOAD_DYLINKER 加载命令,并使用给定的动态链接器的入口符号(entry symbol)作为进程的初始指令指针(instruction pointer)。__dyld_start 执行以下伪代码(实际实现是一段紧凑的汇编代码):

noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv, stack char **envp, stack char **apple, stack char **STRINGS)
{
stack push 0 // debugger end of frames marker
stack align 16 // SSE align stack
uint64_t slide = __dyld_start - __dyld_start_static;
void *glue = NULL;
void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
if (glue)
push glue // pretend the return address is a glue routine in dyld
else
stack restore // undo stack stuff we did before
goto *entry(argc, argv, envp, apple); // never returns
}

回过头来看,我不确定伪代码是否比直接用汇编更清晰,不过让我们快速过一遍:

  • 将一个 0 压入栈中,并按照 SSE 要求进行栈对齐。

  • 通过从当前 __dyld_start 地址减去一个地址始终固定的符号的地址,计算 dyld 自身的 slide(地址偏移)。

  • 运行 dyld 的实际引导程序。该程序为 dyld 自身设置一些最小状态(例如从 libSystem 中提取某些函数而无需实际链接到它,并设置 Mach 消息传递),然后运行 dyld 的真正主程序,后者执行加载、链接和初始化。

  • 如果 dyld 检测到主可执行文件使用 LC_MAIN load command(加载命令)来设置其入口点,它将返回一个胶水程序的地址,该程序负责在进程结束时调用 _exit。该地址被压入栈中,欺骗入口点使其认为这是该函数的返回地址;该函数末尾的 ret 指令将跳转到该胶水代码。

  • 反之,如果 dyld 检测到可执行文件使用的是旧式的 LC_UNIXTHREAD 加载命令,它会简单地将栈恢复到原始状态,并跳转到该入口点。该入口点将是来自 crt1.o(即 C 运行时)的起始例程。C 运行时基本上会重新执行 __dyld_start 刚才所做的全部工作,但不包括实际的 dyld 启动 —— 这也是它被 LC_MAIN 命令取代的原因之一。

  • 跳转至入口点

加载
每次 dyld 必须加载一个动态库时(无论是在应用程序启动期间还是由于运行时的请求),它都必须:在磁盘上定位正确的二进制文件;将该文件映射到内存;解析其 Mach-O 头部;并记录下刚刚生成的所有数据以供链接(在此上下文中指的是符号绑定)使用。(天哪,“链接”这个词的用法可真多,不是吗?)

在磁盘上定位正确的二进制文件通常相当简单。LC_LOAD_DYLIB 命令会给出一个绝对路径,二进制文件就从该路径加载。当然,有时该路径会包含一个特殊标记,指示 dyld 去其他位置搜索:

  • @executable_path — 在 OS X 10.3 之前,这是 dyld(动态链接编辑器)唯一支持的标记,其用途相当有限。dyld 会将此标记替换为主可执行文件的完整路径。

  • @loader_path — 此标记在 10.4 中新增,它会被替换为加载当前正在加载的二进制文件的那个二进制文件的完整路径。这个加载者不一定总是主可执行文件,它主要使得框架能够自行嵌套框架,而无需依赖 “伞框架”(umbrella framework)机制(苹果从未完全公开该机制,并且积极劝阻使用)。

  • @rpath — 当此标记在 10.5 中被引入时,开发者们非常欣喜。此标记会按顺序替换为当前二进制文件(递归地)所有加载者中嵌入的每个 “运行路径”(run path)。这最终使得框架和动态库只需构建一次,即可在不修改其安装名(install name)的情况下,既用于系统范围的安装,也用于嵌入式使用。它还允许应用程序为某个库指定备用位置,甚至能覆盖一个深度嵌入库的指定位置。

此外还存在默认搜索路径;在某些情况下,还可以通过环境变量和加载命令指定更多路径。

链接
一旦动态库(dynamic library)被加载进进程(暂不考虑与地址空间布局随机化相关的操作,且暂不讨论代码签名问题),其非懒绑定(non-lazy binding)符号必须立即完成绑定。

此处我需要简要说明懒绑定(lazy binding)符号与非懒绑定符号的区别。这并不复杂:懒绑定符号的绑定过程会延迟到可执行文件首次调用该符号时才执行;而非懒绑定符号则在包含它的库被加载时就立即完成绑定。两者的实际绑定过程完全相同,唯一的区别在于触发绑定的时机。

从概念上讲,绑定一个符号很简单。但在实践中,这个过程颇有意思:

  • 在可执行文件 __LINKEDIT 段的绑定信息中查找该符号对应的符号桩(symbol stub)地址。以上文的例子为例,_puts 的符号桩位于 0xf4a(前面还有更多位数,此处为简洁起见省略!)。如果我们反汇编该地址处的机器代码,将会得到:

Contents of (__TEXT,__stubs) section
0000000100000f4a jmp *0x000000c0(%rip)
Contents of (__TEXT,__stub_helper) section
0000000100000f50 leaq 0x000000b1(%rip),%r11
0000000100000f57 pushq %r11
0000000100000f59 jmp *0x000000a1(%rip)
0000000100000f5f nop
0000000100000f60 pushq $0x00000000
0000000100000f65 jmp 0x100000f50

哇,一个简单的跳转指令!然而,将跳转目标替换为符号地址并不是那么简单,因为跳转指令只能是一个有符号的 32 位偏移量,而符号可能(也确实应该!)位于 64 位地址空间的任何地方。因此,下一步是……

  • 同样在绑定信息中,查找 puts 符号指针在 __DATA,__nl_symbol_ptr 节区中的地址。如果这是一个惰性符号(lazy symbol),则改为在 __DATA,__la_symbol_ptr 节区中查找。在我们的示例可执行文件中,这些节区看起来很简单(综合了 otool 的输出):
Contents of (__DATA,__nl_symbol_ptr) section
0000000100001000 dq 0x0000000000000000
0000000100001008 dq 0x0000000000000000
Contents of (__DATA,__la_symbol_ptr) section
0000000100001010 dq 0x0000000100000f60

简而言之,非惰性符号指针(non-lazy symbol pointers)仅仅是零字节,而惰性符号指针(lazy symbol pointer)直接指回了存根辅助节区(stub helper section)!

  • 将相应 __DATA 段中符号指针的地址更新为已加载库中该符号的真实地址。完成!

你可能会问,所有这些复杂的间接寻址以及这些额外的段究竟是为了什么?

对于非惰性符号而言,这种间接性存在有两个原因。首先,你不能在 __TEXT 段中放置可写数据,因为该段存放的是可执行代码。这意味着你无法在运行时直接更新跳转指令,即便该跳转指令支持绝对 64 位地址。其次,你也不能在 __DATA 段中放置可执行代码,因为该段存放的是可写数据!因此你也不能在那里直接放置 64 位跳转指令。结果就是,跳转指令被编码为需要额外一层间接寻址,就像在 C 语言中对指针进行解引用一样。

这一切对延迟绑定的符号(lazily-bound symbols)同样成立,但需注意若干细节。动态链接器 dyld 并不立即绑定此类符号,而是任其保持原状。静态链接器在延迟符号指针(lazy symbol pointer)中保存的地址并非简单的 0,而是指向“桩辅助代码”(stub helper)。桩辅助代码是嵌入在 __TEXT,__stub_helper 节(真的吗?谁能猜到呢?)中的一段代码,它将需要更新的延迟符号指针表内的偏移量压入栈中,然后跳转至 dyld 内部符号绑定器所对应的(非延迟绑定的!)符号。在本例这样简单的演示中无法体现,但实际上每新增一个延迟符号,桩辅助代码就会增加两条指令,以确保将正确的偏移量传递给 dyld。当延迟绑定完成后,符号指针会照常更新,且该符号的桩辅助代码此后将不再被调用。

静态初始化器、静态终止器与运行时服务

到此阶段,大部分关键操作已经完成。dyld 将执行可执行文件中的所有静态初始化器(最常见的包括全局 C++ 对象的构造函数和 Objective-C 类的 +load 方法,此外还有用于纯 C 语言的 __attribute__((constructor)) 函数)。初始化器列表存储在二进制文件的独立 __DATA,__mod_init_func 段中,本质上是一组指向 __TEXT,__text 段的地址,dyld 会按顺序调用它们。初始化器函数接收的参数与 main 函数相同。

当进程退出时,dyld 同样会运行静态终止器,这主要指 C++ 对象的静态析构函数和 __attribute__((destructor)) 函数。其处理方式与静态初始化器类似,区别在于它们存储在 __DATA,__mod_term_func 段中且不接收任何参数。静态终止器的运行环境与 atexit() 函数相同。

最后,dyld 会为已加载的二进制文件提供运行时服务。dl*() 系列 API 是访问 dyld 服务的首选接口(从 Mac OS X 10.5 起,这是唯一受支持的接口;旧版函数已被弃用):

  • dlopen - 执行动态库的加载阶段,可选择性地部分或完全执行绑定阶段。

  • dlsym - 在动态库(或整个进程)中查找符号。最简单的形式不过是一个 “名称到地址” 的查找。

  • dladdr - 与 dlsym 相反,将地址转换为一组符号信息。

  • dlclose - 从进程中卸载一个动态库,前提是没有其他正在使用的句柄指向它。卸载操作会使该动态库提供的所有符号失效,这在 Objective-C 环境中可能是一个相当敏感的操作。

遗漏内容
尽管本文已讨论了不少内容,但也省略了大量信息:

  • 两级命名空间(two-level namespaces),用于防止动态库中出现简单的符号冲突

  • dyld 共享缓存(dyld shared cache),它维护一个系统范围的已加载动态库映射,以加速绑定

  • 重定基地址(Rebasing)

  • 代码签名(Code signing)

  • 动态库链接(Dynamic library linking)

  • dyld 丰富的环境变量集

  • “受限” 二进制文件(特别是 setuid 二进制文件)

  • 内核与 dyld 交互的大部分内容

  • Mach-O 二进制文件中的压缩与加密

  • dyld 自身的构建方式

  • 符号插入(Symbol interposing)

  • dyld 在 i386 和 ARM 架构上的运作方式,概念上是相同的,但两种架构在细节上存在显著差异

  • Mach-O 二进制格式的细节

  • “胖” 二进制(fat binaries)的处理方式

我省略这些内容有两个原因:其一,撰写本文时我进度有些落后,实在没有时间将所有内容纳入;其二,一篇文章的篇幅确实无法涵盖所有内容。不过,这些概念至少在苹果的文档中有所记载,并且内核与 dyld 均已开源。以下是我希望提供的一些有用链接(警告:部分链接已相当过时,因为苹果似乎不太热衷于更新文档):

苹果的 Mach-O 文档
苹果的 Mach-O 参考
Mach-O “loader” 头文件,一个非常好的参考资料(也可查看 mach-o/ 目录下的其他文件)
苹果的 dyld 参考文档
dlopen (3) 手册页
dyld 的发布说明
10.8.2 版本的 dyld 源代码
10.8.2 版本的内核源代码(特别留意 bsd/kern/kern_exec.c 和 bsd/kern/mach_loader.c)

总结
dyld 是 Mac OS X 最核心的组成部分之一;没有它,除了内核外没有任何程序能够运行。这样的职责不可避免地带来了巨大的复杂性,而 dyld 的复杂度可谓登峰造极。这种复杂性部分源于 dyld 庞大的向后兼容(backwards-compatibility)需求,部分则源于它必须处理的任务范围本身极其广泛。大多数开发者无需如此深入地理解链接(linking)机制,但或许下次在 Xcode 中遇到链接器(linker)报出的奇怪错误信息时,你能更清楚该从何处寻找问题所在。当然,也可能依然毫无头绪 ——ld 这个程序有时确实相当难缠。

这就是本周的全部内容。下周请回来收看 Mike 准备的特别惊喜;他的下一篇文章尤其精彩!


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2012-11-09-dyld-dynamic-linking-on-os-x.html

In the course of a recent job interview, I had an opportunity to study some of the internals of dyld, the OS X dynamic linker. I found this particular corner of the system interesting, and I see a lot of people having trouble with linking issues, so I decided to do an article about the basics of dynamic linking. Some of the deeper logic is new to me, so sorry in advance for any inaccuracies.

WARNING
Because the precise details of how dyld works are quite complicated and change frequently, and because I don’t yet know all of those details myself, most of my examination of it in this article is simplified, and in some places purely conceptual. If you’re curious about the particulars, I strongly recommend dyld’s source code, which is publicly available at http://opensource.apple.com.

Static linking
So, let’s start by talking about static linking, generally referred to simply as ‘linking’. This is the step that typically happens after compiling, where the object files (the machine language the compiler churned out from your source code) are ‘linked’ together into a single binary file.

Why does static linking matter to dynamic linking? Because the static linker, ld (and ld64) is responsible for transforming symbol references in your source code into indirect symbol lookups for dyld to use later. Here’s a very simple example:

#include <stdio.h>

// This is the actual full declaration of main() on OS X. The "apple"
// parameter is the path to the executable, i.e. _NSGetProgname().
int main(int argc, char **argv, char **envp, char **apple)
{
    puts("Hello, world!\n");
    return 0;
}

The (optimized) assembly for this, as generated by clang -S test.c -o test.s -Os and stripped of a bit of debug info, is:

.section __TEXT,__text,regular,pure_instructions
.globl _main
_main: ## @main
pushq %rbp
movq %rsp, %rbp
leaq L_str(%rip), %rdi
callq _puts
xorl %eax, %eax
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_str: ## @str
.asciz "Hello, world!"

Seems straightforward enough. Let’s compile it into an object file and dump the fully compiled version (clang -c test.c -o test.o -Os, otool -tv test.o):

_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 leaq 0x00000000(%rip),%rdi
000000000000000b callq 0x00000010
0000000000000010 xorl %eax,%eax
0000000000000012 popq %rbp
0000000000000013 ret

Whoops, our symbol names are gone! The compiler has replaced them with sets of zero bytes. For the leaq instruction, the result is a load from the current value of rip. The callq instruction is a “signed offset” jump, which means that the offset of 0 calls the very next instruction in the code (address 0x10 in this case). Never fear, the compiler has generated relocation entries which tell the linker where to update all these zeroes (otool -r test.o):

Relocation information (__TEXT,__text) 2 entries
address pcrel length extern type scattered symbolnum/value
0000000c 1 2 1 2 0 4
00000007 1 2 1 1 0 0

The first entry says, “At offset 0xc in the __TEXT,__text section, there is an unscattered, external, PC-relative X86_64_RELOC_BRANCH reference of length ‘long word’ to the symbol at index 4 in the symbol table.” A peek at the symbol table (nm -ap) gives us:

0000000000000014 s L_str
0000000000000048 s EH_frame0
0000000000000000 T _main
0000000000000060 S _main.eh
U _puts

The symbol at index 4 (the fifth entry) is _puts. Similarly, the symbol at index 0 is L_str, which will be relocated at offset 0x7 of the object file (three bytes into the leaq instruction). Finally, let’s look at the result of linking this object into an executable (clang test.c -o test -Os, otool -tv test):

_main:
0000000100000f36 pushq %rbp
0000000100000f37 movq %rsp,%rbp
0000000100000f3a leaq 0x00000029(%rip),%rdi
0000000100000f41 callq 0x100000f4a
0000000100000f46 xorl %eax,%eax
0000000100000f48 popq %rbp
0000000100000f49 ret

ld has:

  • Located the __TEXT segment at the standard executable load address for x86_64, 0x0000000100000000, and the __TEXT,__text section at 0xf36 after that. The first 0xf35 (actually, 0xa0f, since the larger offset doesn’t account for the file’s Mach-O header) bytes of __TEXT are zeroed out. This aligns the __TEXT segment flush up against the __DATA segment. I don’t know exactly why this is done, though I assume it has something to do with cache efficiency.

  • Replaced 0 with the actual offset from the leaq instruction to the L_str symbol, which in this case is 0x29. RIP-relative operands are measured from the address of the next instruction, so the resulting address is 0x100000f41 + 0x29 = 0x100000f6a, which a peek at the load commands (otool -l test) tells us is the exact beginning of the __TEXT,__cstring section.

  • Replaced 0 with the address of the symbol stub for puts(), which comes immediately after main. Another peek at the load commands puts this in the __TEXT,__stubs section, which we’ll look at in detail later.

Static linking, then, combines object files, resolves symbol references to external libraries, applies the relocations for those symbols, and builds a complete executable. Obviously, this is a huge simplification and applies only to executables. The process of linking dynamic libraries is similar, but not identical, and for brevity’s sake I won’t go into it here.

What does dyld do, anyway?
dyld is actually responsible for quite a bit of work, all told. It (in roughly this order):

  • Bootstraps itself based on the very simple raw stack set up for the process by the kernel.

  • Recursively and cachingly loads all dependent dynamic libraries the executable links to into the process’ memory space, including any necessary perusal of search paths from both the environment and the executable’s “runpaths”.

  • Links those libraries into the executable by immediately binding non-lazy symbols and setting up the necessary tables for lazy binding.

  • Runs static initializers for the executable.

  • Sets up the parameters to the executable’s main function and calls it.

  • During the process’ execution, handles calls to lazily-bound symbol stubs by binding the symbols, provides runtime dynamic loading services (via the dl*() API), and provides hooks for gdb and other debuggers to get critical information.

  • Runs static terminator routines after main returns.

  • In some scenarios, makes the required call to libSystem’s _exit routine once main returns.

I’ll examine each step roughly in order.

Bootstrap
dyld is the very first code run in a new process. In particular, a symbol by the very descriptive name of __dyld_start is called. This happens due to a bit of magic in the kernel which notices the LC_LOAD_DYLINKER load command in the main executable and uses the given dynamic linker’s entry symbol as the process’ initial instruction pointer. __dyld_start performs the following pseudocode (the actual implementation is a compact bit of assembly code):

noreturn __dyld_start(stack mach_header *exec_mh, stack int argc, stack char **argv,
                      stack char **envp, stack char **apple, stack char **STRINGS)
{
    stack push 0        // debugger end of frames marker
    stack align 16      // SSE align stack
    uint64_t slide = __dyld_start - __dyld_start_static;
    void *glue = NULL;
    void *entry = dyldbootstrap::start(exec_mh, argc, argv, slide, ___dso_handle, &glue);
    if (glue)
        push glue       // pretend the return address is a glue routine in dyld
    else
        stack restore   // undo stack stuff we did before
    goto *entry(argc, argv, envp, apple);   // never returns
}

In retrospect, I’m not sure that pseudocode is any more sensible than the assembly would have been, but let’s walk through it quickly:

  • Push a 0 onto the stack, and align the stack to SSE requirements.

  • Calculate the slide of dyld itself by subtracting the address of a symbol whose address is always the same from the current address of __dyld_start.

  • Run dyld’s actual bootstrap routine, which sets up some minimal state for dyld itself (such as pulling in certain functions from libSystem without actually linking to it and setting up Mach messaging) and then runs dyld’s real main routine, which does loading, linking, and initializers.

  • If dyld detected that the main executable uses the LC_MAIN load command to set up its entry point, it returns the address of a glue routine which is responsible for calling _exit when the process is done. That address is pushed onto the stack, fooling the entry point into thinking it’s the routine’s return address; the ret instruction at the end of that function will jump to that glue code.

  • If, on the other hand, dyld detected the executable using the older LC_UNIXTHREAD load command, it simply restores the stack to its original state and jumps to that entry point, which will be the start routine from crt1.o, the C runtime. The C runtime basically redoes all the work that __dyld_start just did, minus the actual dyld startup, which is one of the reasons it was replaced with the LC_MAIN command.

  • Jump to the entry point.

Loading
Each time dyld has to load a dynamic library, whether at application startup or due to a request at runtime, it must locate the correct binary on disk, map the file into memory, parse the Mach-O headers, and record all the data it just generated for use in linking (which in this context means symbol binding). (Boy, “linking” sure has a lot of different uses, doesn’t it?)

Locating the correct binary on disk is usually fairly simple. The LC_LOAD_DYLIB command will give an absolute path, and the binary is loaded from that path. Of course, sometimes that path contains a special marker that tells dyld to look somewhere else:

  • @executable_path - Up to OS X 10.3, this was the only marker dyld supported, and it had rather limited utility. dyld will replace this marker with the full path to the directory containing the main executable.

  • @loader_path - Added in 10.4, this marker is replaced with the full path to the directory containing the binary which loaded the binary that is currently being loaded. This is not always the main executable, and primarily enabled frameworks to themselves embed frameworks without resorting to the “umbrella framework” mechanism, which Apple never made entirely public and actively discouraged the use of.

  • @rpath - When this marker was added in 10.5, there was much rejoicing. This marker is replaced in sequence with each “run path” embedded in the binary’s loading binaries (recursively), enabling frameworks and dynamic libraries to finally be built only once and be used for both system-wide installation and embedding without changes to their install names, and allowing applications to provide alternate locations for a given library, or even override the location specified for a deeply embedded library.

There are also default search paths, and in some circumstances, further paths can be specified in the environment and load commands.

Linking
Once a dynamic library is loaded into a process (ignoring for now some manipulations related to address space randomization, and also setting aside code signing issues), its non-lazy symbols must be bound.

At this point, I should take a moment out to explain the difference between lazy and non-lazy symbols. It’s not complicated; a lazy symbol’s binding is deferred until the symbol is called the first time by the executable, while a non-lazy symbol is bound immediately when its containing library is loaded. The actual binding process is identical; the only difference is in how that process is triggered.

Conceptually, binding a symbol is simple. In practice, it’s rather interesting:

  • Look up, in the binding information of the __LINKEDIT segment of the executable, the address of the symbol stub for the symbol. Taking our example from above, the stub for _puts was at 0xf4a (plus some, I’m shortening for simplicity’s sake!). If we were to disassemble the machine code at that address, we would get:

Contents of (__TEXT,__stubs) section
0000000100000f4a jmp *0x000000c0(%rip)
Contents of (__TEXT,__stub_helper) section
0000000100000f50 leaq 0x000000b1(%rip),%r11
0000000100000f57 pushq %r11
0000000100000f59 jmp *0x000000a1(%rip)
0000000100000f5f nop
0000000100000f60 pushq $0x00000000
0000000100000f65 jmp 0x100000f50

Wow, a nice simple jump instruction! Unfortunately, it’s not quite as simple as replacing the target of the jump with the address of the symbol, since the jump can only be a signed 32-bit offset and the symbol could (and should!) be anywhere in the 64-bit address space. So, the next step is…

  • Look up, also in the binding information, the address of the symbol pointer for puts in the __DATA,__nl_symbol_ptr section. If this is a lazy symbol, look it up in the __DATA,__la_symbol_ptr section instead. In our example executable, these sections look simply like this (using a hybrid of otool’s output):

Contents of (__DATA,__nl_symbol_ptr) section
0000000100001000 dq 0x0000000000000000
0000000100001008 dq 0x0000000000000000
Contents of (__DATA,__la_symbol_ptr) section
0000000100001010 dq 0x0000000100000f60

In short, the non-lazy symbol pointers are just zero bytes, and the lazy symbol pointer points right back to the stub helper section!

  • Update the address of the symbol pointer in the appropriate __DATA section to the real address of the symbol in the loaded library. You’re done!

So what, you may be asking, are all this crazy indirection and all these extra sections all about?

Well, for non-lazy symbols, the indirection is necessary for two reasons. First, you can’t put writable data in the __TEXT segment, which is executable code. This means you can’t update the jump instruction directly at runtime, even if you had a jump instruction that took an absolute 64-bit address. Secondly, you can’t put executable code in the __DATA segment, which is writable data! So you can’t just put a 64-bit jump instruction there either. As a result, the jump instruction is encoded to take an extra level of indirection, as with dereferencing a pointer in C.

All this is true of lazily-bound symbols as well, but with a few caveats. dyld does not immediately bind such a symbol, but just leaves it be. The address saved in the lazy symbol pointer by the static linker isn’t a simple 0, but rather points to the “stub helper”. The stub helper is a bit of code embedded in the __TEXT,__stub_helper section (really? who’d’ve guessed?) which pushes the offset into the lazy symbol pointer table to update onto the stack and jumps to the (not lazily bound!) symbol for dyld’s internal symbol binder. It doesn’t show up in this very simple example, but the stub helper grows by two instructions for each lazy symbol so that the correct offset is passed to dyld. When the lazy binding is finished, the symbol pointer is updated as usual, and the stub helper is never called again for that symbol.

Static initializers, static terminators, and runtime services
Most of the interesting stuff has already happened at this point. dyld will run any static initializers in the executable (most often constructors for global C++ objects and +load methods for Objective-C classes, though there are also __attribute__((constructor)) functions for plain C). A list of initializers is stored in a separate __DATA,__mod_init_func section in the binary, and is simply a set of addresses into the __TEXT,__text section which dyld calls in order of appearance. Initializer functions are passed the same arguments as main.

When the process exits, dyld will also run static terminators, which mostly means static destructors for C++ objects and __attribute__((destructor)) functions. These are handled just like static initializers, except that they’re stored in __DATA,__mod_term_func and take no parameters. Static terminators run in the same context as an atexit() function.

Finally, dyld provides runtime services to binaries it has loaded. The dl*() APIs are the preferred interface to dyld’s services (and as of 10.5, the only sanctioned interface; the old functions have been deprecated):

  • dlopen - Performs the load stage of loading a dynamic library, can optionally partially or completely perform the bind stage.

  • dlsym - Look up a symbol in a dynamic library (or the entire process). At its simplest, this is no more than a “name to address” lookup.

  • dladdr - The inverse of dlsym, transforming an address into a set of symbol information.

  • dlclose - Unloads a dynamic library from the process, if no other handles to it are in use. Unloading invalidates all the symbols provided by the dynamic library and can be something of a touchy operation, particularly in an Objective-C environment.

What’s missing
While I’ve gone over quite a bit, I’ve also left out a lot of information in this article:

  • Two-level namespaces, which prevent trivial symbol collisions in dynamic libraries

  • The dyld shared cache, which maintains a systemwide map of already-loaded dynamic libraries for fast binding

  • Rebasing

  • Code signing

  • Dynamic library linking

  • dyld’s expansive set of environment variables

  • “Restricted” binaries (particularly setuid binaries)

  • Most of the kernel’s interaction with dyld

  • Compression and encryption in Mach-O binaries

  • How dyld itself is built

  • Symbol interposing

  • dyld’s operation on i386 and ARM, which is conceptually the same, but both architectures differ significantly in the details

  • Details of the Mach-O binary format

  • How “fat” binaries are handled

I’ve left these out for two reasons: One, I was a bit behind when writing this article and just didn’t have time to put it all in, and two, there really isn’t space in one article for all that. However, all of these concepts are at least somewhat documented by Apple, and both the kernel and dyld are open-source. Here are what I hope are some useful links (warning, some of these are pretty outdated, as Apple doesn’t seem too interested in updating the documentation):

Apple’s Mach-O documentation
Apple’s Mach-O reference
The Mach-O “loader” header, a very good reference (also look at other files in the mach-o/ directory)
Apple’s dyld Reference
The dlopen(3) manpage
dyld’s Release Notes
dyld’s source code as of 10.8.2
Kernel source code as of 10.8.2 (look at bsd/kern/kern_exec.c and bsd/kern/mach_loader.c in particular)

Conclusion
dyld is one of the most essential parts of OS X; without it, nothing but the kernel would run. With that responsibility inevitably comes significant complexity, and dyld has it aplenty. Some of that complexity comes from the massive backwards-compatibility requirements of dyld, and some simply from the sheer scope of the tasks it must handle. Most developers will have no need to understand linking in such detail, but maybe the next time you get a strange error message in Xcode from the linker, you’ll have a better idea of where to look for the problem. Then again, maybe not; ld can be pretty obstructive.

That’s all I have for you this week. Come back next week for a special treat from Mike; his next article is particularly awesome!