Tagged Pointer 字符串

Mike Ash Friday Q&A 中文译文:Tagged Pointer 字符串

作者 TommyWu
封面圖片: Tagged Pointer 字符串

译文 · 原文: Friday Q&A 2015-07-31: Tagged Pointer Strings · 作者 Mike Ash

原文:https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html 发布:2015-07-31 作者:Mike Ash 译者:MiMo(mimo-v2.5-pro);代码块保留英文原样


标签指针(tagged pointers)是一项有趣的技术,用于提升性能并减少内存占用。自 OS X 10.10 起,NSString 开始采用标签指针优化,今天我将探讨其工作原理 —— 这一主题由 Ken Ferry 提出。

概念回顾 对象在内存中是对齐的,其地址始终至少是指针大小的倍数,实践中通常为 16 的倍数。对象指针存储为完整的 64 位整数,但这种对齐特性意味着总有某些位必然为零。

标签指针利用了这一事实:当这些本应为零的位实际非零时,对象指针便被赋予特殊含义。在苹果的 64 位 Objective-C 实现中,最低有效位(即最低位)设为 1(也就是说奇数)的指针会被视为标签指针。此时系统不会执行标准的 isa 指针解引用(isa dereference)来确定类,而是将接下来的三个位视为标签类表(tagged class table)的索引。该索引用于查找标签指针对应的类。剩余的 60 位则交由标签类自行定义用途。

一个简单的应用是将合适的 NSNumber 实例标记为标签指针(tagged pointer),其中的额外 60 位用于存储数值。最低位会设置为 1,接下来的三位设置为 NSNumber 标签类的适当索引。随后的 60 位则可以存储,例如,任何在该空间内的整数值。

从外部看,这样的指针看起来和表现得都与其他对象无异。它能像其他对象一样响应消息,因为 objc_msgSend 了解标签指针。如果你向它请求 integerValue,它会从那 60 位中提取数据并返回给你。然而,你节省了一次内存分配、每次访问时的指针间接寻址,并且引用计数可以成为无操作,因为没有需要释放的内存。对于常用的类,这可以带来显著的性能提升。

NSString 似乎并非 tag pointer(标签指针)的理想候选者,因为它长度可变,而且可能远超标签指针的 60 位容量。然而,标签指针类可以与普通类共存 —— 对某些值使用标签指针,对其他值使用普通指针。以 NSNumber 为例,任何大于 2^60 - 1 的整数都无法容纳在标签指针中,此时就需要在内存中分配普通的 NSNumber 对象来存储。只要创建对象的代码编写得当,整个机制就能顺畅运作。

NSString 可以采用相同的策略。对于能装入 60 位的字符串,它可以创建标签指针;其他字符串则放入普通对象。这种方案假设短字符串的使用频率足够高,能带来可感知的性能提升。实际代码中是否确实如此呢?从苹果的实际实现来看,他们似乎认为答案是肯定的。

可能的实现方案

在探讨苹果的实际做法之前,我们先思考一下如何实现标签化字符串(tagged string)存储。基础很简单:将最低位设为 1,将剩余位设置为对应的标签类索引,再将剩下的位设置为具体内容。如何利用剩下的位是主要问题。我们希望最大化利用可用的 60 位。

一个 Cocoa 字符串在概念上是一串 Unicode 码位(Unicode code point)。共有 1,112,064 个有效 Unicode 码位,因此一个码位需要 21 位来表示。这意味着我们可以将两个码位放入这 60 位中,但会浪费 18 位。我们可以借用其中一些额外位来存储长度,这样标签化字符串就可以包含零个、一个或两个码位。不过,只限于两个码位似乎用处不大。

NSString API 实际上基于 UTF-16 实现,而非原始 Unicode 码点。UTF-16 将 Unicode 表示为一系列 16 位值。最常见的码点位于基本多文种平面(Basic Multilingual Plane, BMP)内,可容纳于单个 16 位值中;而超过 65,535 的码点则需要两个 16 位值。我们可以在可用的 60 位中容纳三个 16 位值,剩余 12 位。借用部分位存储长度信息,将允许我们表示 0 至 3 个 UTF-16 编码单元。这最多能表示三个 BMP 内的码点,或一个超出 BMP 的码点(外加可选的一个 BMP 内码点)。但限制在三个字符内仍然相当紧凑。

应用程序中的大多数字符串可能都是 ASCII 编码的。即使应用被本地化为非 ASCII 语言,字符串的用途也远不止于显示用户界面。它们还用于 URL 组件、文件扩展名、对象键、属性列表(property list)值等诸多场景。UTF-8 编码是一种兼容 ASCII 的编码方式,它将每个 ASCII 字符编码为单个字节,而对于其他 Unicode 码点(code point)则最多使用四个字节。我们可以在分配给我们的 60 位空间中放入七个字节,并剩余 4 位用于存储长度信息。这使得我们的标签化字符串(tagged strings)最多可以容纳七个 ASCII 字符,或者根据具体字符情况,容纳略少一些的非 ASCII 字符。

如果我们针对 ASCII 进行优化,那不妨彻底放弃完整的 Unicode 支持。毕竟,包含非 ASCII 字符的字符串可以使用真正的对象来存储。ASCII 是一种七位编码,那么如果我们每个字符只分配 7 位呢?这样就能在可用的 60 位中存储多达八个 ASCII 字符,并且还有 4 位剩余用于存储长度。这听起来开始变得实用了。应用程序中很可能存在大量纯 ASCII 且长度不超过八个字符的字符串。

能否进一步挖掘?完整的 ASCII 范围包含大量不常用字符,例如许多控制字符和特殊符号。字母数字字符构成了使用主体。那么 6 位编码空间能容纳什么呢?

6 位可表示 64 个值。ASCII 字母表中有 26 个字母,包含大小写后增至 52 个,再加上 0-9 数字则达到 62 个。剩余两个位置可分配给空格和句点等常用符号。许多字符串可能仅包含这些字符 —— 按每位 6 位计算,60 位存储空间足以容纳 10 个字符!但等等,我们没有剩余位数存储长度信息。因此要么存储 9 个字符加一个长度值,要么移除 64 个可能值中的一个(我建议删除空格字符),并对短于 10 个字符的字符串采用零值作为终止符。

5 位方案怎么样?这并非完全不合理。例如,很可能存在许多仅包含小写字母的字符串。5 位能表示 32 种可能值。如果包含全部小写字母表,还会多出 6 个值,你可以将其分配给更常见的大写字母、一些符号、数字或某种组合。如果你发现其中某些其他可能性更为常见,你甚至可以移除一些不那么常用的小写字母,比如q。每个字符使用 5 位,如果为长度信息预留空间,可以存储 11 个字符;如果借用一个符号并用作终止符,则可以存储 12 个字符。

我们能更进一步吗?对于固定大小的字母表,5 位大概是你能合理达到的极限。你可以转而采用可变长编码(variable length encoding),例如使用霍夫曼编码(Huffman code)。这将允许常见的字母e比字母q用更少的位来编码。在极端情况下(虽然不太可能),如果你的字符串全是e,理论上每位字符平均只需约 1 位。但这会以更复杂且可能更慢的代码为代价。

苹果采取了哪种方案?让我们一探究竟。

带标记字符串示例 这里是一小段代码,它创建一个带标记字符串(tagged string)并打印其指针值:

NSString *a = @"a";
NSString *b = [[a mutableCopy] copy];
NSLog(@"%p %p %@", a, b, object_getClass(b));

这段 mutableCopy / copy 的操作是必要的,原因有二。首先,尽管像 @"a" 这样的字符串可能以标签指针(tagged pointer)形式存储,但常量字符串(constant strings)永远不会是标签指针。常量字符串必须在不同操作系统版本间保持二进制兼容(binary compatible),而标签指针的内部细节并不保证兼容。只要标签指针始终在运行时由 Apple 的代码生成,这没问题;但如果编译器将它们嵌入到你的二进制文件中,比如常量字符串的情况,这就会失效。因此,我们需要复制常量字符串以获得一个标签指针。

之所以需要 mutableCopy,是因为 NSString 对我们来说太 “聪明” 了,它知道对一个不可变字符串进行 copy 是一个无意义的操作,于是会将原始字符串作为 “拷贝” 返回。常量字符串是不可变的,因此 [a copy] 的结果就是 a 本身。一次可变拷贝(mutable copy)会强制进行实际复制,然后对结果再做一次不可变拷贝(immutable copy),就足以说服系统给我们一个标签指针的字符串。

请注意,你在自己的代码中永远不要依赖这些细节! NSString 代码决定是否为你提供标签指针(tagged pointer)的情况可能随时改变,如果你编写的代码以某种方式依赖于此,那些代码最终将会出错。幸运的是,你必须刻意才会那么做,而所有正常、合理的代码都会运行良好,对标签任何东西都一无所知。

以下是我电脑上输出的内容:

0x10ba41038 0x6115 NSTaggedPointerString

首先看到原始指针,是一个漂亮的整数,表明是一个对象指针。拷贝是第二个值,它的 taggedness(标签化)非常明显。它是一个奇数,这意味着它不可能是有效的对象指针。它还是一个小数字,完全在 64 位 Mac 地址空间开头的未映射且不可映射的 4GB 零页内,这使得它成为对象指针的可能性更加微乎其微。

我们能从这个值 0x6115 中推断出什么?我们知道底部四位是 tagged pointer(标签指针)机制本身的一部分。最低 nybble(半字节)5 是二进制 0101。最低位表示它是一个 tagged pointer。接下来的三位表示 tagged class(标签类)。这里的位是 010,表示 tagged string class(标签字符串类)占据索引 2。不过,这些信息并没有太多用处。

值开头的 61 是有启发性的。十六进制的 61 恰好是小写字母 a 的 ASCII(美国信息交换标准代码)值,这正是字符串包含的内容。看起来这里使用了直接的 ASCII 编码。很方便!

类名清楚地表明了该类的用途,并为研究实现该功能的实际代码提供了一个良好的起点。我们稍后会探讨具体代码,但首先让我们继续进行一些外部检查。

这里有一个循环,它构建形式为abcdef...的字符串,并逐个打印出来,直到不再获得 tagged pointer(标记指针)为止:

NSMutableString *mutable = [NSMutableString string];
NSString *immutable;
char c = 'a';
do {
[mutable appendFormat: @"%c", c++];
immutable = [mutable copy];
NSLog(@"0x%016lx %@ %@", immutable, immutable, object_getClass(immutable));
} while(((uintptr_t)immutable & 1) == 1);

第一次迭代打印:

0x0000000000006115 a NSTaggedPointerString

这与我们之前看到的情况一致。注意,我打印出了带所有前导零的完整指针,以便在后续迭代比较时更加清晰。让我们来与第二次迭代进行比较:

0x0000000000626125 ab NSTaggedPointerString

如我们所料,底部的四个比特位没有变化。那个 5 将保持不变,始终表明这是一个 NSTaggedPointerString 类型的 tagged pointer(标签指针)。

原来的 61 停留在原位,现在又出现了一个 62。62 当然是字母 ‘b’ 的 ASCII code(ASCII 码),所以我们可以看出这种编码是一种使用 ASCII 的八位编码。末尾 5 之前的那四个比特位从 1 变成了 2,这暗示它可能表示长度。后续的迭代证实了这点:

0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString

原以为这就是终点了。标签指针(tagged pointer)已经饱和,下一次迭代就会分配一个对象并终止循环,对吧?错了!

0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fd275800030 abcdefghij __NSCFString

循环又经过了两次迭代才终止。长度部分持续增加,但标签指针的其余部分变成了乱码。这是怎么回事?让我们转向实现代码一探究竟。

反汇编

NSTaggedPointer 类位于 CoreFoundation 框架中。它本应位于 Foundation 中,但如今许多核心 Objective-C 类已被迁移到 CoreFoundation—— 随着苹果逐渐放弃让 CoreFoundation 保持独立实体的地位。

我们首先来看 -[NSTaggedPointerString length]: 的实现:

push rbp
mov rbp, rsp
shr rdi, 0x4
and rdi, 0xf
mov rax, rdi
pop rbp
ret

Hopper 提供了这个便捷的反编译功能与之配合:

unsigned long long -[NSTaggedPointerString length](void * self, void * _cmd) {
rax = self >> 0x4 & 0xf;
return rax;
}

简单来说,要获取长度,需要提取第 4 至 7 位并返回。这验证了我们上面的观察:终止字节(terminal 5)之前的四位表示了字符串的长度。

NSString 子类的另一个基础方法是 characterAtIndex:。我将跳过冗长的反汇编过程,直接展示 Hopper 的反编译输出,这输出相当易读:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long arg2) {
rsi = _cmd;
rdi = self;
r13 = arg2;
r8 = ___stack_chk_guard;
var_30 = *r8;
r12 = rdi >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = rdi >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= r13) {
rbx = r8;
___CFExceptionProem(rdi, rsi);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + r13 + 0xffffffffffffffc0) & 0xff;
if (*r8 != var_30) {
rax = __stack_chk_fail();
}
return rax;
}

我们来稍微清理一下。前三行只是 Hopper 告诉我们哪些寄存器存放哪些参数。我们可以直接把所有 rsi 替换为 _cmd,把 rdi 替换为 selfarg2 实际上是索引参数,所以把所有 r13 替换为 index。然后我们会去掉 __stack_chk 相关的内容,因为它只是一项安全加固措施,与该方法的实际工作原理无关。经过这样清理后,代码如下:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
r12 = self >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

对象

紧接着第一个 if 语句的是这行代码:

r12 = self >> 0x4 & 0xf

我们可以识别出,这和我们在前面实现 -length 时所见的长度提取代码完全一致。让我们继续把 r12 全部替换为 length

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

查看 if 语句内部,第一行将 self 右移 8 位。底部 8 位是内务信息:tagged pointer(标签指针)标识和字符串长度。剩余部分,我们推测就是实际数据。让我们用 stringData 替换 rbx,使这更清晰。下一行似乎将某种查找表放入 rcx,所以让我们用 table 替换 rcx。最后,rdx 获得 length 值的副本。看起来它之后会用作某种游标,所以让我们用 cursor 替换 rdx。现在我们得到:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

差不多所有东西都有标注了。还剩一个原始寄存器名:rbp。这实际上是帧指针(frame pointer),所以编译器在直接基于帧指针进行索引时耍了些花招。加上常数 0xffffffffffffffbf 是补码 “一切本质上都是无符号整数” 方式下减去 65 的操作。后面又减去 64。这很可能都是栈上的同一个局部变量。鉴于使用了按字节索引,它可能是一个放在栈上的缓冲区。但这很奇怪,因为有一条代码路径只从这个缓冲区读取数据,却从未写入过。这到底是怎么回事?

事实证明,Hopper 忘记反编译外层 if 语句的 else 分支了。相关的汇编代码如下:

mov rax, rdi
shr rax, 0x8
mov qword [ss:rbp+var_40], rax

var_40是 Hopper 反汇编工具中显示偏移量为 64 时的表示方式(40 是 64 的十六进制形式)。我们把这个位置的指针称为 buffer。这个缺失分支的 C 语言版本看起来会像这样:

*(uint64_t *)buffer = self >> 8

我们继续插入该操作,并将其他使用 rbp 访问 buffer 的地方替换为更易读的代码版本,同时添加 buffer 的声明以提醒我们当前情况:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(buffer + index) & 0xff;
return rax;
}

有进展了。但那些复杂的指针操作语句读起来还是有点吃力,实际上它们不过就是数组索引操作。来修正一下:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
buffer[cursor - 1] = table[stringData & 0x3f];
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
buffer[cursor - 1] = table[stringData & 0x1f];
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = buffer[index];
return rax;
}

现在我们有些进展了。

根据长度的不同,可以看到存在三种情况。长度值小于 8 会进入那个缺失的 else 分支,该分支仅将 self 的值移位后存入 buffer。这是纯 ASCII 的情况。此处,index 用于索引 self 的值以提取特定字节,随后将其返回给调用者。由于 ASCII 字符值在 ASCII 范围内与 Unicode 码点(code points)对应,无需额外操作即可正确输出数值。我们之前猜测此情况下字符串存储的是纯 ASCII,这证实了这一点。

长度为 8 或以上的情况呢?如果长度在 8 到 10(0xa)之间,代码会进入一个循环。这个循环提取 stringData 的最低 6 位,将其作为 table 的索引,然后将对应的值复制到 buffer 中。接着它将 stringData 右移 6 位并重复此过程,直到处理完整个字符串。这是一种六位编码(six-bit encoding),其中原始六位值到 ASCII 字符值的映射存储在 table 中。字符串的临时版本在 buffer 中构建,最后的索引操作从中提取所需的字符。

那么长度为 10 或以上的情况呢?此处的代码几乎完全相同,区别在于它每次处理五位(five bits at a time)而不是六位。这是一种更紧凑的编码,允许标记指针字符串(tagged pointer)存储最多 11 个字符,但仅使用一个 32 值的字母表。这将使用 table 的前半部分作为其截断的字母表(truncated alphabet)。

因此我们可以看到标记指针字符串的结构是:

  • 如果长度在 0 到 7 之间,则将字符串存储为原始的八位字符。

  • 如果长度为 8 或 9,使用六位编码存储字符串,使用的字母表为 “eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P - B79AFKEWV_zGJ / HYX”。

  • 如果长度为 10 或 11,使用五位编码存储字符串,使用的字母表为 “eilotrm.apdnsIc ufkMShjTRxgC4013”。

让我们与之前生成的数据进行比较:

0x0000000000006115 a NSTaggedPointerString
0x0000000000626125 ab NSTaggedPointerString
0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString
0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fbad9512010 abcdefghij __NSCFString

0x0022038a01169585 的二进制展开,减去底部八位后,分为六位一块,结果为:

001000 100000 001110 001010 000000 010001 011010 010101

使用这些索引在表格中查找,我们可以看出它确实拼出了” abcdefgh”。同样地,0x0880e28045a54195 的二进制展开,去掉最低八位后分成六位一组,得到的结果是:

001000 100000 001110 001010 000000 010001 011010 010101 000001

我们可以看到这是同一个字符串,只是在末尾加上了 i。

但随后情况就失控了。在这之后,本应切换到五位编码并再生成两个字符串,但它却开始分配长度为 10 的对象。怎么回事?

五位字母表(five-bit alphabet)的字符集极其有限,甚至不包含字母 b!这个字母显然不够常见,无法跻身五位字母表中神圣的 32 个字符之列。让我们再试一次,但这次从字母 c 开始字符串。输出如下:

0x0000000000006315 c NSTaggedPointerString
0x0000000000646325 cd NSTaggedPointerString
0x0000000065646335 cde NSTaggedPointerString
0x0000006665646345 cdef NSTaggedPointerString
0x0000676665646355 cdefg NSTaggedPointerString
0x0068676665646365 cdefgh NSTaggedPointerString
0x6968676665646375 cdefghi NSTaggedPointerString
0x0038a01169505685 cdefghij NSTaggedPointerString
0x0e28045a54159295 cdefghijk NSTaggedPointerString
0x01ca047550da42a5 cdefghijkl NSTaggedPointerString
0x39408eaa1b4846b5 cdefghijklm NSTaggedPointerString
0x00007fbd6a511760 cdefghijklmn __NSCFString

我们现在拥有了长度最高可达 11 的标记字符串。最后两个标记字符串的二进制展开式如下:

01110 01010 00000 10001 11010 10101 00001 10110 10010 00010
01110 01010 00000 10001 11010 10101 00001 10110 10010 00010 00110

这正符合我们的预期。

创建标记字符串(Tagged Strings) 既然我们已经了解了标记字符串的编码方式,我就不对创建它们的代码进行过多详细说明了。相关代码位于一个名为 __CFStringCreateImmutableFunnel3 的私有函数中,该函数在一个庞大的函数体内处理所有可想象的到的字符串创建情况。这个函数包含在 CoreFoundation 的开源版本中(可在 opensource.apple.com 获取),但别兴奋:标记指针字符串(tagged pointer strings)的相关代码并未包含在开源版本中。

这里的代码本质上是上述过程的逆操作。如果字符串的长度和内容符合标记指针字符串所能容纳的范围,它就会逐步构建一个标记指针,其中包含 ASCII、六位编码(six-bit)或五位编码(five-bit)字符。有一个查找表的逆操作。上文中看到的作为常量字符串出现的表,在 Funnel3 函数中是一个名为 sixBitToCharLookup 的全局变量,并且存在一个对应的名为 charToSixBitLookup 的表。

那个奇特的表(That Weird Table) 完整的六位编码表是:

eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX

这里自然引出一个问题:为什么它的顺序如此奇怪?

由于该表格同时适用于六位编码(six-bit encoding)和五位编码(five-bit encoding),它不完全按字母顺序排列是有道理的。使用最频繁的字符应置于前半部分,而使用频率较低的字符则应放在后半部分。这样可以确保最大数量的较长字符串能采用五位编码。

不过,一旦进行了这种划分,各个半区内部的顺序就无关紧要了。这两个半区本身本可以按字母顺序排列,但实际上并没有。

表格开头的几个字母顺序,与英语中字母按频率出现的顺序相似。英语中最常见的字母是 E,然后是 T,接着是 A、O、I、N 和 S。E 恰好位于表格的最开头,其他几个也靠近开头。该表格似乎是按使用频率排序的。它与英语频率的差异可能源于:Cocoa 应用程序中的短字符串并非英语散文中词语的随机选取,而是一种更专业化的语言表达。

我推测 Apple 最初打算使用一种更复杂的变长编码(variable-length encoding),很可能基于霍夫曼编码(Huffman code)。但后来他们发现这太困难,或者不值得费力,又或者只是时间不够,于是退回到上面所见的不那么 ambitious 的版本,即字符串在每字符八位、六位或五位的定长编码(constant-length encoding)之间做出选择。这张奇怪的表格作为残留物留存了下来,如果他们未来决定采用变长编码,这便是一个起点。这纯属猜测,但在我看来情况就是这样。

结语
Tagged pointer 是一项非常酷的技术。字符串是它的一个不寻常的应用,但很明显 Apple 在此投入了大量思考,他们必定看到了显著的益处。了解它的构造方式以及他们如何从极其有限的存储空间中榨取最大价值,是件有趣的事。

今天就到这里。下次再会,继续探索超凡脱俗的奇异领域。Friday Q & A 由读者的建议驱动,因此如果你有任何希望在此探讨的主题想法,请发送过来!


#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html

Tagged pointers are an interesting technology used to increase performance and reduce memory usage. As of OS X 10.10, NSString got the tagged pointer treatment, and today I’m going to take a look at how it works, a topic suggested by Ken Ferry.

RecapObjects are aligned in memory, such that their address is always at least a multiple of the pointer size, and in practice typically a multiple of 16. Object pointers are stored as a full 64-bit integer, but this alignment means that some of the bits will always be zero.

Tagged pointers take advantage of this fact to give special meaning to object pointers where those bits are not zero. In Apple’s 64-bit Objective-C implementation, object pointers with the least significant bit set to one (which is to say, odd numbers) are considered tagged pointers. Instead of doing the standard isa dereference to figure out the class, the next three bits are considered as an index into a tagged class table. This index is used to look up the class of the tagged pointer. The remaining 60 bits are then left up to the tagged class to use as they please.

A simple use for this would be to make suitable NSNumber instances be tagged pointers, with the extra 60 bits used to hold the numeric value. The bottom bit would be set to 1. The next three bits would be set to the appropriate index for the NSNumber tagged class. The following 60 bits could then hold, for example, any integer value that fits within that space.

From the outside, such a pointer would look and act like any other object. It responds to messages like any other object, because objc_msgSend knows about tagged pointers. If you ask it for its integerValue, it will extract the data from those 60 bits and return it to you. However, you’ve saved a memory allocation, a pointer indirection on every access, and reference counting can be a no-op since there’s no memory to free. For commonly used classes, this can make for a substantial performance improvement.

NSString doesn’t seem like a good candidate for tagged pointers, since it’s of variable length and can be much longer than the 60 bits in a tagged pointer. However, a tagged pointer class can coexist with a normal class, using tagged pointers for some values and normal pointers for others. For example, with NSNumber, any integer larger than 2^60 - 1 won’t fit in the tagged pointer, and instead needs to be stored in a normal NSNumber objcet allocated in memory. As long as the code that creates the objects is written appropriately, this all works fine.

NSString could do the same thing. For strings that fit inside 60 bits, it could create a tagged pointer. Other strings would be placed in normal objects. This assumes that small strings are used often enough for this to be a noticeable performance gain. Is that actually the case in real-world code? It appears Apple thinks it is, since they went ahead and implemented it.

Possible ImplementationsBefore we look at how Apple did things, let’s take a moment to think about how we might implement tagged string storage. The basics are simple: set the bottom bit to one, set the remaining bits to the appropriate tagged class index, set the remaining bits to whatever. How to use the remaining bits is the big question. We want to take the maximum advantage of the 60 bits available to us.

A Cocoa string is conceptually a sequence of Unicode code points. There are 1,112,064 valid Unicode code points, so one code point takes 21 bits to represent. That means we could fit two code points into these 60 bits, with 18 bits wasted. We could borrow some of those extra bits to hold a length, so that a tagged string could be zero, one, or two code points. Being limited to only two code points doesn’t seem very useful, though.

The NSString API is actually implemented in terms of UTF-16, not raw Unicode code points. UTF-16 represents Unicode as a sequence of 16 bit values. The most common code points, which occupy the Basic Multilingual Plane or BMP, fit into a single 16 bit value, while code points above 65,535 require two. We could fit three 16 bit values into the 60 bits available, with 12 bits left over. Borrowing some bits for the length would allow us to represent 0-3 UTF-16 code units. This would allow up to three code points in the BMP, and one code point beyond the BMP (plus, optionally, one code point within it). Being limited to three is still pretty tight though.

Most strings in an app are probably ASCII. Even if the app is localized into a non-ASCII language, strings are used for far more than just displaying UI. They’re used for URL components, file extensions, object keys, property list values, and much more. The UTF-8 encoding is an ASCII-compatible encoding that encodes each ASCII character as a single byte, while using up to four bytes for other Unicode code points. We can fit seven bytes in the 60 bits allotted to us, with 4 bits left over to use as a length. This allows our tagged strings to hold seven ASCII characters, or somewhat fewer non-ASCII characters, depending on exactly what they are.

If we’re optimizing for ASCII, we might as well drop full Unicode support altogether. Strings containing non-ASCII characters can use real objects, after all. ASCII is a seven-bit encoding, so what if we allot only 7 bits per character? That lets us store up to eight ASCII characters in the 60 bits available, plus 4 bits left over for the length. This is starting to sound useful. There are probably a lot of strings in an app which are pure ASCII and contain eight characters or fewer.

Can we take it further? The full ASCII range contains a lot of stuff that isn’t frequently used. There are a bunch of control characters, for example, and unusual symbols. Alphanumerics make up most of what’s used. What can we squeeze into 6 bits?

6 bits results in 64 possible values. There are 26 letters in the ASCII alphabet. Including both upper and lower case brings it up to 52. Including all digits 0-9 brings it up to 62. There are two spots to spare, which we might give to, say, space and period. There are probably a lot of strings which only contain these characters. At 6 bits each, we can fit ten in our 60 bits of storage! But wait! We don’t have any leftover bits to store the length. So either we store nine characters plus a length, or we remove one of the 64 possible values (I nominate the space to be the one to go) and use zero as a terminator for strings shorter than ten characters.

How about 5 bits? This isn’t totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q. 5 bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.

Can we take it further? 5 bits is about as far as you can reasonably go with a fixed-size alphabet. You could switch to a variable length encoding, for example using a Huffman code. This would allow the letter e, which is common, to be encoded with fewer bits than the letter q. This might allow for as little as 1 bit per character, in the unlikely case that your string is all es. This would come at the cost of more complex and presumably slower code.

Which approach did Apple take? Let’s find out.

Exercising Tagged StringsHere’s a bit of code that creates a tagged string and prints its pointer value:

NSString *a = @"a";
NSString *b = [[a mutableCopy] copy];
NSLog(@"%p %p %@", a, b, object_getClass(b));

The mutableCopy/copy dance is necessary for two reasons. First, although a string like @“a” could be stored as a tagged pointer, constant strings are never tagged pointers. Constant strings must remain binary compatible across OS releases, but the internal details of tagged pointers are not guaranteed. This works fine as long as tagged pointers are always generated by Apple’s code at runtime, but it would break down if the compiler embedded them in your binary, as would be the case with a constant string. Thus we need to copy the constant string to get a tagged pointer.

The mutableCopy is necessary because NSString is too clever for us, and knows that a copy of an immutable string is a pointless operation, and returns the original string as the “copy.” Constant strings are immutable, so the result of [a copy] is just a. A mutable copy forces it to make an actual copy, and then making an immutable copy of the result is enough to convince the system to give us a tagged pointer string.

Note that you must never depend on these details in your own code! The circumstances in which the NSString code decides to give you a tagged pointer could change at any time and if you write code that somehow relies on it, that code will eventually break. Fortunately, you have to go out of your way to do that, and all normal, sensible code will work fine, blissfully ignorant of tagged anything.

Here’s what the above code prints on my computer:

0x10ba41038 0x6115 NSTaggedPointerString

You can see the original pointer first, a nice round number indicative of an object pointer. The copy is the second value, and its taggedness is abundantly clear. It’s an odd number, which means it can’t be a valid object pointer. It’s also a small number, well within the unmapped and unmappable 4GB zero page at the beginning of the 64-bit Mac address space, making it doubly imposible to be an object pointer.

What can we deduce from this value of 0x6115? We know that the bottom four bits are part of the tagged pointer mechanism itself. The lowest nybble 5 is 0101 in binary. The bottom one bit indicates that it’s a tagged pointer. The next three bits indicate the tagged class. Here those bits are 010, indicating that the tagged string class occupies index 2. Not that there’s much to do with that information.

The 61 that leads the value is suggestive. 61 in hexadecimal happens to be the ASCII value of the lowercase letter a, which is exactly what the string contains. It looks like there’s a straight ASCII encoding in use. Convenient!

The class name makes it obvious what this class is for, and gives us a good starting point when it comes to looking into the actual code that implements this feature. We’ll get to that shortly, but let’s do some more outside inspection first.

Here’s a loop that builds up strings of the form abcdef… and prints them out one by one until it stops getting tagged pointers back:

NSMutableString *mutable = [NSMutableString string];
NSString *immutable;
char c = 'a';
do {
[mutable appendFormat: @"%c", c++];
immutable = [mutable copy];
NSLog(@"0x%016lx %@ %@", immutable, immutable, object_getClass(immutable));
} while(((uintptr_t)immutable & 1) == 1);

The first iteration prints:

0x0000000000006115 a NSTaggedPointerString

This matches what we saw above. Note that I’m printing out the full pointer with all leading zeroes to make things a bit more clear when comparing subsequent iterations. Let’s compare with the second iteration:

0x0000000000626125 ab NSTaggedPointerString

The bottom four bits didn’t change, as we’d expect. That 5 will remain constant, always indicating that this is a tagged pointer of type NSTaggedPointerString.

The original 61 stays where it was, joined now by a 62. 62 is, of course, the ASCII code for b, so we can see that this encoding is an eight-bit encoding that uses ASCII. The four bits just before the terminal 5 changed from 1 to 2, suggesting that this might be the length. Subsequent iterations confirm this:

0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString

Presumably that’s the end of it. The tagged pointer is full, and the next iteration will allocate an object and terminate the loop. Right? Wrong!

0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fd275800030 abcdefghij __NSCFString

The loop goes through two more iterations before terminating. The length section continues to increase, but the rest of the tagged pointer turns into gibberish. What’s going on? Let’s turn to the implementation to find out.

DisassemblyThe NSTaggedPointer class lives in the CoreFoundation framework. It seems like it ought to live in Foundation, but a lot of the core Objective-C classes have been moved to CoreFoundation these days as Apple slowly gives up on making CoreFoundation an independent entity.

Let’s start by looking at the implementation of -[NSTaggedPointerString length]:

push rbp
mov rbp, rsp
shr rdi, 0x4
and rdi, 0xf
mov rax, rdi
pop rbp
ret

Hopper provides this handy decompilation to go along with it:

unsigned long long -[NSTaggedPointerString length](void * self, void * _cmd) {
rax = self >> 0x4 & 0xf;
return rax;
}

In short, to obtain the length, extract bits 4-7 and return them. This confirms what we observed above, that the four bits just before the terminal 5 indicate the length of the string.

The other primitive method for an NSString subclass is characterAtIndex:. I’ll skip the lengthy disassembly and go straight to Hopper’s decompiled output, which is pretty readable:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long arg2) {
rsi = _cmd;
rdi = self;
r13 = arg2;
r8 = ___stack_chk_guard;
var_30 = *r8;
r12 = rdi >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = rdi >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= r13) {
rbx = r8;
___CFExceptionProem(rdi, rsi);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + r13 + 0xffffffffffffffc0) & 0xff;
if (*r8 != var_30) {
rax = __stack_chk_fail();
}
return rax;
}

Let’s clean this up a little. The first three lines are just Hopper telling us which registers get which arguments. Let’s go ahead and replace all uses of rsi with _cmd and rdi with self. arg2 is actually the index parameter, so let’s replace all uses of r13 with index. And we’ll get rid of the __stack_chk stuff, as it’s just a security hardening thing and not relevant to the actual workings of the method. Here’s what the code looks like when cleaned up in this way:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
r12 = self >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

Right before the first if statement is this line:

r12 = self >> 0x4 & 0xf

We can recognize this as the same length extraction code that we saw in the implementation of -length above. Let’s go ahead and replace r12 with length throughout:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

Looking inside the if statement, the first line shifts self right by 8 bits. The bottom 8 bits are bookkeeping: the tagged pointer indicator, and the string length. The remainder is then, we presume, the actual data. Let’s replace rbx with stringData to make that more clear. The next line appears to put some sort of lookup table into rcx, so let’s replace rcx with table. Finally, rdx gets a copy of the value of length. It looks like it gets used as some sort of cursor later, so let’s replace rdx with cursor. Here’s what we have now:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}

That’s almost everything labeled. One raw register name remains: rbp. That’s actually the frame pointer, so the compiler is doing something tricky indexing directly off the frame pointer. Adding the constant 0xffffffffffffffbf is the two’s-complement “everything is ultimately an unsigned integer” way to subtract 65. Later on, it subtracts 64. This is all probably the same local variable on the stack. Given the bytewise indexing going on, it’s probably a buffer placed on the stack. But it’s weird, because there’s a code path which does nothing but read from that buffer, without ever writing to it. What’s going on?

It turns out that what’s going on is Hopper forgot to decompile the else branch of that outer if statement. The relevant assembly looks like this:

mov rax, rdi
shr rax, 0x8
mov qword [ss:rbp+var_40], rax

var_40 is how Hopper shows that offset of 64 in the disassembly. (40 being the hexadecimal version of 64.) Let’s call the pointer to this location buffer. The C version of this missing branch would look like:

*(uint64_t *)buffer = self >> 8

Let’s go ahead and insert that, and replace the other places where rbp is used to access buffer with more readable versions of the code, plus add a declaration of buffer to remind us what’s going on:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(buffer + index) & 0xff;
return rax;
}

Getting better. All those crazy pointer manipulation statements are a bit hard to read, though, and they’re really just array indexing. Let’s fix those up:

unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
buffer[cursor - 1] = table[stringData & 0x3f];
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
buffer[cursor - 1] = table[stringData & 0x1f];
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = buffer[index];
return rax;
}

Now we’re getting somewhere.

We can see that there are three cases depending on the length. Length values less than 8 go through that missing else branch that just dumps the value of self, shifted, into buffer. This is the plain ASCII case. Here, index is used to index into the value of self to extract the given byte, which is then returned to the caller. Since ASCII character values match Unicode code points within the ASCII range, there’s no additional manipulation needed to make the value come out correctly. We guessed above that the string stored plain ASCII in this case, and this confirms it.

What about the cases where the length is 8 or more? If the length is 8 or more but less than 10 (0xa), then the code enters a loop. This loop extracts the bottom 6 bits of stringData, uses that as an index into table, and then copies that value into buffer. It then shifts stringData down by 6 bits and repeats until it runs through the entire string. This is a six-bit encoding where the mapping from raw six-bit values to ASCII character values is stored in the table. A temporary version of the string is built up in buffer, and the indexing operation at the end then extracts the requested character from it.

What about the case where the length is 10 or more? The code there is almost identical, except that it works five bits at a time instead of six. This is a more compact encoding that would allow the tagged string to store up to 11 characters, but only using an alphabet of 32 values. This will use the first half of table as its truncated alphabet.

Thus we can see that the structure of the tagged pointer strings is:

  • If the length is between 0 and 7, store the string as raw eight-bit characters.

  • If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX”.

  • If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013”

Let’s compare with the data we generated earlier:

0x0000000000006115 a NSTaggedPointerString
0x0000000000626125 ab NSTaggedPointerString
0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString
0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fbad9512010 abcdefghij __NSCFString

The binary expansion of 0x0022038a01169585 minus the bottom eight bits and divided into six-bit chunks is:

001000 100000 001110 001010 000000 010001 011010 010101

Using these to index into the table, we can see that this does indeed spell out “abcdefgh”. Similarly, the binary expansion of0x0880e28045a54195` minus the bottom eight bits and divided into six-bit chunks is:

001000 100000 001110 001010 000000 010001 011010 010101 000001

We can see that it’s the same string, plus i at the end.

But then it goes off the rails. After this point, it should switch to a five-bit encoding and give us two more strings, but instead it starts allocating objects at length 10. What gives?

The five-bit alphabet is extremely limited, and doesn’t include the letter b! That letter must not be common enough to warrant a place in the 32 hallowed characters of the five-bit alphabet. Let’s try this again, but instead start the string at c. Here’s the output:

0x0000000000006315 c NSTaggedPointerString
0x0000000000646325 cd NSTaggedPointerString
0x0000000065646335 cde NSTaggedPointerString
0x0000006665646345 cdef NSTaggedPointerString
0x0000676665646355 cdefg NSTaggedPointerString
0x0068676665646365 cdefgh NSTaggedPointerString
0x6968676665646375 cdefghi NSTaggedPointerString
0x0038a01169505685 cdefghij NSTaggedPointerString
0x0e28045a54159295 cdefghijk NSTaggedPointerString
0x01ca047550da42a5 cdefghijkl NSTaggedPointerString
0x39408eaa1b4846b5 cdefghijklm NSTaggedPointerString
0x00007fbd6a511760 cdefghijklmn __NSCFString

We now have tagged strings all the way up to a length of 11. The binary expansions of the last two tagged strings are:

01110 01010 00000 10001 11010 10101 00001 10110 10010 00010
01110 01010 00000 10001 11010 10101 00001 10110 10010 00010 00110

Exactly what we’re expecting.

Creating Tagged StringsSince we know how tagged strings are encoded, I won’t go into much detail on the code that creates them. The code in question is found within a private function called __CFStringCreateImmutableFunnel3 which handles every conceivable string creation case all in one gigantic function. This function is included in the open source release of CoreFoundation available on opensource.apple.com, but don’t get excited: the tagged pointer strings code is not included in the open source version.

The code here is essentially the inverse of what’s above. If the string’s length and content fits what a tagged pointer string can hold, it builds a tagged pointer piece by piece, containing either ASCII, six-bit, or five-bit characters. There’s an inverse of the lookup table. The table seen above as a constant string lives as a global variable called sixBitToCharLookup, and there’s a corresponding table called charToSixBitLookup in the Funnel3 function.

That Weird TableThe full six-bit encoding table is:

eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX

A natural question here is: why is it in such a strange order?

Because this table is used for both six-bit and five-bit encodings, it makes sense that it wouldn’t be entirely in alphabetical order. Characters that are used most frequently should be in the first half, while characters that are used less frequently should go in the second half. This ensures that the maximum number of longer strings can use the five-bit encoding.

However, with that division in place, the order within the individual halves doesn’t matter. The halves themselves could be sorted alphabetically, but they’re not.

The first few letters in the table is similar to the order in which letters appear in English, sorted by frequency. The most common letter in English is E, then T, then A, O, I, N, and S. E is right at the beginning of the table, and the others are near the beginning. The table appears to be sorted by frequency of use. The discrepancy with English probably arises because short strings in Cocoa apps aren’t a random selection of words from English prose, but a more specialized bit of language.

I speculate that Apple originally intended to use a fancier variable-length encoding, probably based on a Huffman code. This turned out to be too difficult, or not worth the effort, or they just ran out of time, and so they dialed it back to the less ambitious version seen above, where strings choose between constant-length encodings using eight, six, or five bits per character. The weird table lives on as a leftover, and a starting point if they decide to go for a variable-length encoding in the future. This is pure guesswork, but that’s what it looks like to me.

ConclusionTagged pointers are a really cool technology. Strings are an unusual application of it, but it’s clear that Apple put a lot of thought into it and they must see a significant benefit. It’s interesting to see how it’s put together and how they got the most out of the very limited storage available.

That’s it for today. Come back next time for more strange explorations of the trans-mundane. Friday Q&A is driven by reader suggestions, so if you have an idea for a subject you’d like to see covered here, please send it in!