OSAtomic 原子操作导览

文章發布時間 2011年3月4日

作者 TommyWu

標籤

译文 · 原文： Friday Q&A 2011-03-04: A Tour of OSAtomic · 作者 Mike Ash

原文：https://www.mikeash.com/pyblog/friday-qa-2011-03-04-a-tour-of-osatomic.html 发布：2011-03-04　作者：Mike Ash 译者：MiMo（mimo-v2.5-pro）；代码块保留英文原样

又到了疯狂的周五问答时间。今天 Paul Kim 建议我带大家探索一下 OSAtomic——Mac OS X 内置的底层无锁线程安全操作函数集合。鉴于周三刚发布双核 iPad 2，他的建议可谓极具前瞻性。

多线程编程与原子操作
正如所有多线程编程者所知，这项工作极其困难，真的非常困难。多线程执行会导致多个线程交互时产生高度不可预测的时序。其结果是，那些看似正确并通过测试的代码可能存在隐蔽缺陷，导致其偶尔神秘地失效。

编写多线程代码时有许多规则需要遵循，但最重要的一条是：必须对共享数据的所有访问进行加锁保护。

共享数据指的是多个线程都可能访问的数据。如果你存在任何共享数据，就必须使用锁来同步访问。另一种多线程编程的思路是放弃共享数据，转而使用消息传递机制 —— 这意味着任意时刻只有一个线程会访问消息及任何相关共享数据。

原子操作允许你绕过这个要求，在不使用锁（locks）的情况下访问共享数据。这里的「atomic」（原子）一词不是用于炸弹或反应堆的意义，而是取自古希腊语中 “indivisible”（不可分割）的含义。一个原子操作（atomic operation）是一个一次性完成的操作，并且在进行中不能被另一个线程（thread）中断。OSAtomic 是 OS X 的此类原子操作库。OSAtomic 函数的头文件（header）位于 OSAtomic.h 头文件中，路径为 /usr/include/libkern/OSAtomic.h。（译注：现代系统中路径可能已变化，且该库可能已被其他库取代）注意，尽管有 kern 部分，这些函数在常规用户空间代码（userland code）中完全可用且有用。你可以通过以下方式导入文件：

1
    #import <libkern/OSAtomic.h>

整数操作
基础操作
自旋锁
队列
内存屏障

在开始之前，我想简要讨论一下内存屏障（memory barrier）。前两个类别中的所有函数都有两个变体。一个是基础函数，另一个是同名函数但在名称末尾添加了 Barrier。我暂时忽略这两类函数之间的差异，将在关于内存屏障的章节中详细说明。在此之前，我只陈述一个关于变体的简单事实：使用 Barrier 变体代替基础变体总是安全的，唯一的问题是会有轻微的性能损失，因此除非你非常清楚自己在做什么，否则请使用 Barrier 变体。

关于数据类型和对齐的简要说明 由于原子操作的细微和底层特性，在数据类型和对齐方面存在显著限制。原子操作仅适用于 32 位和 64 位数据类型，并且在某些平台上（据我所知，PowerPC 是唯一一个 OS X 上这样的平台）它们仅适用于 32 位。始终使用具有确定大小的数据类型，例如 int32_t，而不是像 int 这样的内置类型。

值还必须按其大小进行对齐。这意味着值的地址必须是其大小的整数倍。通常这会由系统自动完成。在 OS X 上，对象和内存分配都按 16 字节边界对齐，编译器会确保各个数据成员都在该范围内分配。只有在自行处理地址尝试设置自己的打包结构或使用紧凑结构体（packed structs）时，才需要担心对齐问题。另外，请注意不要对这类情况使用原子操作。

整数操作（Integer Operations）
这些是头文件中首先出现的操作，其工作方式类似于标准 C 的整数运算。它们都会就地修改值，因此实际上等同于 C 的增强赋值运算符（如 +=）。

既然 C 已经有这些运算符，为什么还需要这些函数？与这个头文件的其他部分一样，原子性（atomicity）是关键所在。考虑以下代码：

1
    x += 1;

1
    fetch value of x
2
    add 1
3
    store new value of x

+=运算符的原子等效操作是 OSAtomicAdd32（在支持 64 位值的平台上也有对应的 64 位版本）。它接受一个待相加的数值和一个指向要修改值的指针。该函数还会返回新值，这使得你能够更好地在不同线程之间进行协调。例如，你可能会让多个线程各自处理数组中的一个元素。通过利用返回值，你可以为每个线程分配一个唯一的索引：

1
    // initialization, before the threads go
2
    int32_t sharedIndex = 0;
3

4
    // in each thread to get the index
5
    int index = OSAtomicAdd32(1, &sharedIndex) - 1;
6
    // subtract 1 because it returns the NEW value

基础操作 上述的整数操作，至少从概念上讲，都是建立在单一的基础原子操作（atomic operation）之上的：比较并交换（compare and swap）。一个比较并交换函数接受一个旧值、一个新值和一个变量指针。当且仅当变量当前的值与旧值相匹配时，它才会用新值替换变量的内容。它还会返回操作是否成功（即变量匹配旧值）或失败（变量不匹配）。为了更清晰地说明，请想象一个像这样的函数：

1
    bool CompareAndSwap(int old, int new, int *value)
2
    {
3
        if(*value == old)
4
        {
5
            *value = new;
6
            return true;
7
        }
8
        else
9
            return false;
10
    }

这为我们提供了构建更复杂、实用的原子操作的基础组件。使用比较并交换（compare and swap）的基本模型是一种事务模型，类似于数据库中的事务。从概念上讲，你编写代码来开启一个事务，执行本地修改，然后尝试提交该事务。提交操作通过比较并交换来完成。如果提交失败 —— 由比较并交换返回 false 表示 —— 则回到起点并开始一个新的事务重新尝试。

OSAtomic 通过 OSAtomicCompareAndSwap 系列函数提供了这一操作。它有针对 32 位整数、指针以及 int 和 long 类型的版本。在支持的平台上，还有一个针对 64 位整数的版本。

为了了解如何使用这些函数，让我们考虑如何用比较并交换来编写 OSAtomicAdd32。它同样采用事务模型。首先，获取原始值。然后进行加法运算得到新值。最后，使用原始值和新值执行比较并交换。如果失败，则返回重试整个过程。代码如下：

1
    int32_t OSAtomicAdd32(int32_t howmuch, volatile int32_t *value)
2
    {
3
        bool success;
4
        int32_t new;
5

6
        do {
7
            int32_t orig = *value;
8
            new = orig + howmuch;
9
            success = OSAtomicCompareAndSwap32(orig, new, value);
10
        } while(!success);
11

12
        return new;
13
    }

例如，下面是一个将节点添加到链表头部的函数：

1
    void AddNode(ListNode *node, ListNode * volatile *head)
2
    {
3
        bool success;
4

5
        do {
6
            ListNode *orig = *head;
7
            node->next = orig;
8
            success = OSAtomicCompareAndSwapPtrBarrier(orig, node, (void *)head);
9
        } while(!success);
10
    }

这是一个用于 “偷取” 链表的伴随函数。该函数将原链表替换为空链表（即 NULL），并返回旧链表头，以便对其执行操作：

1
    ListNode *StealList(ListNode * volatile *head)
2
    {
3
        bool success;
4
        ListNode *orig;
5

6
        do {
7
            orig = *head;
8
            success = OSAtomicCompareAndSwapPtrBarrier(orig, NULL, (void *)head);
9
        } while(!success);
10

11
        return orig;
12
    }

ABA 问题你可能会想，为什么我选择实现 StealList 而不是，比如说，RemoveNode。答案在于 RemoveNode 比看起来更难。你或许认为可以这样实现它：

1
    ListNode *RemoveNode(ListNode * volatile *head)
2
    {
3
        bool success;
4
        ListNode *orig;
5

6
        do {
7
            orig = *head;
8
            ListNode *next = orig->next;
9
            success = OSAtomicCompareAndSwapPtrBarrier(orig, next, (void *)head);
10
        } while(!success);
11

12
        return orig;
13
    }

1
    A -> B -> C

1
    C

1
    A -> C

这是一个相当罕见的情形，但正因其罕见性，反而变得很严重。编写多线程代码时最不希望出现的情况就是造出偶尔才失效的代码。代码要么经常失效，要么根本不失效，否则一旦出问题将极难追踪和修复。

测试并设置
测试并设置（test and set）是一种相对专业化的基础原子操作。它不如比较并交换（compare and swap）实用，但在某些场景下仍具价值。我个人从未遇到过使用它的时机，该操作通常局限于实现锁和信号量。一般而言，使用这些原语的高级抽象比自己尝试实现它们更可取。

自旋锁
自旋锁（spinlock）是一种不依赖任何操作系统设施的原始锁类型。锁通常是一种提供线程间互斥能力的机制：两个线程尝试获取同一个锁，一个成功获取，另一个则等待；当第一个线程解锁后，第二个线程便能获得该锁。

一般而言，当第二个线程处于等待状态时，我们希望它能被阻塞，这样它在阻塞期间就不会占用任何 CPU 时间。这需要操作系统的介入来停止线程，并在解锁时重新启动它。这种操作系统介入会带来一定程度的开销，而这并非总是可取的。

自旋锁（spinlock）极其轻量级，且完全在用户态（userland）中运行。其缺点在于，当一个线程在等待时，它不会被阻塞，而是会持续检查自旋锁，直到其解锁。当锁没有争用（即同一时间只有一个线程在访问）时，自旋锁性能非常出色，但当锁被长时间争用时，其表现会很差。

自旋锁是一种基本值，类型为 OSSpinLock。你将一个指向自旋锁的指针传递给 OSSpinLockTry、OSSpinLockLock 或 OSSpinLockUnlock 来完成各种操作。就基本语义而言，这些与 pthread_mutex 或 NSLock 相同，只是实现方式不同。通常你应使用那些更高级的抽象，但当性能绝对关键且争用情况罕见时，自旋锁就非常有用。

队列
这其实有点名不副实，因为实际提供的设施是一个栈（stack），而不是队列（queue）。不过，既然 OSAtomic 把它们命名为 OSQueue，那就暂且称为 “队列” 吧。

遗憾的是，经过进一步调查，我发现 OSQueue 并非完全线程安全，因此不应使用。由于我无法确定此问题是否以及何时会被修复，你应该避免使用 OSQueue。（译注：在现代系统中，相关 API 可能已有变化或已被弃用。）

内存屏障
多线程编程的一大挑战在于，不仅要应对各种时序和操作系统问题导致的怪异行为，硬件有时也会引发问题。某些架构会为了提升速度而重新排列内存读写操作。这些重排对程序本身是透明的，但对于同时在其他 CPU 上执行的代码则不是。这可能会导致严重问题。

举一个可能引发麻烦的示例场景，考虑以下代码：

1
    volatile Structure *gStructure;
2

3
    // thread 1
4
    Structure *s = malloc(sizeof(s));
5
    s->field1 = 0;
6
    s->field2 = 42;
7
    gStructure = s;
8

9
    // thread 2
10
    Structure *s;
11
    while((s = gStructure) == NULL)
12
        /* poll */;
13
    printf("%d\n", s->field2);

内存屏障（memory barrier）是一种解决该问题的方式。内存屏障强制要求所有屏障前的读取操作必须在任何屏障后的读取操作完成之前全部完成。写入操作同样遵循此规则。技术上，读取和写入可以有独立的屏障，但 OSAtomic 将二者合并为单一概念。

如果你只需要一个基础的内存屏障，可以通过 OSMemoryBarrier 函数实现。该函数无需参数且无返回值，其唯一作用就是充当内存屏障。因此，上述代码可改写为：

1
    // thread 1
2
    Structure *s = malloc(sizeof(s));
3
    s->field1 = 0;
4
    s->field2 = 42;
5
    OSMemoryBarrier();
6
    gStructure = s;
7

8
    // thread 2
9
    Structure *s;
10
    while((s = gStructure) == NULL)
11
        /* poll */;
12
    OSMemoryBarrier();
13
    printf("%d\n", s->field2);

除了这个函数，OSAtomic 还为所有原子操作提供了内存屏障（memory barrier）。所有原子函数的屏障变体（Barrier variants）意味着它们不仅完成指定的原子操作，还包含一个内存屏障。当你执行的原子操作涉及超出你正在操作的单个数据块的更广泛数据时，这非常有用。

作为一般经验法则，独立计数器、标志位以及存在于你正在操作的单个 32 位或 64 位数值内的其他自包含数据，通常不需要屏障。任何涉及原子操作来指示未包含在操作数值中的其他数据状态变化的情况，都需要使用屏障。

当不确定时，就使用屏障。在不需要时使用屏障的唯一缺点是会产生少量性能开销。

结论
多线程编程是困难的。共享数据使其更加困难。不加锁的共享数据则进一步增加了难度。然而，在某些情况下，这种编程方式是有用的，并且当你绝对需要无锁线程代码时，OSAtomic 为你提供了构建基石。

今天的内容就到这里了。再过 14 天，你的友好邻里「Friday Q & A」将继续推出下一期精彩内容。一如既往，Friday Q & A 是由读者驱动的，所以如果你想看到某个主题在这里被探讨，请发送给我们！

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2011-03-04-a-tour-of-osatomic.html

It’s time for another crazy edition of Friday Q&A. Today, Paul Kim has suggested that I give a tour of OSAtomic, OS X’s built-in low-level functions for lockless thread-safe operations. Given the announcement Wednesday of the dual-core iPad 2, this was a particularly prescient suggestion on his part.

Threaded Programming and Atomic Operations As anyone who does threaded programming knows, it’s difficult. Really difficult. Threaded execution results in highly unpredictable timing with how multiple threads interact with each other. The result is that code which appears to be correct and passes tests can be subtly flawed in such a way that it fails rarely and mysteriously.

There are many rules to follow when writing threaded code, but the most important one is this: lock all access to shared data.

Shared data is any data which more than one thread can access. If you have any, then you must use locks to synchronize that access. Another approach to threaded programming is to forego shared data and instead use message passing, which means that only one thread at a time will be accessing the message and any shared data.

Atomic operations allow you to sidestep this requirement and access shared data without using locks. The word “atomic” here is not used in the sense of bombs or reactors, but in the ancient Greek sense of “indivisible”. An atomic operation is one which happens all in one shot, and which can’t be interrupted by another thread while in progress. OSAtomic is OS X’s library of such atomic operations.

The Header OSAtomic functions are located in the OSAtomic.h header, located at /usr/include/libkern/OSAtomic.h. note that despite that kern part, these functions are perfectly usable and useful from regular userland code. You can import the file with:

1
    #import <libkern/OSAtomic.h>

Integer operations
Fundamental operations
Spinlocks
Queues
Memory barriers

Before I begin that, I want to briefly discuss memory barriers. All of the functions in the first two categories have two variants. One is a basic function, and the other is that same function but with Barrier at the end of the name. I will ignore the difference between these two types for now, and cover it all in the section about memory barriers. Until then, I will state one simple fact about the variants: it’s always safe to use the Barrier variant instead of the base variant, with the only problem being a small performance penalty, so use Barrier unless you really know what you’re doing.

A Quick Note on Data Types and Alignment Due to the finnicky and low-level nature of atomic operations, there are significant limits on data types and alignments. Atomic operations are only available for 32-bit and 64-bit data types, and on some platforms (PowerPC being the only OS X one that I know of) they are only available for 32-bit. Always use data types with a guaranteed size, such as int32_t, rather than built-in types ilke int.

The values must also be aligned to their size. This means that the address of the value has to be a multiple of the value’s size. Normally this is done for you. Objects and memory allocations are aligned on 16-byte boundaries on OS X, and the compiler ensures that the individual data members are allocated within that. Alignment should only be a worry if you’re messing around with addresses trying to set up your own packing or using packed structs. And, well, don’t use atomic operations with those.

Integer Operations The integer operations are the first ones found in the header and they operate similarly to standard C integer operations. They all modify a value in-place, so they are really equivalent to the C augmented assigment operators like +=.

Since C already has these operators, why are these functions needed? As with the rest of this header, atomicity is the name of the game. Consider the following code:

1
    x += 1;

1
    fetch value of x
2
    add 1
3
    store new value of x

The atomic equivalent of += is OSAtomicAdd32 (as well as one for 64-bit values on platforms that support it). It takes an amount to add and a pointer to the value to modify. It also returns the new value, which allows you to better coordinate between threads. For example, you might have several threads that each want to work on a single element of an array. By using the return value, you can assign a unique index to each one:

1
    // initialization, before the threads go
2
    int32_t sharedIndex = 0;
3

4
    // in each thread to get the index
5
    int index = OSAtomicAdd32(1, &sharedIndex) - 1;
6
    // subtract 1 because it returns the NEW value

There are also convenience functions equivalent to ++ and —, OSAtomicIncrement and OSAtomicDecrement. Using OSAtomicIncrement32 would make the above code slightly simpler.

There are also functions for doing bitwise logical options. The |=, &=, and ^= operators have atomic equivalents in OSAtomicOr, OSAtomicAnd, and OSAtomicXor. These have somewhat more exotic uses, but can be handy for manipulating bitfields in a thread-safe manner.

Fundamental Operations The above integer operations are all, at least conceptually, built on top of a single fundamental atomic operation: compare and swap. A compare and swap function takes an old value, a new value, and a pointer to a variable. It replaces the variable’s contents with the new value if and only if the variable currently matches the old value. It also returns whether the operation succeeded (the variable matched the old value) or failed (the variable didn’t match). To make this more clear, imagine a function like this:

1
    bool CompareAndSwap(int old, int new, int *value)
2
    {
3
        if(*value == old)
4
        {
5
            *value = new;
6
            return true;
7
        }
8
        else
9
            return false;
10
    }

This gives us the building blocks to create more complex and useful atomic operations. The basic model of using compare and swap is that of a transaction, as you might find in databases. Conceptually, you build code that begins a transaction, carries out a local modification, then attempts to commit the transaction. The commit is done using a compare and swap. If the commit fails, which is indicated by the compare and swap returning false, then you go back to the beginning and start a new transaction to try it again.

OSAtomic provides this operation with the OSAtomicCompareAndSwap family of functions. There is one for 32-bit integers, for pointers, and for int and long. There is also one, on platforms which support it, for 64-bit integers.

To see how you can use these, let’s consider how you would write OSAtomicAdd32 using compare and swap. Again, it works using a transaction model. First, fetch the original value. Then add to obtain a new value. Finally, use compare and swap with the original and new values. If it failed, go back and try everything again. In code:

1
    int32_t OSAtomicAdd32(int32_t howmuch, volatile int32_t *value)
2
    {
3
        bool success;
4
        int32_t new;
5

6
        do {
7
            int32_t orig = *value;
8
            new = orig + howmuch;
9
            success = OSAtomicCompareAndSwap32(orig, new, value);
10
        } while(!success);
11

12
        return new;
13
    }

For example, here is a function which adds a node to the head of a linked list:

1
    void AddNode(ListNode *node, ListNode * volatile *head)
2
    {
3
        bool success;
4

5
        do {
6
            ListNode *orig = *head;
7
            node->next = orig;
8
            success = OSAtomicCompareAndSwapPtrBarrier(orig, node, (void *)head);
9
        } while(!success);
10
    }

Here is a companion function which “steals” the list. This replaces the list with an empty list (which is to say, NULL) and returns the old list head so that it can be operated on:

1
    ListNode *StealList(ListNode * volatile *head)
2
    {
3
        bool success;
4
        ListNode *orig;
5

6
        do {
7
            orig = *head;
8
            success = OSAtomicCompareAndSwapPtrBarrier(orig, NULL, (void *)head);
9
        } while(!success);
10

11
        return orig;
12
    }

ABA Problem You might wonder why I implemented StealList instead of, say, RemoveNode. The answer is that RemoveNode is harder than it looks. You might think that you could implement it like so:

1
    ListNode *RemoveNode(ListNode * volatile *head)
2
    {
3
        bool success;
4
        ListNode *orig;
5

6
        do {
7
            orig = *head;
8
            ListNode *next = orig->next;
9
            success = OSAtomicCompareAndSwapPtrBarrier(orig, next, (void *)head);
10
        } while(!success);
11

12
        return orig;
13
    }

1
    A -> B -> C

1
    C

1
    A -> C

This is a fairly rare scenario, but that very rarity makes it serious. The last thing you want when writing multithreaded code is to create code which fails rarely. It had better fail frequently or not at all, otherwise it’s going to be really hard to track down and fix.

Test and Set Test and set is a somewhat specialized fundamental atomic operation. It is not as useful as compare and swap, but can be handy for certain things. I have not ever had an occasion to use it myself, though, and it’s generally limited to implementing locks and semaphores. It’s generally better to use higher-level abstractions of those than to try to implement them yourself.

Spinlocks A spinlock is a primitive type of lock that does not use any OS facilities. A lock in general is a facility which provides mutual exclusion between threads. Two threads attempt to acquire a lock. One succeeds, the other waits. When the first one unlocks the lock, the second one then acquires it.

Generally, when the second thread is waiting, we want it to be blocked so that it does not take any CPU time while it’s blocked. This requires intervention by the OS to stop the thread, and start it again when unlocking. This OS intervention comes with a certain amount of overhead that is not always desirable.

Spinlocks are extremely lightweight and operate entirely in userland. The downside is that when a thread waits, it’s not blocked, but continues to check the spinlock until it comes unlocked. Spinlocks perform very well when a lock is not contended (only one thread at a time is accessing it) but perform poorly when a lock is contended for an extended period.

A spinlock is a primitive value of type OSSpinLock. You then pass a pointer to the spinlock to OSSpinLockTry, OSSpinLockLock, or OSSpinLockUnlock to accomplish the various operations. In terms of the basic semantics, these are the same as those of pthread_mutex or NSLock, only the implementation differs. You should generally use those higher-level abstractions, but spin locks are useful when performance is absolutely critical and contention is rare.

Queues This is something of a misname, as the facility provided is actually a stack, not a queue. However, OSAtomic calls them OSQueue, so “queues” they are.

Unfortunately, after some additional investigation, I discovered that OSQueue is not entirely thread safe and thus should not be used. Since I have no idea if or when this will be fixed, you should avoid the use of OSQueue.

Memory Barriers A major challenge with multithreaded programming is dealing with the fact that, not only is there a great deal of strange behavior due to timing and OS issues, but the hardware sometimes causes problems as well. Some architectures reorder memory reads and writes for extra speed. These reorderings are hidden from the program by the CPU, but they are not hidden from code executing on other CPUs at the same time. This can cause serious problems.

For an example scenario where this could cause trouble, consider the following code:

1
    volatile Structure *gStructure;
2

3
    // thread 1
4
    Structure *s = malloc(sizeof(s));
5
    s->field1 = 0;
6
    s->field2 = 42;
7
    gStructure = s;
8

9
    // thread 2
10
    Structure *s;
11
    while((s = gStructure) == NULL)
12
        /* poll */;
13
    printf("%d\n", s->field2);

Memory barriers are a way to solve this. A memory barrier forces all all reads before the barrier to complete before any reads after the barrier complete. The same goes for writes. Technically, there can be separate barriers for reads and writes, but OSAtomic rolls them both into a single concept.

If you just want a plain memory barrier, you can get one by using OSMemoryBarrier. This function takes no parameters and returns nothing, as its sole purpose is to act as a memory barrier. The above code could then be restructured as:

1
    // thread 1
2
    Structure *s = malloc(sizeof(s));
3
    s->field1 = 0;
4
    s->field2 = 42;
5
    OSMemoryBarrier();
6
    gStructure = s;
7

8
    // thread 2
9
    Structure *s;
10
    while((s = gStructure) == NULL)
11
        /* poll */;
12
    OSMemoryBarrier();
13
    printf("%d\n", s->field2);

In addition to this function, OSAtomic also offers memory barriers with all of its atomic operations. The Barrier variants of all the atomic functions means that they not only accomplish the given atomic operation, but also incorporate a memory barrier. This is extremely useful when you’re performing an atomic operation which has implications about data beyond the single chunk that you’re operating on.

As a general rule of thumb, standalone counters, flags, and other self-contained bits of data which exist within the single 32-bit or 64-bit value that you’re operating on don’t need barriers. Anything where the atomic operation signals something about data not included in the value that you’re operating on needs a barrier.

When in doubt, use a barrier. The only downside to using one when you don’t need it is a small performance cost.

Conclusion Threaded programming is difficult. Shared data makes it harder. Shared data without locks makes it even harder still. However, there are cases where it can be useful to program that way, and when you absolutely need lockless threaded code, OSAtomic gives you the building blocks.

That’s it for today. Come back in another 14 days for the next exciting edition of your friendly neighborhood Friday Q&A. As always, Friday Q&A is driven by the readers, so if you have an idea for a topic you would like to see covered here, send it in!