GCD Practicum | TommyWu's Lab

Published: September 25, 2009

Author: TommyWu

Translation · Original: Friday Q&A 2009-09-25: GCD Practicum · Author: Mike Ash

Original: https://www.mikeash.com/pyblog/friday-qa-2009-09-25-gcd-practicum.html · Published: 2009-09-25 · Author: Mike Ash · Translator: MiMo (mimo-v2.5-pro); code blocks are kept in the original English

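Welcome back to another Friday Q&A. I’m off to C4 today (hope to see you there!), but I’ve prepared this in advance so everyone stuck at home (or worse, at work) can at least have something interesting to read. Over the past four weeks I’ve introduced Grand Central Dispatch (GCD) and discussed the various facilities it provides. In Part I I covered the basics of GCD and how to use dispatch queues. In Part II I discussed how to use GCD to extract more performance from multi-core machines. In Part III I discussed GCD’s event dispatching mechanism, and in Part IV I took care of various odds and ends that I hadn’t covered before. This week I’m going to examine a practical application: using GCD to speed up the production of thumbnails for a large quantity of images, a topic suggested by Willie Abrams.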

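Overview

I’m going to walk through the parallelization of this program in four steps. The first step is the basic serial program, and the following steps work through building it into a fully parallel program using GCD. If you’d like to follow along, you can get the full source code for all four steps. Don’t run imagegcd2.m, though. You’ll see why in a bit.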

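The Original Program

The program we’re going to work with is a simple one: it walks the contents of ~/Pictures and generates a thumbnail for every JPEG file it finds. It’s a pure command-line program, albeit one that uses Cocoa to do most of the work. This is what its main function looks like: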

    // Start(), End() and ThumbnailDataForData() are helpers defined
    // elsewhere in the full source (not shown in this excerpt).
    int main(int argc, char **argv)
    {
        NSAutoreleasePool *outerPool = [NSAutoreleasePool new];

        NSApplicationLoad();

        // Start from an empty output directory.
        NSString *destination = @"/tmp/imagegcd";
        [[NSFileManager defaultManager] removeItemAtPath: destination error: NULL];
        [[NSFileManager defaultManager] createDirectoryAtPath: destination
                                        withIntermediateDirectories: YES
                                        attributes: nil
                                        error: NULL];

        Start();

        NSString *dir = [@"~/Pictures" stringByExpandingTildeInPath];
        NSDirectoryEnumerator *enumerator = [[NSFileManager defaultManager] enumeratorAtPath: dir];
        int count = 0;
        for(NSString *path in enumerator)
        {
            // Release temporary objects after every file.
            NSAutoreleasePool *innerPool = [NSAutoreleasePool new];

            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                // The enumerator yields paths relative to dir.
                path = [dir stringByAppendingPathComponent: path];

                NSData *data = [NSData dataWithContentsOfFile: path];
                if(data)
                {
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        // Thumbnails are named 0.jpg, 1.jpg, ...
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg", count++];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        [thumbnailData writeToFile: thumbnailPath atomically: NO];
                    }
                }
            }

            [innerPool release];
        }

        End();

        [outerPool release];
    }

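Naïve Parallelization

At first glance this looks pretty easy to parallelize. Each iteration of the loop can be pushed onto a GCD global queue. We can wait for them all to finish at the end by using a dispatch group. One last trick: to ensure that each iteration still gets a unique number for its filename, we’ll use OSAtomicIncrement32 to atomically increment count. This is what the new code looks like: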

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        // One job per directory entry, pushed straight onto the global queue.
        dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                NSString *fullPath = [dir stringByAppendingPathComponent: path];

                NSData *data = [NSData dataWithContentsOfFile: fullPath];
                if(data)
                {
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        // The atomic increment gives every job a unique file number.
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                   OSAtomicIncrement32(&count)];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        [thumbnailData writeToFile: thumbnailPath atomically: NO];
                    }
                }
            }
        }));
    }
    // Block until every job in the group has finished.
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
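
The counter trick can be looked at in isolation. The following is a minimal sketch in plain C (which also compiles as Objective-C); it uses C11 `atomic_fetch_add` as a stand-in for `OSAtomicIncrement32`, and the function name `next_thumbnail_index` is hypothetical. Like `OSAtomicIncrement32`, it returns the value after the increment, which is why the counter starts at -1: the first thumbnail then gets the number 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically adds 1 to *counter and returns the NEW value.
 * atomic_fetch_add returns the old value, so 1 is added to it.
 * Hypothetical helper, standing in for OSAtomicIncrement32. */
int next_thumbnail_index(atomic_int *counter)
{
    assert(counter != NULL);
    return atomic_fetch_add(counter, 1) + 1;
}
```

Because the read-modify-write is a single atomic operation, two jobs calling this at the same time can never receive the same number, which a plain `count++` from multiple threads would not guarantee.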

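If you ignored my warning and ran it anyway, you’re probably just reloading this page after rebooting your computer. If you haven’t run it, what happens (at least if you have a lot of pictures) is that your computer locks up, and you probably can’t recover it unless you wait much longer than you’d like.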

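The Problem

What’s causing all this trouble? The problem lies in GCD’s smarts. GCD runs tasks on a global thread pool whose size is scaled in response to system load. For example, my computer has four cores, so if I load GCD up with work, it will run four worker threads to keep every core busy. If something else on my computer starts doing work, GCD will scale back a bit to give the other task some room.

However, GCD can also increase the number of active threads. It does this when one of the worker threads blocks. Imagine these four worker threads running, and then suddenly one of them does something like, say, read a file. It goes off to wait for the disk, and your cores are now under-utilized. GCD sees this and spawns another worker thread to fill the gap.

Now, think about what happens here. The main loop is pushing jobs onto the global queue extremely quickly. GCD starts off with a few worker threads and begins popping jobs off the queue. These jobs perform a trifling amount of work up front and then immediately go off to read a file from the disk. The slow, spinning disk.

And let’s not forget another important property of disks: unless you have an SSD or a fancy RAID, they get substantially slower under contention.

The first four jobs all hit the disk at the same time, and the disk goes crazy trying to satisfy all four requests at once. GCD, which only looks at CPU usage, sees that the CPU cores are sitting mostly idle and starts spawning more worker threads. These threads also slam into the disk wall, causing GCD to spawn yet more threads, and so on.

Eventually the file reads begin to complete. Now, instead of four threads for four cores, there are hundreds. GCD will scale back if too many worker threads are using CPU time, but it’s limited in when it can do so. It can’t kill a worker thread in the middle of a job, and it can’t even pause one. It has to wait until an entire job has completed before it can kill the thread that job is running on. All of these in-flight jobs prevent GCD from reducing the worker thread count.

All these hundreds of threads finish reading their image data and begin processing. They get in each other’s way on the CPU as well, although the CPU handles contention much better than the disk does. The trouble is that the first thing these threads do once they have the file data is decode it. If you have a lot of JPEGs, this image data is going to expand by a factor of 10 or more. With hundreds of these in flight, you’ll start to blow out your memory. What happens when you run out of physical RAM? More disk usage!

Now you have a vicious feedback cycle. Disk contention causes more worker threads, which causes more memory usage, which causes more disk contention. The process runs away until GCD hits its limit of 512 worker threads. With typical picture sizes, 512 in-flight jobs is more than enough to send your system into swap hell, from which it will take a long time to recover. Quite likely you won’t even be able to kill the job for some time.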
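
To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain C. The 512-job limit and the factor-of-10 expansion come from the text above; the 2 MiB JPEG size and the function name `decoded_bytes_in_flight` are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated bytes of decoded image data held by all in-flight jobs,
 * assuming every job has decoded one JPEG of the given size. */
uint64_t decoded_bytes_in_flight(uint64_t jobs,
                                 uint64_t jpeg_bytes,
                                 uint64_t expansion_factor)
{
    assert(jobs > 0 && jpeg_bytes > 0 && expansion_factor > 0);
    return jobs * jpeg_bytes * expansion_factor;
}
```

With 512 jobs, a hypothetical 2 MiB file, and a tenfold expansion, `decoded_bytes_in_flight(512, 2u * 1024 * 1024, 10)` comes to 10 GiB of decoded pixels. With only four jobs in flight, the same estimate is 80 MiB.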

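This is something you really have to watch out for when using GCD. GCD is great at limiting the number of concurrent jobs to match CPU usage, but it does nothing about contention over other resources. If your jobs do IO or anything else that could block for a while, you need to beware of this problem.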

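The Fix

The root of this whole problem was IO contention leading to runaway feedback. Remove the contention, remove the problem.

GCD makes this easy with custom queues. Custom queues are inherently serial. If we create a custom queue just for IO and put all file reading and writing onto that queue, then the disk will only be hit for one file at a time and the contention disappears.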
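
The "one IO job at a time" idea can be sketched outside GCD as well. The following plain C sketch uses a POSIX mutex as a stand-in for a serial dispatch queue; the names `io_lane`, `io_lane_run`, and `count_job` are hypothetical. It is only an approximation: a serial queue runs jobs asynchronously and in FIFO order, whereas this sketch runs each job in the calling thread and makes no ordering promise between waiting callers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    int active;      /* jobs currently inside the lane */
    int max_active;  /* highest value of `active` ever observed */
    int completed;   /* jobs that have finished */
} io_lane;

void io_lane_init(io_lane *lane)
{
    assert(lane != NULL);
    pthread_mutex_init(&lane->lock, NULL);
    lane->active = 0;
    lane->max_active = 0;
    lane->completed = 0;
}

/* Runs job(ctx) while holding the lane's lock, so two IO jobs never overlap. */
void io_lane_run(io_lane *lane, void (*job)(void *), void *ctx)
{
    pthread_mutex_lock(&lane->lock);
    lane->active++;
    if (lane->active > lane->max_active)
        lane->max_active = lane->active;
    job(ctx);
    lane->active--;
    lane->completed++;
    pthread_mutex_unlock(&lane->lock);
}

/* Example job: stands in for a file read by bumping a counter. */
void count_job(void *ctx)
{
    ++*(int *)ctx;
}
```

However many threads call `io_lane_run`, `max_active` never exceeds 1, which is the property the IO queue provides for the disk.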

Here’s the main loop of our program redone to use an IO queue:

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);
    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
        {
            NSString *fullPath = [dir stringByAppendingPathComponent: path];

            // Read on the serial IO queue: one file at a time.
            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                NSData *data = [NSData dataWithContentsOfFile: fullPath];
                if(data)
                    // Decode and scale on the concurrent global queue.
                    dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                        NSData *thumbnailData = ThumbnailDataForData(data);
                        if(thumbnailData)
                        {
                            NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                       OSAtomicIncrement32(&count)];
                            NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                            // Write the thumbnail back on the serial IO queue.
                            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                                [thumbnailData writeToFile: thumbnailPath atomically: NO];
                            }));
                        }
                    }));
            }));
        }
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

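The problem with this version is that it’s inherently unstable, because the different stages are not synchronized. The flow of data in this code looks like this: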

    Main Thread          IO Queue            Concurrent Queue

    find paths  ------>  read  ----------->  process
                                             ...
                         write <-----------  process

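Now imagine a machine where the disk is fast enough to read files faster than the CPU can process them. This isn’t all that hard to imagine: although the CPU is much faster, it’s also doing much more work. The data read from the disk begins to pile up in the queue. This data takes up memory, possibly substantial amounts of memory if you have a lot of big pictures.

Then you run out of physical RAM and begin to swap.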
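
The pile-up is simple arithmetic: whatever the disk delivers beyond what the CPU consumes stays buffered. Here is a minimal sketch in plain C; the function name `backlog_bytes` and the rates used below are hypothetical, and the model assumes both rates stay constant.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of file data waiting in the queue after `seconds`,
 * given a constant read rate and processing rate (both per second). */
uint64_t backlog_bytes(uint64_t read_rate, uint64_t process_rate, uint64_t seconds)
{
    assert(seconds > 0);
    if (read_rate <= process_rate)
        return 0;  /* the CPU keeps up, so nothing accumulates */
    return (read_rate - process_rate) * seconds;
}
```

For example, if the disk delivered 100 MB/s and the CPU worked through 60 MB/s of JPEG data, a minute of running would leave `backlog_bytes(100, 60, 60)`, or 2400 MB, of unread file data sitting in memory, and nothing in the third version of the program puts a ceiling on that figure.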

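This can lead to another runaway feedback loop like the first one. If anything causes a worker thread to block, GCD will spin off a new one, which will immediately start trying to allocate a bunch of memory and block because of the ongoing memory pressure. GCD will spin off more jobs, causing more memory pressure, and you’re back in swap hell again.

What’s interesting about this feedback is that, unlike the first GCD attempt, it’s self-regulating to some extent. As IO contention goes through the roof, the IO queue will come to a halt and won’t make any significant progress until the situation has regained sanity. Once it does, you’re back to low memory usage and good throughput until the buffered data builds up too far again.

End result: the program alternates between smooth processing and being bogged down.

Note that if the disk is slower, the same problem can still occur because the thumbnails will be buffered at the end of the run, but it’s likely to be much less severe because the quantity of data is so much smaller.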

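Really Fixing the Problem

Since the problem with the last attempt was a lack of synchronization between the different phases of the operation, let’s synchronize them. The simple way to do this is to use a semaphore to limit the number of jobs in flight at any given time.

One question remains: how many jobs should we allow?

Obviously it should scale with the number of CPU cores in the system, because we want to take advantage of whatever is available. Simply limiting it to the number of CPU cores is a bad idea, though, because much of each job is IO. And it can’t be too high, because then we’ll run out of memory.

I decided on having twice as many jobs as CPU cores. My reasoning is that this will scale up to the point where IO takes as long as processing. If IO takes longer than processing, then IO will be the bottleneck anyway, so there’s no sense in having more concurrent jobs than this. If IO takes significantly less time than processing, then GCD will automatically keep the number of worker threads low enough to ensure minimal contention on the CPU.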
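
One way to read that reasoning is as a small calculation. The sketch below is plain C; `job_limit_for_cores` is simply the rule of thumb from the text, while `jobs_to_keep_cores_busy` is a back-of-the-envelope model that is an assumption of this sketch rather than something the article states. Both function names are hypothetical.

```c
#include <assert.h>

/* The rule of thumb from the text: twice as many jobs as CPU cores. */
long job_limit_for_cores(long cores)
{
    assert(cores > 0);
    return cores * 2;
}

/* Assumed model: if each job spends io_ms waiting on IO and cpu_ms
 * computing, roughly cores * (io_ms + cpu_ms) / cpu_ms jobs in flight
 * keep every core busy. The result is rounded up. */
long jobs_to_keep_cores_busy(long cores, long io_ms, long cpu_ms)
{
    assert(cores > 0 && io_ms >= 0 && cpu_ms > 0);
    return (cores * (io_ms + cpu_ms) + cpu_ms - 1) / cpu_ms;
}
```

Under this model, four cores with equal IO and processing time need `jobs_to_keep_cores_busy(4, 10, 10)` = 8 jobs, which matches `job_limit_for_cores(4)`. When IO is cheaper than processing, the model asks for fewer jobs than the limit, so the limit is not the constraint.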

This is what the main loop now looks like:

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);

    // Allow at most two jobs per core in flight.
    int cpuCount = [[NSProcessInfo processInfo] processorCount];
    dispatch_semaphore_t jobSemaphore = dispatch_semaphore_create(cpuCount * 2);

    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        WithAutoreleasePool(^{
            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                NSString *fullPath = [dir stringByAppendingPathComponent: path];

                // Wait for a free slot before queueing the next read.
                dispatch_semaphore_wait(jobSemaphore, DISPATCH_TIME_FOREVER);

                dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                    NSData *data = [NSData dataWithContentsOfFile: fullPath];
                    dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                        NSData *thumbnailData = ThumbnailDataForData(data);
                        if(thumbnailData)
                        {
                            NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                       OSAtomicIncrement32(&count)];
                            NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                                [thumbnailData writeToFile: thumbnailPath atomically: NO];
                                // Job finished: free its slot.
                                dispatch_semaphore_signal(jobSemaphore);
                            }));
                        }
                        else
                            // No thumbnail was produced: still free the slot.
                            dispatch_semaphore_signal(jobSemaphore);
                    }));
                }));
            }
        });
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
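
The semaphore’s role in the loop above can be sketched on its own. The following plain C sketch builds a counting semaphore from a POSIX mutex and condition variable as a stand-in for `dispatch_semaphore_t`; the type `job_gate` and its functions are hypothetical names, and `job_gate_try_enter` is an extra non-blocking variant added here only to make the behavior easy to check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  slot_freed;
    long            free_slots;  /* how many more jobs may start */
} job_gate;

void job_gate_init(job_gate *gate, long limit)
{
    assert(gate != NULL && limit > 0);
    pthread_mutex_init(&gate->lock, NULL);
    pthread_cond_init(&gate->slot_freed, NULL);
    gate->free_slots = limit;
}

/* Blocks until a slot is free, then takes it
 * (the role dispatch_semaphore_wait plays above). */
void job_gate_enter(job_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    while (gate->free_slots == 0)
        pthread_cond_wait(&gate->slot_freed, &gate->lock);
    gate->free_slots--;
    pthread_mutex_unlock(&gate->lock);
}

/* Non-blocking variant: takes a slot only if one is free right now. */
bool job_gate_try_enter(job_gate *gate)
{
    bool entered = false;
    pthread_mutex_lock(&gate->lock);
    if (gate->free_slots > 0) {
        gate->free_slots--;
        entered = true;
    }
    pthread_mutex_unlock(&gate->lock);
    return entered;
}

/* Returns a slot and wakes one waiter
 * (the role dispatch_semaphore_signal plays above). */
void job_gate_leave(job_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    gate->free_slots++;
    pthread_cond_signal(&gate->slot_freed);
    pthread_mutex_unlock(&gate->lock);
}
```

The important detail, both here and in the loop above, is that every path through a job must give its slot back exactly once. That is why the main loop signals the semaphore both after a successful write and in the `else` branch when no thumbnail was produced.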

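Benchmarking

I obtained the following runtimes on a library of 7913 pictures:

It’s interesting that version 3 performed as well as it did. I did observe it exhibiting the cycling behavior I discussed, but not too often. Most likely this is because my machine has 15GB of RAM. On a system with less memory it’s likely to perform substantially worse. I observed it using up to 10GB of RAM at one point. If I compile it as 32-bit, it rapidly runs out of virtual memory and crashes. Version 4 never uses any significant amount of RAM.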

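Conclusion

GCD is a fantastic piece of technology and does a lot of useful things, but it can’t do everything for you. In particular, concurrent jobs that perform IO and have the potential to use a lot of memory must be managed carefully. Even so, the facilities that GCD provides make it easy to construct a system that will not overwhelm the computer’s resources.

That wraps up this week’s Friday Q&A. Come back next week for another exciting edition. In the meantime, send me your topics to discuss!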

#Original (English)

Source: https://www.mikeash.com/pyblog/friday-qa-2009-09-25-gcd-practicum.html

Welcome back to another Friday Q&A. I’m off to C4 today (hope to see you there!) but I’ve prepared this in advance so everyone stuck at home (or worse, work) can at least have something interesting to read. Over the past four weeks I’ve introduced Grand Central Dispatch and discussed the various facilities it provides. In Part I I talked about the basics of GCD and how to use dispatch queues. In Part II I discussed how to use GCD to extract more performance from multi-core machines. In Part III I discussed GCD’s event dispatching mechanism, and in Part IV I took care of various odds and ends that I hadn’t covered before. This week I’m going to examine a practical application of using GCD to speed up the production of thumbnails for a large quantity of images, a topic suggested by Willie Abrams.

Overview

I’m going to walk through the parallelization of this program in four steps. The first step will be the basic serialized program, and the following steps work through building it into a fully parallel program using GCD. If you’d like to follow along, you can get the full source code for all four steps. Don’t run imagegcd2.m though. You’ll see why in a bit.

The Original Program

The program that we’re going to work with is a simple thing that goes through the contents of ~/Pictures and generates thumbnails for everything inside. It’s a pure command-line program, albeit using Cocoa to do most of the work. This is what its main function looks like:

    int main(int argc, char **argv)
    {
        NSAutoreleasePool *outerPool = [NSAutoreleasePool new];

        NSApplicationLoad();

        NSString *destination = @"/tmp/imagegcd";
        [[NSFileManager defaultManager] removeItemAtPath: destination error: NULL];
        [[NSFileManager defaultManager] createDirectoryAtPath: destination
                                        withIntermediateDirectories: YES
                                        attributes: nil
                                        error: NULL];

        Start();

        NSString *dir = [@"~/Pictures" stringByExpandingTildeInPath];
        NSDirectoryEnumerator *enumerator = [[NSFileManager defaultManager] enumeratorAtPath: dir];
        int count = 0;
        for(NSString *path in enumerator)
        {
            NSAutoreleasePool *innerPool = [NSAutoreleasePool new];

            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                path = [dir stringByAppendingPathComponent: path];

                NSData *data = [NSData dataWithContentsOfFile: path];
                if(data)
                {
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg", count++];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        [thumbnailData writeToFile: thumbnailPath atomically: NO];
                    }
                }
            }

            [innerPool release];
        }

        End();

        [outerPool release];
    }

Naïve Parallelization

At first glance this looks pretty easy to parallelize. Each iteration through the loop can be pushed onto a GCD global queue. We can wait for them all to finish at the end by using a dispatch group. One last trick: to ensure that each iteration still gets a unique number for its filename, we’ll use OSAtomicIncrement32 to atomically increment count. This is what the new code looks like:

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                NSString *fullPath = [dir stringByAppendingPathComponent: path];

                NSData *data = [NSData dataWithContentsOfFile: fullPath];
                if(data)
                {
                    NSData *thumbnailData = ThumbnailDataForData(data);
                    if(thumbnailData)
                    {
                        NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                   OSAtomicIncrement32(&count)];
                        NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                        [thumbnailData writeToFile: thumbnailPath atomically: NO];
                    }
                }
            }
        }));
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

If you ignored my warning and ran it anyway, you’re probably just reloading this page after rebooting your computer. If you haven’t run it, what happens (if you have a lot of pictures, at least) is that your computer locks up and you probably can’t fix it unless you wait much longer than you’d really like to.

The Problem

What’s causing all this trouble? The problem lies in GCD’s smarts. GCD runs tasks on a global thread pool whose size is scaled in response to system load. For example, my computer has four cores, and so if I load up GCD with work, GCD will run four worker threads to load every core. If something else on my computer starts doing work, GCD will scale back a bit to give the other task some room.

However, GCD can also increase the number of active threads. It will do this if one of the worker threads blocks. Imagine these four worker threads running and then suddenly one of them does something like, oh, let’s say, read a file. It goes off to wait for the disk, and your cores are being under-utilized. GCD will see this situation and spawn another worker thread to fill the gap.

Now, think about what happens here. The main loop is pushing jobs onto the global queue extremely quickly. GCD will start off with a few worker threads and start popping jobs off the queue. These jobs perform a trifling amount of work up front and then immediately they go off and read a file from the disk. The slow, spinning disk.

And let’s not forget another important property of the disk: unless you have an SSD or a fancy RAID, they get substantially slower under contention.

These first four jobs all hit the disk at the same time, which goes crazy trying to fill all four requests at once. GCD, which only looks at CPU usage, sees that the CPU cores are sitting mostly idle and starts spawning more worker threads. These threads also slam into the disk wall, causing GCD to spawn yet more threads, etc.

Eventually the file reads begin to complete. Now, instead of four threads for the four cores, there are hundreds. GCD will scale back if there are too many worker threads using CPU time, but it’s limited in when it can scale back. It can’t kill worker threads in the middle of a job, and can’t even pause them. It has to wait until an entire job has completed before it can kill the thread that job is on. All of these pending in-flight jobs prevent GCD from reducing the worker thread count.

All these hundreds of threads start to finish reading their image data and begin processing. They get in each other’s way on the CPU as well, although the CPU handles contention much better than the disk. The trouble is, the first thing these threads do once they have the file data is decode it. If you have a lot of JPEGs, this image data is going to expand by a factor of 10 or more. With hundreds of these things in flight, you’ll start to blow out your memory. What happens when you run out of physical RAM? More disk usage!

Now you have a vicious feedback cycle. Disk contention causes more worker threads, which causes more memory usage, which causes more disk contention. The process runs away until GCD hits its limit of 512 worker threads. With typical picture sizes, 512 in-flight jobs is more than enough to send your system into swap hell from which it will take a long time to recover. Quite likely you won’t even be able to kill the job for quite some time.

This is something you really have to watch out for when using GCD. GCD is great for limiting the number of concurrent jobs for CPU usage, but it will do nothing about contention over other resources. If your jobs do IO or anything else that could block for a while, you need to beware of this problem.

The Fix

The root of this whole problem was IO contention leading to runaway feedback. Remove the contention, remove the problem.

GCD makes this easy with custom queues. Custom queues are inherently serialized. If we create a custom queue just for IO and put all file reading/writing onto that queue, then the disk will only be hit up for one file at a time and the contention disappears.

Here’s the main loop of our program redone to use an IO queue:

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);
    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
        {
            NSString *fullPath = [dir stringByAppendingPathComponent: path];

            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                NSData *data = [NSData dataWithContentsOfFile: fullPath];
                if(data)
                    dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                        NSData *thumbnailData = ThumbnailDataForData(data);
                        if(thumbnailData)
                        {
                            NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                       OSAtomicIncrement32(&count)];
                            NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                                [thumbnailData writeToFile: thumbnailPath atomically: NO];
                            }));
                        }
                    }));
            }));
        }
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

The problem is that it’s inherently unstable because the different parts are not synchronized. The flow of data in this code looks like this:

    Main Thread          IO Queue            Concurrent Queue

    find paths  ------>  read  ----------->  process
                                             ...
                         write <-----------  process

Now imagine a machine where the disk is fast enough to read files faster than the CPU can process them. This isn’t all that hard to imagine: although the CPU is much faster, it’s also doing much more work. The data read from the disk begins to pile up in the queue. This data takes up memory, possibly substantial amounts of memory if you have a lot of big pictures.

Then you run out of physical RAM and begin to swap.

This can lead to another runaway feedback loop like the first one. If anything causes the worker thread to block, GCD will spin off a new one, which will immediately start trying to allocate a bunch of memory and block because of the ongoing memory pressure. GCD will spin off more jobs, causing more memory pressure, and you’re back in swap hell again.

What’s interesting about this feedback is that, unlike the first GCD attempt, it’s self-regulating to some extent. As IO contention goes through the roof, the IO queue will come to a halt, and won’t make any significant progress until the situation has regained sanity. Once it does, you’re back to low memory usage and good throughput until the buffered data builds up too far again.

End result: the program alternates between smooth processing and being bogged down.

Note that if the disk is slower the same problem can still occur because the thumbnails will be buffered at the end of the run, but it’s likely to be much less severe because the quantity of data is so much smaller.

Really Fixing the Problem

Since the problem with the last attempt was a lack of synchronization between the different phases of the operation, let’s synchronize them. The simple way to do this is to use a semaphore to limit the number of jobs in flight at any given time.

One question remains: how many jobs should we allow?

Obviously it should scale with the number of CPU cores in the system, because we want to take advantage of whatever is available. Simply limiting to the number of CPU cores is a bad idea, though, because much of each job is IO. And it can’t be too high, because then we’ll run out of memory.

I decided on having twice the number of jobs as CPU cores. My reasoning is that this will scale up to the point where IO takes as long as processing. If IO takes longer than processing, then IO will be the bottleneck anyway, and so there’s no sense in having more concurrent jobs than this. If IO takes significantly less time than processing, then GCD will automatically keep the number of worker threads low enough to ensure minimal contention on the CPU.

This is what the main loop now looks like:

    dispatch_queue_t globalQueue = dispatch_get_global_queue(0, 0);
    dispatch_queue_t ioQueue = dispatch_queue_create("com.mikeash.imagegcd.io", NULL);

    int cpuCount = [[NSProcessInfo processInfo] processorCount];
    dispatch_semaphore_t jobSemaphore = dispatch_semaphore_create(cpuCount * 2);

    dispatch_group_t group = dispatch_group_create();
    __block uint32_t count = -1;
    for(NSString *path in enumerator)
    {
        WithAutoreleasePool(^{
            if([[[path pathExtension] lowercaseString] isEqual: @"jpg"])
            {
                NSString *fullPath = [dir stringByAppendingPathComponent: path];

                dispatch_semaphore_wait(jobSemaphore, DISPATCH_TIME_FOREVER);

                dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                    NSData *data = [NSData dataWithContentsOfFile: fullPath];
                    dispatch_group_async(group, globalQueue, BlockWithAutoreleasePool(^{
                        NSData *thumbnailData = ThumbnailDataForData(data);
                        if(thumbnailData)
                        {
                            NSString *thumbnailName = [NSString stringWithFormat: @"%d.jpg",
                                                       OSAtomicIncrement32(&count)];
                            NSString *thumbnailPath = [destination stringByAppendingPathComponent: thumbnailName];
                            dispatch_group_async(group, ioQueue, BlockWithAutoreleasePool(^{
                                [thumbnailData writeToFile: thumbnailPath atomically: NO];
                                dispatch_semaphore_signal(jobSemaphore);
                            }));
                        }
                        else
                            dispatch_semaphore_signal(jobSemaphore);
                    }));
                }));
            }
        });
    }
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

Benchmarking

I obtained the following runtimes, on a library of 7913 pictures:

It’s interesting that version 3 performed as well as it did. I did observe it exhibiting the cycling behavior I discussed, but not too often. Most likely this is because my machine has 15GB of RAM. On a less well endowed system it’s likely to perform substantially worse. I observed it using up to 10GB of RAM at one point. If I compile it as 32-bit then it rapidly runs out of virtual memory and crashes. Version 4 never uses any significant amount of RAM.

Conclusion

GCD is a fantastic piece of technology and does a lot of useful things, but it can’t do everything for you. In particular, concurrent jobs which perform IO and have the potential to use a lot of memory must be managed carefully. Even so, the facilities that GCD provides make it easy to construct a system which will not overwhelm the computer’s resources.

That wraps up this week’s Friday Q&A. Come back next week for another exciting edition. In the meantime, send me your topics to discuss!