When is parallel sorting practical, and how do you write an efficient parallel sort?
I'm working on a parallelization library for the D programming language. Now that I'm pretty happy with the basic primitives (parallel foreach, map, reduce and tasks/futures), I'm starting to think about some higher level parallel algorithms. Among the more obvious candidates for parallelization is sorting.
My first question is, are parallelized versions of sorting algorithms useful in the real world, or are they mostly academic? If they are useful, where are they useful? I personally would seldom use them in my work, simply because I usually peg all of my cores at 100% using a much coarser grained level of parallelism than a single sort() call.
Secondly, it seems like quick sort is almost embarrassingly parallel for large arrays, yet I can't get the near-linear speedups I believe I should be getting. For a quick sort, the only inherently serial part is the first partition. I tried parallelizing a quick sort by, after each partition, sorting the two subarrays in parallel. In simplified pseudocode:
// I tweaked this number a bunch. For anything smaller than this, the
// overhead outweighs the parallelization gains.
const smallestToParallelize = 500;

// Cutoff below which insertion sort beats recursing further (illustrative value).
const someConstant = 32;

void quickSort(T)(T[] array) {
    if(array.length < someConstant) {
        insertionSort(array);
        return;
    }
    size_t pivotPosition = partition(array);
    if(array.length >= smallestToParallelize) {
        // Sort the left subarray in a task pool thread.
        auto myTask = taskPool.execute(quickSort(array[0..pivotPosition]));
        quickSort(array[pivotPosition + 1..$]);
        myTask.workWait();
    } else {
        // Regular serial quick sort.
        quickSort(array[0..pivotPosition]);
        quickSort(array[pivotPosition + 1..$]);
    }
}
Even for very large arrays, where the time the first partition takes is negligible, I can only get about a 30% speedup on a dual core, compared to a purely serial version of the algorithm. I'm guessing the bottleneck is shared memory access. Any insight on how to eliminate this bottleneck or what else the bottleneck might be?
Edit: My task pool has a fixed number of threads, equal to the number of cores in the system minus 1 (since the main thread also does work). Also, the type of wait I'm using is a work wait, i.e. if the task is started but not finished, the thread calling workWait()
steals other jobs off the pool and does them until the one it's waiting on is done. If the task isn't started, it is completed in the current thread. This means that the waiting isn't inefficient. As long as there is work to be done, all threads will be kept busy.
Keep in mind I'm not an expert on parallel sorting, and people make research careers out of it, but...
1) Are they useful in the real world?
Of course they are, if you need to sort something expensive (like strings, or worse) and you aren't already pegging all the cores.
2) Quicksort seems like it would give a linear speedup, but it doesn't. The partition step is a sequential bottleneck; you will see this if you profile, and it will tend to cap out at 2-3x on a quad core.
If you want to get good speedups on a smaller system, you need to ensure that your per-task overheads are really small, and ideally you will want to ensure that you don't have too many threads running, i.e. not much more than 2 on a dual core. A thread pool probably isn't the right abstraction.
If you want to get good speedups on a larger system, you'll need to look at the scan-based parallel sorts; there are papers on this. Bitonic sort is also quite easy to parallelize, as is merge sort. A parallel radix sort can also be useful; there is one in the PPL (if you aren't averse to Visual Studio 11).
I'm no expert but... here is what I'd look at:
First of all, I've heard that as a rule of thumb, algorithms that look at small bits of a problem from the start tend to work better as parallel algorithms.
Looking at your implementation, try making the parallel/serial switch go the other way: partition the array and sort in parallel until you have N segments, then go serial. If you are more or less grabbing a new thread for each parallel case, then N should be ~ your core count. OTOH if your thread pool is of fixed size and acts as a queue of short lived delegates, then I'd use N ~ 2+ times your core count (so that cores don't sit idle because one partition finished faster).
Other tweaks:
Skip the local-level myTask.wait(); and rather have a wrapper function that waits on all the tasks.
"My first question is, are parallelized versions of sorting algorithms useful in the real world" - depends on the size of the data set that you are working on in the real work. For small sets of data the answer is no. For larger data sets it depends not only on the size of the data set but also the specific architecture of the system.
One of the limiting factors that will prevent the expected increase in performance is the cache layout of the system. If the data can fit in the L1 cache of a core, then there is little to gain by sorting across multiple cores as you incur the penalty of the L1 cache miss between each iteration of the sorting algorithm.
The same reasoning applies to chips that have multiple L2 caches and to NUMA (non-uniform memory access) architectures. So the more cores you want to distribute the sorting across, the more the smallestToParallelize constant will need to be increased.
Another limiting factor, which you identified, is shared memory access, or contention over the memory bus. The memory bus can only satisfy a certain number of memory accesses per second, so having additional cores that do essentially nothing but read and write to main memory will put a lot of stress on the memory system.
The last factor that I should point out is the thread pool itself as it may not be as efficient as you think. Because you have threads that steal and generate work from a shared queue, that queue requires synchronization methods; and depending on how those are implemented, they can cause very long serial sections in your code.
I don't know if answers here are applicable any longer or if my suggestions are applicable to D.
Anyway ...
Assuming that D allows it, there is always the possibility of providing prefetch hints to the caches. The core in question requests that data it will soon (not immediately) need be loaded into a certain cache level. In the ideal case the data will have been fetched by the time the core starts working on it. More likely the prefetch will only be partway done, which will at least result in fewer wait states than if the data were fetched "cold."
You'll still be constrained by the overall cache-to-RAM throughput, so you'll need to organize the data such that enough of it sits in a core's exclusive caches that the core can spend a fair amount of time there before having to write back updated data.
The code and data need to be organized around cache lines (typically 64-byte fetch units), the smallest-sized unit in a cache. The upshot is that for two cores, the work needs to be organized so that the memory system does half as much work per core (assuming 100% scalability) as when a single core worked on unorganized data; for four cores, a quarter as much, and so on. It's quite a challenge, but by no means impossible; it just depends on how imaginative you are in restructuring the work. As always, there are solutions that cannot be conceived ... until someone does just that!
I don't know how WYSIWYG D is compared to C - which I use - but in general I think the process of developing scalable applications is helped by how much the developer can influence the compiler's actual machine code generation. For interpreted languages, there will be so much memory work going on in the interpreter that you risk not being able to discern improvements from the general "background noise."
I once wrote a multi-threaded shellsort which ran 70% faster on two cores compared to one and 100% on three cores compared to one. Four cores ran slower than three. So I know the dilemmas you face.
I would like to point you to External Sorting[1] which faces similar problems. Usually, this class of algorithms is used mostly to cope with large volumes of data, but their main point is that they split up large chunks into smaller and unrelated problems, which are therefore really great to run in parallel. You "only" need to stitch together the partial results afterwards, which is not quite as parallel (but relatively cheap compared to the actual sorting).
An external merge sort would also work really well with an unknown number of threads. You just split the workload arbitrarily and give each chunk of n elements to a thread whenever one is idle, until all your work units are done, at which point you can start joining them up.
[1] http://en.wikipedia.org/wiki/External_sorting