Parallel.ForEach can cause an "Out Of Memory" exception when used with an enumerable containing large objects
I am trying to migrate a database where images were stored in the database to a record in the database pointing at a file on the hard drive. I was trying to use Parallel.ForEach to speed up the process, using this method to query out the data.

However, I noticed that I was getting an OutOfMemoryException. I know Parallel.ForEach will query a batch of enumerables to mitigate the cost of overhead, if there is one, of spacing the queries out (so your source will more likely have the next record cached in memory if you do a bunch of queries at once instead of spacing them out). The issue is that one of the records I am returning is a 1-4 MB byte array, and the caching is causing the entire address space to be used up (the program must run in x86 mode, as the target platform will be a 32-bit machine).

Is there any way to disable the caching, or make it smaller, for the TPL?

Here is an example program that shows the issue. It must be compiled in x86 mode to show the issue; if it is taking too long or is not happening on your machine, bump up the size of the array (I found 1 << 20 takes about 30 secs on my machine and 4 << 20 was almost instantaneous).
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        Parallel.ForEach(CreateData(), (data) =>
        {
            data[0] = 1;
        });
    }

    static IEnumerable<byte[]> CreateData()
    {
        while (true)
        {
            yield return new byte[1 << 20]; // 1 MB array
        }
    }
}
So, while what Rick has suggested is definitely an important point, another thing I think is missing is the discussion of partitioning.

Parallel::ForEach will use a default Partitioner<T> implementation which, for an IEnumerable<T> that has no known length, will use a chunk partitioning strategy. What this means is that each worker thread Parallel::ForEach uses to work on the data set will read some number of elements from the IEnumerable<T>, which will then be processed only by that thread (ignoring work stealing for now). It does this to save the expense of constantly having to go back to the source, allocate some new work, and schedule it for another worker thread. So, usually, this is a good thing. However, in your specific scenario, imagine you're on a quad-core and you've set MaxDegreeOfParallelism to 4 threads for your work, and now each of those pulls a chunk of 100 elements from your IEnumerable<T>. Well, that's 100-400 MB right there just for that particular worker thread, right?

So how do you solve this? Easy: you write a custom Partitioner<T> implementation. Now, chunking is still useful in your case, so you probably don't want to go with a single-element partitioning strategy, because then you would introduce overhead from all the task coordination that requires. Instead, I would write a configurable version that you can tune via an appsetting until you find the optimal balance for your workload. The good news is that, while writing such an implementation is pretty straightforward, you don't actually have to write it yourself, because the PFX team already did it and put it into the parallel programming samples project.
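As a side note, if a newer framework is available to you, .NET 4.5 added EnumerablePartitionerOptions.NoBuffering, which gets the one-element-at-a-time behavior without writing any partitioner at all. A minimal sketch against the question's sample program (bounded here so it terminates, and with a counter added so it prints something):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Bounded version of the question's generator so the demo terminates.
    static IEnumerable<byte[]> CreateData(int count)
    {
        for (int i = 0; i < count; i++)
            yield return new byte[1 << 20]; // 1 MB array
    }

    static void Main()
    {
        // NoBuffering hands each worker exactly one element at a time, so
        // only roughly one array per worker thread is alive inside the loop,
        // instead of an ever-growing buffered chunk.
        var partitioner = Partitioner.Create(
            CreateData(100), EnumerablePartitionerOptions.NoBuffering);

        int processed = 0;
        Parallel.ForEach(partitioner, data =>
        {
            data[0] = 1;
            Interlocked.Increment(ref processed);
        });

        Console.WriteLine(processed); // 100
    }
}
```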
This issue has everything to do with partitioners, not with the degree of parallelism. The solution is to implement a custom data partitioner.

If the dataset is large, it seems the mono implementation of the TPL is guaranteed to run out of memory. This happened to me recently (essentially I was running the above loop, and found that the memory increased linearly until it gave me an OOM exception).

After tracing the issue, I found that by default mono will divide up the enumerator using an EnumerablePartitioner class. This class has a behavior in that every time it gives data out to a task, it "chunks" the data by an ever-increasing (and unchangeable) factor of 2. So the first time a task asks for data it gets a chunk of size 1, the next time a chunk of size 2*1=2, the next time 2*2=4, then 2*4=8, etc. The result is that the amount of data handed to the task, and therefore stored in memory simultaneously, increases with the length of the task, and if a lot of data is being processed, an out-of-memory exception inevitably occurs.

Presumably, the original reason for this behavior is to avoid having each thread return multiple times to get data, but it seems to be based on the assumption that all data being processed can fit into memory (not the case when reading from large files).

This issue can be avoided with a custom partitioner, as stated previously. One generic example of one that simply returns the data to each task one item at a time is here:

https://gist.github.com/evolvedmicrobe/7997971

Simply instantiate that class first and hand it to Parallel.ForEach instead of the enumerable itself.
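For reference, a one-item-at-a-time partitioner of this kind can be sketched as below. The class and member names here are mine, not the gist's, and this is a minimal illustration of the idea: every request takes exactly one element from the shared source under a lock, so no worker ever buffers a growing chunk of large arrays.

```csharp
using System;
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hands out exactly one element per request; all workers pull from one
// shared enumerator guarded by a lock.
class SingleItemPartitioner<T> : Partitioner<T>
{
    private readonly IEnumerable<T> _source;
    public SingleItemPartitioner(IEnumerable<T> source) { _source = source; }

    public override bool SupportsDynamicPartitions { get { return true; } }

    public override IList<IEnumerator<T>> GetPartitions(int partitionCount)
    {
        // All partitions share one dynamic-partition object (and its lock).
        IEnumerable<T> dynamicPartitions = GetDynamicPartitions();
        return Enumerable.Range(0, partitionCount)
                         .Select(i => dynamicPartitions.GetEnumerator())
                         .ToList();
    }

    public override IEnumerable<T> GetDynamicPartitions()
    {
        return new OneAtATime(_source.GetEnumerator());
    }

    private class OneAtATime : IEnumerable<T>
    {
        private readonly IEnumerator<T> _shared;
        public OneAtATime(IEnumerator<T> shared) { _shared = shared; }

        public IEnumerator<T> GetEnumerator()
        {
            while (true)
            {
                T item;
                lock (_shared) // take a single element, then release the lock
                {
                    if (!_shared.MoveNext()) yield break;
                    item = _shared.Current;
                }
                yield return item;
            }
        }

        IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
    }
}

class Demo
{
    static IEnumerable<byte[]> CreateData(int count)
    {
        for (int i = 0; i < count; i++)
            yield return new byte[1 << 20]; // 1 MB array
    }

    static void Main()
    {
        int processed = 0;
        Parallel.ForEach(new SingleItemPartitioner<byte[]>(CreateData(200)),
                         data =>
        {
            data[0] = 1;
            Interlocked.Increment(ref processed);
        });
        Console.WriteLine(processed); // 200
    }
}
```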
While using a custom partitioner is undoubtedly the most "correct" answer, a simpler solution is letting the garbage collector catch up. In the case I tried, I was making repeated calls to a Parallel.For loop inside a function. Despite exiting the function each time, the memory used by the program kept increasing linearly, as described here. I added:

and while it is not super fast, it did solve the memory problem. Presumably at high CPU usage and memory utilization, the garbage collector doesn't operate efficiently.
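The snippet the author added after "I added:" is missing from this copy of the page. A hypothetical reconstruction follows, assuming the workaround was a forced blocking collection between passes, which is the usual form of "letting the garbage collector catch up":

```csharp
using System;

class GcCatchUp
{
    // Stand-in for one pass of the author's parallel loop over large arrays.
    static void ProcessBatch()
    {
        var buffer = new byte[1 << 20]; // 1 MB array
        buffer[0] = 1;
    }

    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            ProcessBatch();

            // Hypothetical reconstruction (the author's actual snippet is not
            // shown): force a full, blocking collection between passes so the
            // dead arrays are reclaimed before the next allocation burst.
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
        Console.WriteLine("done");
    }
}
```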
The default options for Parallel.ForEach only work well when the task is CPU-bound and scales linearly. When the task is CPU-bound, everything works perfectly. If you have a quad-core and no other processes running, then Parallel.ForEach uses all four processors. If you have a quad-core and some other process on your computer is using one full CPU, then Parallel.ForEach uses roughly three processors.

But if the task is not CPU-bound, then Parallel.ForEach keeps starting tasks, trying hard to keep all CPUs busy. Yet no matter how many tasks are running in parallel, there is always more unused CPU horsepower, and so it keeps creating tasks.

How can you tell if your task is CPU-bound? Hopefully just by inspecting it. If you are factoring prime numbers, it is obvious. But other cases are not so obvious. The empirical way to tell if your task is CPU-bound is to limit the maximum degree of parallelism with ParallelOptions.MaxDegreeOfParallelism and observe how your program behaves. If your task is CPU-bound, then you should see a pattern like this on a quad-core system:

ParallelOptions.MaxDegreeOfParallelism = 1: use one full CPU, or 25% CPU utilization
ParallelOptions.MaxDegreeOfParallelism = 2: use two CPUs, or 50% CPU utilization
ParallelOptions.MaxDegreeOfParallelism = 4: use all CPUs, or 100% CPU utilization

If it behaves like this, then you can use the default Parallel.ForEach options and get good results. Linear CPU utilization means good task scheduling.

But if I run your sample application on my Intel i7, I get about 20% CPU utilization no matter what maximum degree of parallelism I set. Why is this? So much memory is being allocated that the garbage collector is blocking threads. The application is resource-bound, and the resource is memory.

Likewise, an I/O-bound task that performs long-running queries against a database server will also never be able to effectively utilize all the CPU resources available on the local computer. And in cases like that, the task scheduler is unable to "know when to stop" starting new tasks.

If your task is not CPU-bound, or the CPU utilization doesn't scale linearly with the maximum degree of parallelism, then you should advise Parallel.ForEach not to start too many tasks at once. The simplest way is to specify a number that permits some parallelism for overlapping I/O-bound tasks, but not so many that you overwhelm the local computer's resources or overtax any remote servers. Trial and error is involved to get the best results:
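The code that followed this colon in the original answer is not shown in this copy. A sketch of what it describes, applying a cap to the question's loop (the value 4 is only an assumed starting point to tune, and the generator is bounded here so the demo terminates):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static IEnumerable<byte[]> CreateData(int count)
    {
        for (int i = 0; i < count; i++)
            yield return new byte[1 << 20]; // 1 MB array
    }

    static void Main()
    {
        // Cap how many tasks run at once; 4 is an assumed starting point,
        // to be tuned by trial and error as the answer suggests.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        int processed = 0;
        Parallel.ForEach(CreateData(50), options, data =>
        {
            data[0] = 1;
            Interlocked.Increment(ref processed);
        });

        Console.WriteLine(processed); // 50
    }
}
```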