Multi-threading a large batch process
We have a batch process consisting of about 5 calculations that happen on each row of data (20 million rows total). Our production server will have around 24 processors with decent CPUs.
Performance is critical for us. Assuming that our algorithms are pretty efficient, what would be the best way to achieve maximum time performance for this? Specifically, should we be able to achieve better performance through multi-threading, using threadpools, etc? Also, could use of the Process object to divide up the batch into multiple programs be of benefit?
7 Answers
A few thoughts:
First, you need to be more definite about "best" - there are trade-offs involved in performing such massive processing. Specifically, memory, I/O, and CPU utilization are all considerations, as is how much memory each calculation requires, and so on.
Assuming that you are the only process on the machine, you have lots of memory, and you are primarily interested in optimizing throughput, here are some suggestions:
In addition to thread pools, there is also the Task Parallel Library, which offers facilities that simplify the development of this kind of parallel computation. It is specifically designed to scale to the number of cores and to optimize the way threads are used. There is also Parallel LINQ, which you may find useful.
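A minimal sketch of the TPL and PLINQ approaches, assuming the rows can be represented here as a plain double[] and that Calculate stands in for the five per-row calculations (both are hypothetical placeholders, not names from the question):

```csharp
using System.Linq;
using System.Threading.Tasks;

public static class BatchProcessor
{
    // Runs the per-row work across all available cores; the TPL partitions
    // the input and scales to the machine's core count automatically.
    public static void ProcessWithTpl(double[] rows)
    {
        Parallel.ForEach(rows, row => Calculate(row));
    }

    // Parallel LINQ variant, useful when each row produces a result you want back.
    public static double[] ProcessWithPlinq(double[] rows)
    {
        return rows.AsParallel()
                   .AsOrdered()                 // keep results in row order
                   .Select(row => Calculate(row))
                   .ToArray();
    }

    // Stand-in for the five per-row calculations described in the question.
    private static double Calculate(double row)
    {
        return row * row;
    }
}
```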
Overall, if you can wait for .NET 4, PFX (the Parallel Extensions) is likely to be the best model.
Until then, avoid lots of process/thread starts and ends, i.e. use the thread pool (starting a process is extremely expensive, starting a thread is very expensive).
Simple approach: batch the calculations up into jobs that should each complete in ~50 ms, and then start queuing them. The hard part is ensuring everything has completed. A simple completion mechanism is a shared "completed" counter that each job increments; the main thread spins on reading the counter until it reaches the expected final value.
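A minimal sketch of that counter-based completion scheme, assuming the rows are addressed by index and that ProcessChunk (a hypothetical placeholder) runs the calculations for one ~50 ms slice:

```csharp
using System;
using System.Threading;

public static class QueuedBatch
{
    public static void Run(int rowCount, int chunkSize)
    {
        int jobCount = (rowCount + chunkSize - 1) / chunkSize;
        int completed = 0;

        for (int i = 0; i < jobCount; i++)
        {
            int start = i * chunkSize;
            int end = Math.Min(start + chunkSize, rowCount);

            // Each work item is one ~50 ms batch of calculations.
            ThreadPool.QueueUserWorkItem(_ =>
            {
                ProcessChunk(start, end);
                Interlocked.Increment(ref completed);   // signal one job finished
            });
        }

        // The main thread spins until every queued job has reported completion.
        while (Thread.VolatileRead(ref completed) < jobCount)
        {
            Thread.SpinWait(100);
        }
    }

    // Stand-in for running the five calculations on rows [start, end).
    private static void ProcessChunk(int start, int end) { }
}
```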
This depends a lot on what the "5 calculations" constitute. If there is any significant computation required to perform those 5 calculations, then multithreading will be a huge benefit. The smaller the amount of work, the more care will need to go into partitioning in order to get a good gain.
Given that this is running "on each row of data", the most efficient way to handle this (if possible) would be to do the update directly within your database. Pulling the data to the client, processing it, and writing it back will be much slower than doing the calculation directly in the DB. Most database servers have good support for threading on their own and do a good job of optimizing an update, so if you can arrange to process the data directly in the DB, you will get the best performance.
If that's not possible, then I'd recommend looking into using the Task Parallel Library to handle this. Running on .NET 4 will be especially helpful, since the work stealing added to the thread pool will give you a better overall throughput.
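If the client-side route is taken, a minimal sketch of the partitioning concern mentioned above, assuming in-memory double[] rows and a hypothetical Calculate placeholder: when the per-row work is small, handing the TPL contiguous index ranges keeps the scheduling overhead from swamping the calculations.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PartitionedBatch
{
    public static void Run(double[] rows)
    {
        // Hand out contiguous index ranges rather than single rows, so the
        // scheduling overhead is paid once per range instead of once per row.
        Parallel.ForEach(Partitioner.Create(0, rows.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                Calculate(rows[i]);
            }
        });
    }

    // Stand-in for the five per-row calculations.
    private static double Calculate(double row)
    {
        return row * row;
    }
}
```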
Thread pools are a safe and easy way to do this - there is a maximum of 64 simultaneous threads available to the pool (this is actually a limit of WaitHandle.WaitAll). Using the Process object just introduces new issues and complexity around debugging that aren't worth the perceived trade-offs - especially considering that any value you gain would come from the parallelism that the pool already gives you.
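A minimal sketch of the thread-pool-plus-WaitHandle pattern that the 64-handle limit refers to, with DoWork as a hypothetical placeholder for one worker's slice of the rows:

```csharp
using System.Threading;

public static class WaitHandleBatch
{
    public static void Run()
    {
        const int workerCount = 64;                   // WaitHandle.WaitAll accepts at most 64 handles
        var done = new ManualResetEvent[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            done[i] = new ManualResetEvent(false);
            int slice = i;
            ThreadPool.QueueUserWorkItem(_ =>
            {
                DoWork(slice);        // this worker's share of the 20 million rows
                done[slice].Set();    // mark the worker as finished
            });
        }

        WaitHandle.WaitAll(done);      // blocks until all workers have signalled
    }

    // Stand-in for processing one slice of the batch.
    private static void DoWork(int slice) { }
}
```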
Only a granular assessment can reveal the best way to optimize this task, but using a pool of threads will certainly bring improvements.
Identify the most common tasks and divide them across the pool. Most importantly, establish a reliable way to measure performance, because only then will you know where the bottlenecks are and where to focus your improvements.
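A minimal sketch of that measurement step, assuming hypothetical ProcessSequential and ProcessParallel implementations of the batch that you want to compare:

```csharp
using System;
using System.Diagnostics;

public static class Benchmark
{
    public static void Compare(double[] rows)
    {
        var sw = Stopwatch.StartNew();
        ProcessSequential(rows);
        Console.WriteLine("Sequential: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        ProcessParallel(rows);
        Console.WriteLine("Parallel:   {0} ms", sw.ElapsedMilliseconds);
    }

    // Stand-ins for the single-threaded and pooled versions of the batch.
    private static void ProcessSequential(double[] rows) { }
    private static void ProcessParallel(double[] rows) { }
}
```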
I'd suggest doing this within a database procedure, if possible. Otherwise, it probably doesn't matter how efficient your client-side processing is; the time will be dominated by marshalling the data back and forth across the network. Even if you run the process on the same machine, you may still pay the penalty of serializing everything through your (presumably ODBC) driver. Unless, of course, you write a native procedure that can run within the address space of your database server (if your server supports that).
I guess I'd suggest writing a procedure that takes a lower and upper bound for selecting records, then writing a client-side program that forks off a few threads, allocates a DB connection per thread, then calls the server-side procedure with appropriately-sized bounds (say five threads with four million rows apiece). If your DB server is multithreaded, then this should give you decent performance.
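A minimal sketch of that client side, assuming a hypothetical stored procedure dbo.ProcessRows with @LowerBound/@UpperBound parameters over a contiguous integer key (none of these names come from the original answer):

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Threading;

public static class BoundedRangeClient
{
    public static void Run(string connectionString)
    {
        const long totalRows = 20000000;
        const int threadCount = 5;
        long rowsPerThread = totalRows / threadCount;

        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            long lower = i * rowsPerThread + 1;
            long upper = (i == threadCount - 1) ? totalRows : lower + rowsPerThread - 1;

            threads[i] = new Thread(() =>
            {
                // One dedicated connection per thread, each calling the
                // server-side procedure with its own key range.
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand("dbo.ProcessRows", conn))
                {
                    cmd.CommandType = CommandType.StoredProcedure;
                    cmd.CommandTimeout = 0;   // long-running batch
                    cmd.Parameters.AddWithValue("@LowerBound", lower);
                    cmd.Parameters.AddWithValue("@UpperBound", upper);
                    conn.Open();
                    cmd.ExecuteNonQuery();
                }
            });
            threads[i].Start();
        }

        foreach (var t in threads)
        {
            t.Join();
        }
    }
}
```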
With any multithreaded approach, though, be aware that if you're updating many rows you can wind up with locking problems due to lock escalation if you don't commit your transactions often enough.
If you're using SQL Server 2005/2008, consider adding your calculations to SQL Server as CLR functions: http://msdn.microsoft.com/en-us/library/ms254498%28VS.80%29.aspx. This is much faster than doing the calculation in T-SQL and saves you the cost of moving data in and out of the database. SQL Server will manage the threads for you. You could also experiment with opening multiple connections, each working on a different set of rows, to gauge the impact on performance, connection time, and so on.
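A minimal sketch of a SQL CLR scalar function in the spirit of the linked MSDN article; the calculation body and parameter names here are hypothetical placeholders, not the asker's actual calculations:

```csharp
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;

public static class RowCalculations
{
    // Scalar function callable from T-SQL once the assembly is registered.
    [SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Calculate(SqlDouble a, SqlDouble b)
    {
        if (a.IsNull || b.IsNull)
            return SqlDouble.Null;

        // Stand-in for one of the five per-row calculations; it runs inside
        // the server process, so no data leaves the database.
        return a * b + a;
    }
}
```

Once deployed with CREATE ASSEMBLY and CREATE FUNCTION, such a function can be called directly from an UPDATE over the 20 million rows, so the data never has to leave the server.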