C# Multi-threaded File IO (Read)

Posted on 2024-08-29 08:24:52


We have a situation where our application needs to process a series of files and rather than perform this function synchronously, we would like to employ multi-threading to have the workload split amongst different threads.

Each item of work is:
1. Open a file for read only
2. Process the data in the file
3. Write the processed data to a Dictionary

We would like to perform each file's work on a new thread.
Is this possible, and would we be better off using the ThreadPool or spawning new threads, keeping in mind that each item of "work" only takes 30 ms but hundreds of files may need to be processed?

Any ideas to make this more efficient are appreciated.

EDIT: At the moment we are making use of the ThreadPool to handle this. If we have 500 files to process we cycle through the files and allocate each "unit of processing work" to the threadpool using QueueUserWorkItem.

Is it suitable to make use of the threadpool for this?
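
For illustration, here is a minimal sketch of the approach described in the edit above: queue one work item per file and serialize writes to the shared dictionary. The FileProcessor class, the placeholder processing step, and the CountdownEvent-based wait (which requires .NET 4.0 or later) are assumptions for the sketch, not the actual application code.

using System.Collections.Generic;
using System.IO;
using System.Threading;

class FileProcessor
{
    // Shared result dictionary; Dictionary<TKey,TValue> is not thread-safe,
    // so all writes are serialized with a lock.
    private readonly Dictionary<string, string> _results = new Dictionary<string, string>();
    private readonly object _sync = new object();
    private CountdownEvent _pending;

    public IDictionary<string, string> ProcessAll(IList<string> files)
    {
        _pending = new CountdownEvent(files.Count);

        foreach (string file in files)
        {
            // Hand each "unit of processing work" to the ThreadPool.
            ThreadPool.QueueUserWorkItem(ProcessOne, file);
        }

        _pending.Wait();   // block until every queued item has signalled completion
        return _results;
    }

    private void ProcessOne(object state)
    {
        string path = (string)state;
        try
        {
            // 1. open the file read-only, 2. process its data (placeholder transformation)
            string processed = File.ReadAllText(path).ToUpperInvariant();

            // 3. write the processed data to the dictionary
            lock (_sync)
            {
                _results[path] = processed;
            }
        }
        finally
        {
            _pending.Signal();
        }
    }
}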


Comments (8)

我很坚强 2024-09-05 08:24:52


I would suggest you use ThreadPool.QueueUserWorkItem(...); with it, the threads are managed by the system and the .NET Framework. The chances of messing things up with your own thread pool are much higher, so I would recommend using the ThreadPool provided by .NET.
It's very easy to use:

ThreadPool.QueueUserWorkItem(new WaitCallback(YourMethod), ParameterToBeUsedByMethod); 

void YourMethod(object o)
{
    // Your code here...
}

For more reading please follow the link http://msdn.microsoft.com/en-us/library/3dasc8as%28VS.80%29.aspx

Hope this helps.

伪装你 2024-09-05 08:24:52


I suggest you have a finite number of threads (say 4) and then have 4 pools of work. That is, if you have 400 files to process, give each thread 100 files, split evenly. You then spawn the threads, pass each one its work, and let them run until they have finished their specific work.

You only have a certain amount of I/O bandwidth, so having too many threads will not provide any benefit; also remember that creating a thread takes a small amount of time.
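
A rough sketch of that partitioning scheme, under the assumption of a caller-supplied processFile delegate and a simple striped split of the file list (both are illustrative, not part of the original answer):

using System;
using System.Collections.Generic;
using System.Threading;

static class PartitionedWork
{
    public static void Run(IList<string> files, Action<string> processFile, int threadCount)
    {
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            int start = i;
            var worker = new Thread(() =>
            {
                // Each thread takes every threadCount-th file (a simple static split).
                for (int j = start; j < files.Count; j += threadCount)
                {
                    processFile(files[j]);
                }
            });
            threads.Add(worker);
            worker.Start();
        }

        // Wait for every partition to finish.
        foreach (Thread t in threads)
        {
            t.Join();
        }
    }
}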

剧终人散尽 2024-09-05 08:24:52


Instead of having to deal with threads or manage thread pools directly I would suggest using a higher-level library like Parallel Extensions (PEX):

var filesContent = from file in enumerableOfFilesToProcess
                   select new 
                   {
                       File=file, 
                       Content=File.ReadAllText(file)
                   };

var processedContent = from content in filesContent
                       select new 
                       {
                           content.File, 
                           ProcessedContent = ProcessContent(content.Content)
                       };

var dictionary = processedContent
           .AsParallel()
           .ToDictionary(c => c.File);

PEX will handle thread management according to the available cores and load while you get to concentrate on the business logic at hand (wow, that sounded like a commercial!)

PEX is part of the .NET Framework 4.0, but a back-port to 3.5 is also available as part of the Reactive Framework.
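
If you are already on .NET 4.0, an equivalent shape using the released TPL/PLINQ types might look like the following; the ProcessContent placeholder and the use of ConcurrentDictionary are my assumptions, not the answerer's code:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class ParallelFileProcessing
{
    // Placeholder for whatever transformation the application applies.
    static string ProcessContent(string content)
    {
        return content.Trim();
    }

    public static IDictionary<string, string> Run(IEnumerable<string> files)
    {
        // ConcurrentDictionary is safe to write to from multiple worker threads.
        var results = new ConcurrentDictionary<string, string>();

        Parallel.ForEach(files, file =>
        {
            results[file] = ProcessContent(File.ReadAllText(file));
        });

        return results;
    }
}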

魔法唧唧 2024-09-05 08:24:52


I suggest using the CCR (Concurrency and Coordination Runtime); it will handle the low-level threading details for you. As for your strategy, one thread per work item may not be the best approach depending on how you write to the dictionary, because dictionaries aren't thread-safe and you may create heavy contention.

Here's some sample code using the CCR, an Interleave would work nicely here:

Arbiter.Activate(dispatcherQueue, Arbiter.Interleave(
    new TeardownReceiverGroup(Arbiter.Receive<bool>(
        false, mainPort, new Handler<bool>(Teardown))),
    new ExclusiveReceiverGroup(Arbiter.Receive<object>(
        true, mainPort, new Handler<object>(WriteData))),
    new ConcurrentReceiverGroup(Arbiter.Receive<string>(
        true, mainPort, new Handler<string>(ReadAndProcessData)))));

public void WriteData(object data)
{
    // write data to the dictionary
    // this code is never executed in parallel so no synchronization code needed
}

public void ReadAndProcessData(string s)
{
    // this code gets scheduled to be executed in parallel
    // the CCR takes care of the task scheduling for you
}

public void Teardown(bool b)
{
    // clean up when all tasks are done
}

清欢 2024-09-05 08:24:52


In the long run, I think you'll be happier if you manage your own threads. This will let you control how many are running and make it easy to report status.

  1. Build a worker class that does the processing and give it a callback routine to return results and status.
  2. For each file, create a worker instance and a thread to run it. Put the thread in a Queue.
  3. Peel threads off of the queue up to the maximum you want to run simultaneously. As each thread completes go get another one. Adjust the maximum and measure throughput. I prefer to use a Dictionary to hold running threads, keyed by their ManagedThreadId.
  4. To stop early, just clear the queue.
  5. Use locking around your thread collections to preserve your sanity.
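
A minimal sketch of the scheme described in that list; the FileWorker processing body, the callback signature, and the backfill logic are illustrative assumptions rather than a definitive implementation:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

// One unit of work: process a single file and report back through a callback.
class FileWorker
{
    private readonly string _path;
    private readonly Action<string, string> _onCompleted;   // (path, result)

    public FileWorker(string path, Action<string, string> onCompleted)
    {
        _path = path;
        _onCompleted = onCompleted;
    }

    public void Run()
    {
        string result = File.ReadAllText(_path);   // placeholder processing
        _onCompleted(_path, result);
    }
}

class ThreadScheduler
{
    private readonly Queue<string> _pendingFiles;
    private readonly Dictionary<int, Thread> _running = new Dictionary<int, Thread>();   // keyed by ManagedThreadId
    private readonly Dictionary<string, string> _results = new Dictionary<string, string>();
    private readonly object _sync = new object();
    private readonly int _maxConcurrent;

    public ThreadScheduler(IEnumerable<string> files, int maxConcurrent)
    {
        _pendingFiles = new Queue<string>(files);
        _maxConcurrent = maxConcurrent;
    }

    public void Start()
    {
        lock (_sync)
        {
            while (_running.Count < _maxConcurrent && _pendingFiles.Count > 0)
            {
                StartNext();   // must be called while holding _sync
            }
        }
    }

    private void StartNext()
    {
        string path = _pendingFiles.Dequeue();
        var worker = new FileWorker(path, OnWorkerCompleted);
        var thread = new Thread(worker.Run);
        _running[thread.ManagedThreadId] = thread;   // record before starting
        thread.Start();
    }

    private void OnWorkerCompleted(string path, string result)
    {
        lock (_sync)
        {
            _results[path] = result;
            _running.Remove(Thread.CurrentThread.ManagedThreadId);

            // Backfill: as each thread finishes, start the next queued file.
            if (_pendingFiles.Count > 0)
            {
                StartNext();
            }
        }
    }
}
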
始于初秋 2024-09-05 08:24:52


Use ThreadPool.QueueUserWorkItem to execute each independent task. Definitely don't create hundreds of threads. That is likely to cause major headaches.

很酷不放纵 2024-09-05 08:24:52


The general rule is to use the ThreadPool if you don't want to worry about when the threads finish (or about using mutexes to track them), or about stopping the threads.

So do you need to worry about when the work is done? If not, the ThreadPool is the best option. If you want to track overall progress or stop threads, then your own collection of threads is best.

ThreadPool is generally more efficient if you are re-using threads. This question will give you a more detailed discussion.
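
If you do need to know when the queued work is done but still want the ThreadPool, one common pattern (a sketch under my own assumptions, not something from this answer) is to count outstanding items and signal an event when the last one finishes:

using System.Threading;

static class ThreadPoolCompletion
{
    // Queues one work item per input and blocks until all of them have run.
    public static void RunAndWait(WaitCallback work, object[] workItems)
    {
        if (workItems.Length == 0)
        {
            return;
        }

        int remaining = workItems.Length;
        using (var allDone = new ManualResetEvent(false))
        {
            foreach (object item in workItems)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try
                    {
                        work(state);
                    }
                    finally
                    {
                        // The last item to finish releases the waiting caller.
                        if (Interlocked.Decrement(ref remaining) == 0)
                        {
                            allDone.Set();
                        }
                    }
                }, item);
            }

            allDone.WaitOne();
        }
    }
}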

Hth

一生独一 2024-09-05 08:24:52


Using the ThreadPool for each individual task is definitely a bad idea. From my experience this tends to hurt performance more than help it. The first reason is that a considerable amount of overhead is required just to allocate a task for the ThreadPool to execute. By default, each application is assigned its own ThreadPool that is initialized with ~100 thread capacity. When you execute 400 operations in parallel, it does not take long to fill the queue with requests, and now you have ~100 threads all competing for CPU cycles. Yes, the .NET Framework does a great job with throttling and prioritizing the queue; however, I have found that the ThreadPool is best left for long-running operations that probably won't occur very often (loading a configuration file, or random web requests). Using the ThreadPool to fire off a few operations at random is much more efficient than using it to execute hundreds of requests at once. Given the current information, the best course of action would be something similar to this:

  1. Create a System.Threading.Thread (or use a SINGLE ThreadPool thread) with a queue that the application can post requests to

  2. Use the FileStream's BeginRead and BeginWrite methods to perform the IO operations. This causes the .NET Framework to use native APIs to thread and execute the IO (I/O completion ports, IOCP).

This gives you two advantages: first, your requests will still be processed in parallel while the operating system manages file-system access and threading; second, because the bottleneck in the vast majority of systems will be the HDD, you can implement custom prioritization and throttling on your request thread to gain greater control over resource usage.

Currently I have been writing a similar application, and using this method is both efficient and fast... Without any threading or throttling my application was only using 10-15% CPU, which can be acceptable for some operations depending on the processing involved; however, it made my PC as slow as if an application were using 80%+ of the CPU. That was the file-system access. The ThreadPool and IOCP functions do not care whether they are bogging the PC down, so don't get confused: they are optimized for performance, even if that performance means your HDD is squealing like a pig.

The only problem I have had is that memory usage ran a little high (50+ MB) during the testing phase with approximately 35 streams open at once. I am currently working on a solution similar to the MSDN recommendation for SocketAsyncEventArgs, using a pool to allow x number of requests to be operating simultaneously, which ultimately led me to this forum post.

Hope this helps somebody with their decision making in the future :)
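
For completeness, a small sketch of kicking off an overlapped read with FileStream.BeginRead; the callback signature, buffer sizing, and the single-read simplification are my assumptions (production code would loop until the whole file has been read):

using System;
using System.IO;
using System.Text;

static class AsyncFileReader
{
    // Starts an asynchronous (IOCP-backed) read of one file and invokes
    // onCompleted with the file path and its full text when the read finishes.
    public static void ReadAsync(string path, Action<string, string> onCompleted)
    {
        // Passing true for the last constructor argument opens the file for
        // overlapped IO so BeginRead uses IO completion ports.
        var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                    FileShare.Read, 4096, true);
        var buffer = new byte[(int)stream.Length];

        stream.BeginRead(buffer, 0, buffer.Length, ar =>
        {
            int bytesRead = stream.EndRead(ar);
            stream.Dispose();

            // Simplification: assume one read returned the entire file.
            string text = Encoding.UTF8.GetString(buffer, 0, bytesRead);
            onCompleted(path, text);
        }, null);
    }
}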
