Intensive file I/O and data processing in C#
I'm writing an app which needs to process a large text file (comma-separated with several different types of records - I do not have the power or inclination to change the data storage format). It reads in records (often all the records in the file sequentially, but not always), then the data for each record is passed off for some processing.
Right now this part of the application is single-threaded (read a record, process it, read the next record, etc.). I'm thinking it might be more efficient to read records into a queue in one thread and process them in another thread, either in small blocks or as they become available.
I have no idea how to start programming something like that, including the data structure that would be necessary or how to implement the multithreading properly. Can anyone give any pointers, or offer other suggestions about how I might improve performance here?
Comments (3)
You might get a benefit if you can balance the time spent processing records against the time spent reading them; in that case you could use a producer/consumer setup, for example a synchronized queue with one worker (or a few) dequeuing and processing. I might also be tempted to investigate Parallel Extensions; it is pretty easy to write an IEnumerable<T> version of your reading code, after which Parallel.ForEach (or one of the other Parallel methods) should do everything you want.
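A minimal sketch of the IEnumerable<T> plus Parallel.ForEach idea, not a definitive implementation. The names ParallelRecords, ReadRecords, CountFields and TotalFields are hypothetical, and counting comma-separated fields is only a stand-in for whatever your real per-record processing does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class ParallelRecords
{
    // Hypothetical reader: lazily yields one line (one record) at a time,
    // so the whole file is never held in memory.
    public static IEnumerable<string> ReadRecords(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Hypothetical stand-in for the real work: counts the fields in a record.
    public static int CountFields(string record)
    {
        return record.Split(',').Length;
    }

    // Processes the records on multiple threads and returns the total field count.
    public static long TotalFields(IEnumerable<string> records)
    {
        long total = 0;
        Parallel.ForEach(records, record =>
        {
            // The body runs concurrently, so shared state needs a thread-safe update.
            Interlocked.Add(ref total, CountFields(record));
        });
        return total;
    }
}
```

You would call it as ParallelRecords.TotalFields(ParallelRecords.ReadRecords(path)). Note that Parallel.ForEach does not process items in file order, so this only fits if records can be handled independently of one another.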
Take a look at these tutorials; they contain all you need. They are Microsoft tutorials that include code samples for a case similar to the one you describe. Your producer fills the queue, while the consumer pops records off.
Creating, starting, and interacting between threads
Synchronizing two threads: a producer and a consumer
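For a rough picture of the producer/consumer shape, here is a sketch that uses BlockingCollection<T> from System.Collections.Concurrent (.NET 4 and later) rather than the hand-rolled synchronization the tutorials walk through. RecordPipeline, Run and the process delegate are hypothetical names, the capacity of 1000 is an arbitrary assumption, and there is no cancellation or error handling.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class RecordPipeline
{
    // One producer fills a bounded queue; workerCount consumers drain it.
    // Returns the sum of whatever the (hypothetical) process delegate returns.
    public static int Run(IEnumerable<string> records, int workerCount, Func<string, int> process)
    {
        // The bound makes the producer wait when the consumers fall behind.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var record in records)
                    {
                        queue.Add(record);
                    }
                }
                finally
                {
                    // Lets the consumers' loops end once the queue is empty.
                    queue.CompleteAdding();
                }
            });

            var consumers = new Task<int>[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                consumers[i] = Task.Run(() =>
                {
                    int subtotal = 0;
                    // Blocks while the queue is empty; ends after CompleteAdding.
                    foreach (var record in queue.GetConsumingEnumerable())
                    {
                        subtotal += process(record);
                    }
                    return subtotal;
                });
            }

            producer.Wait();
            Task.WaitAll(consumers);

            int total = 0;
            foreach (var consumer in consumers)
            {
                total += consumer.Result;
            }
            return total;
        }
    }
}
```

Whether this beats the single-threaded loop depends on how the read time compares with the processing time; if reading dominates, extra consumers will mostly sit idle.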
You may also look at asynchronous I/O. In this style, you start a file operation from the main thread; it then continues running in the background, and when it completes it invokes a callback that you specified. In the meantime, you can continue doing other things (such as processing the data). For example, you could start an asynchronous operation to read the next 1000 bytes, then process the 1000 bytes you already have, and then wait for the next kilobyte.
Unfortunately, programming asynchronous operations in C# is a bit painful. There is an MSDN sample, but it's not nice at all. This can be nicely solved in F# using asynchronous workflows. I wrote an article that explains the problem and shows how to do a similar thing using C# iterators.
A more promising solution for C# is the Wintellect PowerThreading library, which supports a similar trick using C# iterators. There is a good introductory article by Jeffrey Richter in the MSDN Concurrent Affairs column.
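Since this was written, C# 5 added async/await, which makes the "start the next read, process what you have, then wait" pattern much less painful than the callback or iterator approaches. A minimal sketch under that assumption follows; it is a swapped-in technique, not the PowerThreading approach, and OverlappedReader, ProcessAsync and processBlock are hypothetical names.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class OverlappedReader
{
    // Reads the stream in blocks and processes each block while the read
    // of the following block is already in flight. processBlock receives
    // the buffer and the number of valid bytes in it.
    public static async Task<long> ProcessAsync(
        Stream input, int blockSize, Func<byte[], int, long> processBlock)
    {
        var current = new byte[blockSize];
        var next = new byte[blockSize];
        long result = 0;

        int count = await input.ReadAsync(current, 0, current.Length);
        while (count > 0)
        {
            // Start reading the next block before touching the current one.
            Task<int> pending = input.ReadAsync(next, 0, next.Length);

            result += processBlock(current, count);

            // Wait for the read that was running during processing.
            count = await pending;

            // Swap buffers so the freshly read block becomes the current one.
            var tmp = current;
            current = next;
            next = tmp;
        }
        return result;
    }
}
```

For a real file you would likely open the FileStream with useAsync set to true so the reads are genuinely asynchronous. Blocks will not line up with record boundaries, so a real version has to carry a partial record over from one block to the next.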