C# Stream Design Question
I have an application right now that uses a pipeline design. In the first stage it reads some data and files into a Stream. There are some intermediate stages that do stuff to the stream of data. And then there is a final stage that writes the stream out to somewhere. This all happens serially: one stage completes and then hands off to the next stage.
This has all been working just great, but now the amount of data is starting to get quite a bit larger (hundreds of GB potentially). So I'm thinking that I will need to do something to alleviate this. My initial thought, below, is what I'm looking for some feedback on (being an independent developer, I just don't have anyone to bounce the idea off).
I'm thinking of creating a parallel pipeline. The object that starts off the pipeline would create all of the stages and kick each one off in its own thread. When the first stage gets the stream to a certain size, it will pass that stream off to the next stage for processing and start up a new stream of its own to continue to fill up. The idea here is that the final stage will be closing out streams while the first stage is building new ones, so my memory usage would be kept lower.
So questions:
1) Any high level thoughts on directions for this design?
2) Is there a simpler approach that you can think of that might apply here?
3) Is there anything existing out there that does something like this that I could reuse (not a product I have to buy)?
Thanks,
MikeD
Comments (3)
For the design you've suggested, you'd want to have a good read up on producer/consumer problems if you haven't already. You'll need a good understanding of how to use semaphores in that situation.
Another approach you could try is to create multiple identical pipelines, each in a separate thread. This would probably be easier to code because it has a lot less inter-thread communication. However, depending on your data you may not be able to split it into chunks this way.
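As a rough illustration of the producer/consumer idea, here is a minimal sketch that uses `BlockingCollection<T>` (available from .NET 4) instead of hand-rolled semaphores; the bounded capacity does the blocking for you. The class name `ChunkedPipeline`, the method `Run`, and the parameters `chunkSize` and `maxQueuedChunks` are made up for this example, and the "processing" step is just a pass-through.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public static class ChunkedPipeline
{
    // Hypothetical helper: moves data from input to output through a bounded
    // queue of chunks, so at most maxQueuedChunks chunks wait in the queue.
    public static void Run(Stream input, Stream output,
                           int chunkSize = 64 * 1024, int maxQueuedChunks = 4)
    {
        using (var queue = new BlockingCollection<byte[]>(maxQueuedChunks))
        {
            // Producer: reads chunks and blocks in Add() while the queue is full.
            var producer = Task.Run(() =>
            {
                try
                {
                    var buffer = new byte[chunkSize];
                    int read;
                    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        var chunk = new byte[read];
                        Buffer.BlockCopy(buffer, 0, chunk, 0, read);
                        queue.Add(chunk);
                    }
                }
                finally
                {
                    // Tells the consumer that no more chunks are coming.
                    queue.CompleteAdding();
                }
            });

            // Consumer: takes chunks in FIFO order until the producer completes.
            var consumer = Task.Run(() =>
            {
                foreach (var chunk in queue.GetConsumingEnumerable())
                {
                    // Intermediate processing of the chunk would go here.
                    output.Write(chunk, 0, chunk.Length);
                }
            });

            Task.WaitAll(producer, consumer);
        }
    }
}
```

With one producer and one consumer the chunks arrive in order. This sketch does not handle a failing consumer (the producer could then block forever on a full queue), so real code would probably want a `CancellationToken` as well.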
In each stage, do you read the entire chunk of data, do the manipulation, then send the entire chunk to the next stage?
If that is the case, you are using a "push" technique where you push the entire chunk of data to the next stage. Are you able to handle things in a more stream-like manner using a "pull" technique? Each stage is a stream, and as you read data from that stream, it pulls data from the previous stream by calling Read on it. As each stream is being read, it reads from the previous stream in small pieces, processes them, and returns the processed data. The destination stream determines how many bytes to read from the previous stream, so you never have to hold large amounts of data in memory. This is how applications like BizTalk work. There are some blogs about how BizTalk pipeline streams work, and I think it might be exactly what you want.
Here's a multi-part blog entry that you might find interesting:
Part 1
Part 2
Part 3
Part 4
Part 5
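To make the pull technique a bit more concrete, here is a minimal sketch of a stage written as a read-only `Stream` wrapper. It is not how BizTalk implements it, just the same general shape. The class name `UpperCaseStage` and its ASCII upper-casing transform are invented for the example, and the syntax assumes C# 7 or later.

```csharp
using System;
using System.IO;

// Hypothetical pipeline stage: a read-only stream that pulls bytes from the
// previous stage on demand and upper-cases ASCII letters as they pass through.
public sealed class UpperCaseStage : Stream
{
    private readonly Stream _source;

    public UpperCaseStage(Stream source)
    {
        _source = source ?? throw new ArgumentNullException(nameof(source));
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Pull only as much as the caller asked for from the previous stage.
        int read = _source.Read(buffer, offset, count);

        // Transform the bytes in place before handing them to the caller.
        for (int i = offset; i < offset + read; i++)
        {
            byte b = buffer[i];
            if (b >= (byte)'a' && b <= (byte)'z')
                buffer[i] = (byte)(b - 32);
        }
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();

    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override void Flush() { }

    public override long Seek(long offset, SeekOrigin origin) =>
        throw new NotSupportedException();

    public override void SetLength(long value) =>
        throw new NotSupportedException();

    public override void Write(byte[] buffer, int offset, int count) =>
        throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        // Disposing the last stage disposes the whole chain behind it.
        if (disposing) _source.Dispose();
        base.Dispose(disposing);
    }
}
```

Stages like this can be nested (one wrapping another), and calling `CopyTo` on the outermost stage with the destination stream drives the whole chain one buffer at a time, so memory use stays roughly constant regardless of input size.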
The producer/consumer model is a good way to proceed. Microsoft also has their new Parallel Extensions, which should provide most of the groundwork for you. Look into the Task object. There's a preview release available for .NET 3.5 / VS2008.
Your first task should read blocks of data from your stream and then pass them on to other tasks. Then, have as many tasks in the middle as logically fit. Smaller tasks are (generally) better. The only thing you need to watch out for is to make sure the last task saves the data in the order it was read (because the tasks in the middle may finish in a different order from the one they started in).
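One possible way to keep the output in read order is to queue the tasks as they are started and always wait on the oldest one first. The sketch below assumes the `Task` API as it shipped in .NET 4.5 (`Task.Run`), not the preview release; `OrderedBlockPipeline`, `Run`, `transform`, `blockSize` and `maxInFlight` are names made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class OrderedBlockPipeline
{
    // Hypothetical helper: reads blocks, transforms them on worker tasks, and
    // writes the results in the same order the blocks were read.
    public static void Run(Stream input, Stream output,
                           Func<byte[], byte[]> transform,
                           int blockSize = 64 * 1024, int maxInFlight = 4)
    {
        // Tasks are queued in read order; the queue is what preserves ordering.
        var pending = new Queue<Task<byte[]>>();
        var buffer = new byte[blockSize];
        int read;

        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Each task gets its own copy, since the read buffer is reused.
            var block = new byte[read];
            Buffer.BlockCopy(buffer, 0, block, 0, read);
            pending.Enqueue(Task.Run(() => transform(block)));

            // Cap the number of blocks held in memory at maxInFlight.
            if (pending.Count >= maxInFlight)
                WriteOldest(pending, output);
        }

        // Drain whatever is still in flight once the input is exhausted.
        while (pending.Count > 0)
            WriteOldest(pending, output);
    }

    private static void WriteOldest(Queue<Task<byte[]>> pending, Stream output)
    {
        // Result blocks until the oldest task finishes, even if newer tasks
        // have already completed, so blocks are written in read order.
        byte[] result = pending.Dequeue().Result;
        output.Write(result, 0, result.Length);
    }
}
```

The trade-off is that a slow block at the head of the queue stalls the writer even when later blocks are ready, but it keeps memory bounded to roughly `maxInFlight` blocks.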