Is there a design pattern for processing data that exceeds the computer's memory capacity?
I want to write an app that can process large amounts of data (let's say, years of tick price data). The data can come from a file server, the Web, etc., but the idea is that there's too much of it to hold in the computer's memory at one time. As I process the data, I'll write the results out (let's say, to disk), and then I can discard the data.
I'm working in F#, so feedback relating to .NET is most helpful. I don't need concrete answers; just pointers to good reading in this problem area would be very much appreciated.
Is there a design pattern or toolkit for this? It seems similar to dataflow programming, in that I only want to work on part of the available data at one time, except that unlike dataflow programming I want to pull the data in rather than wait for it to arrive and then react.
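To make the pull idea concrete, here is roughly the shape I have in mind: a lazy sequence that reads the next chunk only when the consumer asks for it. This is only a sketch; the path and chunk size are made up.

```fsharp
open System.IO

// A lazy, pull-based sequence of chunks: nothing is read from the file
// until the consumer asks for the next element.
let chunkedRead (path: string) (chunkSize: int) : seq<byte[]> =
    seq {
        use stream = File.OpenRead(path)
        let buffer : byte[] = Array.zeroCreate chunkSize
        let mutable bytesRead = stream.Read(buffer, 0, chunkSize)
        while bytesRead > 0 do
            yield buffer.[0 .. bytesRead - 1]   // copy out only the bytes actually read
            bytesRead <- stream.Read(buffer, 0, chunkSize)
    }
```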
I also want to do parallel processing of this data. The way I'm currently thinking of structuring this is:
a. Each thread requests some data to work with.
b. A data reader pulls in as large a chunk of the requested data as can be cached in the computer's memory. When the thread finishes with this chunk, another chunk can be pulled in and cached.
c. The data reader also knows which chunks are currently cached, so if multiple threads request the same chunk, they can all read from the same cache (they won't have to write to it); a rough sketch of this follows the list.
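To show what I mean by (b) and (c), here is a rough, non-production sketch. ConcurrentDictionary plus Lazy is just one way to get a shared read-only cache; the eviction policy and the loadChunk function are placeholders.

```fsharp
open System.Collections.Concurrent

// Chunks are loaded on demand and shared read-only between threads.
// Wrapping each entry in Lazy<_> means a chunk is loaded exactly once
// even if several threads request it at the same moment.
type ChunkCache(loadChunk: int -> byte[], capacity: int) =
    let cache = ConcurrentDictionary<int, Lazy<byte[]>>()

    member this.Get(chunkIndex: int) : byte[] =
        // Naive eviction: drop an arbitrary chunk when over capacity.
        // A real implementation would evict least-recently-used chunks.
        if cache.Count >= capacity then
            cache.Keys
            |> Seq.tryHead
            |> Option.iter (fun k -> cache.TryRemove k |> ignore)
        cache.GetOrAdd(chunkIndex, fun i -> lazy (loadChunk i)).Value
```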
Again, is there a .NET data structure or design pattern for this?

Finally, is all this work just overengineering the wheel? For instance, is it better to just try to suck the entire data stream into an array or hash table and let OS paging worry about the issues I describe above?
I imagine SQL Server deals with issues like this, but the data I want to read might not be in a database and I'd prefer not to introduce a dependency on SQL Server. I also know that F# has sequences for lazy evaluation of data, but I'm not sure that applies to random access of data - i.e. I might want to get data from any place in the entire stream, and only from that point will I be accessing it sequentially.
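What I am picturing for that last point is something like the following: seek to an arbitrary offset, then consume sequentially from there. The fixed record size is a made-up assumption.

```fsharp
open System.IO

// Random access plus sequential reads: Seek to the starting offset first,
// then yield fixed-width records lazily from that point on.
let recordsFrom (path: string) (startOffset: int64) (recordSize: int) : seq<byte[]> =
    seq {
        use stream = File.OpenRead(path)
        stream.Seek(startOffset, SeekOrigin.Begin) |> ignore
        let buffer : byte[] = Array.zeroCreate recordSize
        let mutable read = stream.Read(buffer, 0, recordSize)
        while read = recordSize do
            yield Array.copy buffer              // hand out a copy, reuse the buffer
            read <- stream.Read(buffer, 0, recordSize)
    }
```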
2 Answers
The main question seems to be answered quite nicely by using the Stream classes in .NET. Streams can be implemented over just about anything (memory, file, network, etc.). So, if you write your code to read in from a stream and write out to a different stream, you can change the read or write implementation without changing the rest of the code.
As far as parallel processing is concerned, I assume there is a "record" concept in the large files. If that is the case and since you're using F#, you should just be able to create an iterator over the stream, then use F#'s parallelism features to process each record.
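As a rough sketch of that idea, assuming one record per line (processRecord, the batch size, and the output writer are all illustrative):

```fsharp
open System.IO

// Placeholder for the real per-record work.
let processRecord (line: string) : string =
    line.ToUpperInvariant()

// Lazily stream records, process one bounded batch at a time in parallel,
// write the results out, then discard the batch.
let run (inputPath: string) (output: TextWriter) =
    File.ReadLines(inputPath)                  // lazy seq<string>, one record per line
    |> Seq.chunkBySize 10000                   // keep only one batch in memory at a time
    |> Seq.iter (fun batch ->
        batch
        |> Array.Parallel.map processRecord    // parallel work within the batch
        |> Array.iter (fun r -> output.WriteLine(r)))
```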
I would use a master/slave design pattern, which is kind of where I think you were going with point (b) above. Do not let the OS page the data; you will see horrible slowdowns and your application will never finish.
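A minimal master/worker sketch using F#'s MailboxProcessor, assuming the work is split into numbered chunks; the Msg type, chunk ids, and doWork are illustrative:

```fsharp
// Workers pull chunk ids from the master; None means no work is left.
type Msg = RequestWork of AsyncReplyChannel<int option>

let startMaster (chunkIds: int list) =
    MailboxProcessor.Start(fun inbox ->
        let rec loop pending = async {
            let! (RequestWork reply) = inbox.Receive()
            match pending with
            | id :: rest ->
                reply.Reply(Some id)    // hand the next chunk to the asking worker
                return! loop rest
            | [] ->
                reply.Reply None        // nothing left; the worker should stop
                return! loop []
        }
        loop chunkIds)

let worker (master: MailboxProcessor<Msg>) (doWork: int -> unit) = async {
    let mutable running = true
    while running do
        let! job = master.PostAndAsyncReply RequestWork
        match job with
        | Some chunkId -> doWork chunkId
        | None -> running <- false
}
```

For example, four workers over ten chunks: `let m = startMaster [0 .. 9]`, then `[for _ in 1 .. 4 -> worker m (printfn "chunk %d")] |> Async.Parallel |> Async.RunSynchronously |> ignore`.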