设计模式是否可以处理超出计算机内存容量的数据？

发布于 2024-12-09 06:48:52 字数 733 浏览 0 评论 0原文

我想编写一个可以处理大量数据的应用程序（比如说，多年的价格数据）。数据可以来自文件服务器、网络等，但其想法是计算机内存中一次无法容纳太多数据。当我处理数据时，我会将结果写出（比如说，写入磁盘），然后我可以丢弃数据。

我正在使用 F#，因此与 .NET 相关的反馈非常有帮助。我不必有具体的答案，只需指出在这个问题领域进行良好的阅读就将非常感激。

有相关的设计模式或工具包吗？它看起来与数据流编程类似，因为我只想一次处理部分可用数据，但与数据流编程不同的是，我想要提取数据而不是等待数据到达然后做出反应。
我也想对这些数据进行并行处理。我目前正在考虑构建它的方式是：一个。每个线程都请求一些数据来处理。 b.数据读取器提取计算机内存中可以缓存的尽可能大的请求数据块。当线程完成此块时，可以拉入并缓存另一个块。 c.数据读取器还知道当前缓存了哪些块，因此，如果多个线程请求相同的块，它们都可以从同一个缓存中读取（它们不必写入）。同样，是否有 .NET 数据结构或设计模式？
最后，所有这些工作是否只是过度设计轮子？即，例如，尝试将整个数据流吸入数组或散列并让操作系统分页担心我上面描述的问题是否更好？

我想 SQL Server 可以处理这样的问题，但我想要读取的数据可能不在数据库中，并且我不想引入对 SQL Server 的依赖。我还知道 F# 有用于延迟计算数据的序列，但我不确定这是否适用于数据的随机访问 - 即我可能想从整个流中的任何位置获取数据，并且只有从那时起我才会按顺序访问它。

原文

I want to write an app that can process large amounts of data (let's say, years of tick price data). The data can come from a file server, the Web, etc, but the idea is there's too much of it to hold in the computer's memory at one time. As I process the data, I'll write the results out (let's say, to disk), then I can discard the data.

I'm working in F# so feedback relating to .NET is most helpful. I don't have to have concrete answers, just pointers to good reading in this problem area would be very much appreciated.

Is there a design pattern or toolkit for this? It seems similar to dataflow programming, in that I only want to work on part of the available data at one time, except that unlike dataflow programming I want to pull the data in rather than wait for it to arrive and then react.
I also want to do parallel processing of this data. The way I'm currently thinking of structuring this is:
a. Each thread requests some data to work with.
b. A data reader pulls in as large a chunk of the requested data as can be cached in the computer's memory. When the thread finishes with this chunk, another chunk can be pulled in and cached.
c. The data reader also knows what chunks are currently cached, so that if multiple threads request the same chunk, they can all read from the same cache (they won't have to write to it).
Again, is there a .NET data structure or design pattern for this?
Finally, is all this work just overengineering the wheel? I.e., for instance, is it better to just try to suck the entire data stream into an array or hash and let OS paging worry about the issues I describe above?

I imagine SQL Server deals with issues like this, but the data I want to read might not be in a database and I'd prefer not to introduce a dependency on SQL Server. I also know that F# has sequences for lazy evaluation of data, but I'm not sure that applies to random access of data - i.e. I might want to get data from any place in the entire stream, and only from that point will I be accessing it sequentially.

分享到QQ

分享到微博