On-demand paging to allow analysis of large amounts of data
I am working on an analysis tool that reads the output of a process and continuously converts it to an internal format. Once the "logging phase" is complete, the data is analyzed. All of the data is held in memory.
Because all logged information is held in memory, the duration of the logging is limited. For most use cases this is fine, but it should be possible to run for longer, even at the cost of performance.
Ideally, the program should start using hard drive space in addition to RAM once RAM usage reaches a certain limit.
This leads to my question:
Are there any existing solutions for doing this? It has to work on both Unix and Windows.
Comments (3)
To spill to disk once memory is full, we use caching libraries such as EhCache. They can be configured with the amount of memory to use and to overflow to disk beyond that.
They also offer smarter policies that you can configure as needed, such as moving data that has not been used in the last 10 minutes to disk. This could be a plus for you.
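To make the overflow idea concrete, here is a minimal sketch in Python. It is not EhCache's API (EhCache is a Java library); the class name `SpillCache` and its methods are hypothetical, and it assumes the values are picklable and that counting items is an acceptable stand-in for measuring memory use. It keeps the most recently used items in RAM and writes the least recently used ones to files in a temporary directory.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class SpillCache:
    """Hypothetical cache: at most `max_in_memory` items stay in RAM,
    the least recently used ones are pickled to disk."""

    def __init__(self, max_in_memory, spill_dir=None):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="spill_")
        self._hot = OrderedDict()  # key -> value, least recently used first
        self._cold = {}            # key -> path of the pickled value on disk
        self._counter = 0          # used to build unique file names

    def put(self, key, value):
        self._drop_cold(key)       # a new value replaces any spilled copy
        self._hot[key] = value
        self._hot.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        path = self._cold[key]     # raises KeyError for unknown keys
        with open(path, "rb") as f:
            value = pickle.load(f)
        self._drop_cold(key)
        self._hot[key] = value     # bring the item back into memory
        self._evict()
        return value

    def _drop_cold(self, key):
        path = self._cold.pop(key, None)
        if path is not None:
            os.remove(path)

    def _evict(self):
        # Spill least recently used items until the in-memory limit holds.
        while len(self._hot) > self.max_in_memory:
            old_key, old_value = self._hot.popitem(last=False)
            path = os.path.join(self.spill_dir, f"{self._counter}.pkl")
            self._counter += 1
            with open(path, "wb") as f:
                pickle.dump(old_value, f)
            self._cold[old_key] = path


cache = SpillCache(max_in_memory=2)
for i in range(5):
    cache.put(i, f"record {i}")
print(cache.get(0))  # record 0 (reloaded from disk)
```

A time-based policy like the "not used in the last 10 minutes" rule could be layered on top by storing a last-access timestamp per key; a real caching library handles that, along with size accounting and concurrency, for you.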
Without knowing more about your application, it is not possible to give a perfect answer. However, it does sound a bit like you are reinventing the wheel. Have you considered using an in-process database library like sqlite?
If you use that or something similar, it will take care of moving the data between disk and memory, and it gives you powerful SQL query capabilities at the same time. Even if your logging data is in a custom format, a small, lightweight database may be a good fit as long as each item has a key or index of some kind.
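As a rough illustration, the sketch below uses Python's standard `sqlite3` module. The table layout (`ts`, `source`, `payload`), the function names, and the 64 MiB cache size are assumptions made for the example, not anything from the question. Records are appended in batches during the logging phase and queried with SQL afterwards, while SQLite decides which pages stay in memory.

```python
import os
import sqlite3
import tempfile

# Hypothetical location for the database file.
db_path = os.path.join(tempfile.mkdtemp(prefix="logdb_"), "log.sqlite")
conn = sqlite3.connect(db_path)

# A negative cache_size is interpreted as a size in KiB (here about 64 MiB),
# which bounds how much of the database SQLite keeps cached in RAM.
conn.execute("PRAGMA cache_size = -65536")
conn.execute(
    "CREATE TABLE IF NOT EXISTS log ("
    " id INTEGER PRIMARY KEY,"
    " ts REAL NOT NULL,"
    " source TEXT NOT NULL,"
    " payload BLOB NOT NULL)"
)
conn.execute("CREATE INDEX IF NOT EXISTS log_ts ON log (ts)")


def record(batch):
    """Logging phase: append a batch of (ts, source, payload) rows."""
    with conn:  # one transaction per batch, committed on success
        conn.executemany(
            "INSERT INTO log (ts, source, payload) VALUES (?, ?, ?)", batch
        )


def counts_by_source():
    """Analysis phase: an example aggregate query."""
    return conn.execute(
        "SELECT source, COUNT(*) FROM log GROUP BY source ORDER BY source"
    ).fetchall()


record([
    (0.1, "stdout", b"start"),
    (0.2, "stderr", b"warn"),
    (0.3, "stdout", b"done"),
])
print(counts_by_source())  # [('stderr', 1), ('stdout', 2)]
```

Batching the inserts into one transaction per batch matters for throughput, since committing every row individually forces far more disk syncs.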
This might seem too obvious, but what about memory-mapped files? They do what you want and even allow a 32-bit application to work with far more than 4 GB of data. The principle is simple: you allocate the storage you need (on disk) and then map just a portion of it into the process's address space. You could, for example, map something like 75% of the available physical memory size. You then work on that portion, and when you need another part of the data, you simply re-map. The downside is that you have to do the mapping manually, but that's not necessarily bad. The upside is that you can work with more data than fits into physical memory and into the per-process memory limit. It works really well if you actually use only part of the data at any given time.
There may be libraries that do this automatically, like the one KLE suggested (though I do not know that one). Doing it manually means you'll learn a lot about it and have more control, though I'd prefer a library if it does exactly what you want with regard to how and when the disk is used.
This works similarly on both Windows and Unix. For Windows, there is an article by Raymond Chen that shows a simple example.
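The map, work, re-map cycle described above can be sketched with Python's standard `mmap` module, which works on both Unix and Windows. The file name, the window size, and the helpers `write_at` and `read_at` are hypothetical, and the sketch assumes that a single record never straddles a window boundary. The one platform detail to watch is that the mapping offset must be a multiple of `mmap.ALLOCATIONGRANULARITY`.

```python
import mmap
import os
import tempfile

# Window size; offsets passed to mmap must be multiples of the granularity.
WINDOW = 4 * mmap.ALLOCATIONGRANULARITY
FILE_SIZE = 4 * WINDOW  # tiny for the demo; a real file could be many GB

path = os.path.join(tempfile.mkdtemp(prefix="mmap_"), "records.bin")

# "Allocate" the backing store on disk by extending the file to full size.
with open(path, "wb") as f:
    f.truncate(FILE_SIZE)


def _window_base(position, length):
    """Start offset of the window holding [position, position + length)."""
    base = (position // WINDOW) * WINDOW
    if position + length > base + WINDOW:
        raise ValueError("record straddles a window boundary")
    return base


def write_at(f, position, data):
    """Map only the window containing `position` and write `data` there."""
    base = _window_base(position, len(data))
    with mmap.mmap(f.fileno(), WINDOW, access=mmap.ACCESS_WRITE, offset=base) as view:
        start = position - base
        view[start:start + len(data)] = data


def read_at(f, position, length):
    """Map only the window containing `position` and copy `length` bytes out."""
    base = _window_base(position, length)
    with mmap.mmap(f.fileno(), WINDOW, access=mmap.ACCESS_READ, offset=base) as view:
        start = position - base
        return bytes(view[start:start + length])


with open(path, "r+b") as f:
    write_at(f, 3 * WINDOW + 10, b"late record")
    write_at(f, 5, b"early record")
    print(read_at(f, 3 * WINDOW + 10, 11))  # b'late record'
```

Mapping and unmapping on every access keeps the example short; in practice you would keep the current window mapped and re-map only when an access falls outside it.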