C++ heavy data processing and paging

Posted 2024-10-01 12:03:22

I'm writing an application that should process large amounts of data (between 1 and 10 GB) in as close to real time as possible.

The data is present in multiple binary data files on the hard disk, each between a few KB and 128 MB. When the process starts, it is first decided which data is actually needed. Then some user settings are taken through the user interface, and the data is processed chunk by chunk: a file is loaded into memory, processed, and then cleared from memory. This processing should be fast, because the user can change some settings and have the same data reprocessed, and this user interaction should be as fluent as possible.
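Roughly, the per-file loop looks like this (a simplified sketch; ProcessBuffer stands in for the real processing and settings handling):

    #include <fstream>
    #include <string>
    #include <vector>

    // Placeholder for the actual per-file processing step.
    void ProcessBuffer(const std::vector<char>& buffer)
    {
        // ... apply the current user settings and produce output ...
    }

    void ProcessFiles(const std::vector<std::string>& paths)
    {
        for (const std::string& path : paths)
        {
            std::ifstream file(path, std::ios::binary | std::ios::ate);
            if (!file) continue;

            // Read the whole file into a temporary buffer.
            const std::streamsize size = file.tellg();
            file.seekg(0, std::ios::beg);
            std::vector<char> buffer(static_cast<size_t>(size));
            file.read(buffer.data(), size);

            ProcessBuffer(buffer);
            // buffer is destroyed here, freeing the memory before the next file.
        }
    }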

Right now, loading from disk is quite a bottleneck, and I would like to preload the data already at the stage where it is decided which files will be used. However, if I preload too much data, the OS will fall back on virtual memory and I'll get plenty of page faults, making the processing even slower.

How can I determine how much data to preload in order to keep page faults low? Can I somehow influence which data the OS keeps in memory?

Thanks!

//edit: I'm currently running on Windows 7 64-bit (the application is 32-bit, however), and the application does not need to run on arbitrary computers, only on one specific machine, since this is a research project.


Comments (3)

把昨日还给我 2024-10-08 12:03:22

For general-case random access to large binary files, I would consider using the native OS file memory-mapping API. This will most probably be the most efficient solution from a performance perspective. There is also a system API available in most OSes to lock a page in memory, but I wouldn't use it. When doing something more specific, it is in most cases possible to use smart indexing to know exactly what is where, and to solve most performance bottlenecks that way.

And yes, there is no magic: if you need all 10 GB available in RAM because they are all accessed equally often, get 16 GB of RAM for your box.
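For the Windows setup the asker describes, a minimal sketch of the memory-mapping approach might look like the following (error handling omitted; "data.bin" is a placeholder for one of the data files):

    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>

    int main()
    {
        HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING,
                                  FILE_ATTRIBUTE_NORMAL, NULL);

        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);

        // The OS pages the file in on demand and keeps it in the file cache,
        // so repeated passes over the same data hit RAM instead of the disk.
        const unsigned char* data = static_cast<const unsigned char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        // ... process data[0 .. size.QuadPart) ...
        // (VirtualLock could pin pages, but as said above I wouldn't use it.)
        // In a 32-bit process, map views of the regions you need rather than
        // many whole files at once, since the address space is limited.

        UnmapViewOfFile(data);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }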

Anonymous 2024-10-08 12:03:22

For a Windows platform, I would recommend you look into:

  • MapViewOfFile function: maps a view of a file mapping into the address space of the calling process
  • I/O Completion Ports: an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system (a minimal sketch follows this list)
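As a rough illustration of the second item, one asynchronous read through a completion port might look like this (a simplified sketch; "data.bin", the buffer size, and the completion key are placeholders, and error handling is omitted):

    #define WIN32_LEAN_AND_MEAN
    #include <windows.h>
    #include <cstdio>

    int main()
    {
        HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING,
                                  FILE_FLAG_OVERLAPPED, NULL);

        // Associate the file with a new completion port; the key (1)
        // identifies completions coming from this file.
        HANDLE port = CreateIoCompletionPort(file, NULL, 1, 0);

        static char buffer[64 * 1024];
        OVERLAPPED ov = {};  // read starting at file offset 0

        // Kick off the read; it completes in the background
        // (ReadFile typically returns FALSE with ERROR_IO_PENDING here).
        ReadFile(file, buffer, sizeof(buffer), NULL, &ov);

        // A worker thread (here simply the main thread) waits for completions.
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* done = NULL;
        if (GetQueuedCompletionStatus(port, &bytes, &key, &done, INFINITE))
            printf("read %lu bytes\n", bytes);

        CloseHandle(port);
        CloseHandle(file);
        return 0;
    }

In a real application, several reads for the next files would be outstanding at once while worker threads process completed buffers, which is what makes this model useful for overlapping disk I/O with computation.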