CPU-bound vs. IO-bound applications

Posted 2024-08-09 02:18:52

For 'number-crunching' style applications that use a lot of data (read: "hundreds of MB, but not into GB", i.e., it will fit comfortably into memory alongside the OS), does it make sense to read all of your data into memory before starting processing, so that the program works from RAM rather than risking becoming IO bound while reading large related datasets?

Does the answer change with different data backings? That is, would it be the same regardless of whether you were using XML files, flat files, a full DBMS, etc.?

看春风乍起 2024-08-16 02:18:52

I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.

Most database tables have regular offsets for fields in each row. For example, a customer record may be 50 bytes long and have a pants_size column starting at byte offset 12. Selecting all pants sizes is as easy as getting the values at offsets 12, 62, 112, 162, ad nauseam.
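As a rough illustration of the fixed-offset idea, here is a minimal Python sketch. The 50-byte record and the offset of 12 come from the example above; the one-byte field width, the padding layout, and the helper names `make_record` and `pants_sizes` are assumptions made up for the sketch, not how any particular database lays out its rows.

```python
import struct

RECORD_SIZE = 50   # bytes per customer record (from the example above)
PANTS_OFFSET = 12  # byte offset of pants_size within each record

def make_record(pants_size: int) -> bytes:
    """Build one hypothetical 50-byte record: 12 pad bytes, 1 size byte, 37 pad bytes."""
    return struct.pack("<12xB37x", pants_size)

def pants_sizes(buf: bytes) -> list[int]:
    """Read pants_size from every record by jumping straight to its offset."""
    count = len(buf) // RECORD_SIZE
    # Offsets visited: 12, 62, 112, 162, ... with no parsing in between.
    return [buf[i * RECORD_SIZE + PANTS_OFFSET] for i in range(count)]

if __name__ == "__main__":
    table = b"".join(make_record(s) for s in (30, 32, 34, 36))
    print(pants_sizes(table))
```

Because every record has the same length, the position of any field is pure arithmetic, which is what makes this kind of access cheap whether the bytes sit in RAM or on disk.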

XML, however, is a lousy format for fast data access. You'll need to slog through a bunch of variable-length tags and attributes to get at your data, and you won't be able to jump instantly from one record to the next, unless you first parse the file into a data structure like the one mentioned above. In that case you'd have something very much like an RDBMS, so there you go.
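To make the "parse it once into a data structure" point concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element names (`customers`, `customer`, `pants_size`) and the sample data are invented for illustration; a real file would have its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag names are assumptions for this sketch.
XML_DATA = """
<customers>
  <customer><name>Ann</name><pants_size>30</pants_size></customer>
  <customer><name>Bob</name><pants_size>34</pants_size></customer>
</customers>
"""

def load_pants_sizes(xml_text: str) -> list[int]:
    """Parse the whole document once and keep only a flat list of integers."""
    root = ET.fromstring(xml_text.strip())
    return [int(c.findtext("pants_size")) for c in root.iter("customer")]

if __name__ == "__main__":
    sizes = load_pants_sizes(XML_DATA)  # pay the parsing cost once
    print(sizes[1])                     # later lookups are plain list indexing
```

The parsing pass still has to walk every tag, but after that the data lives in a compact structure with constant-time access, which is the property the answer attributes to database tables.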

红玫瑰 2024-08-16 02:18:52

Your program is only as fast as its bottleneck. It makes sense to do things like storing your data in memory if that improves overall performance. There is no hard and fast rule that says it will, however. When you fix one bottleneck, something else becomes the bottleneck, so resolving one issue may improve performance by 1% or by 1000%, depending on what the next bottleneck is. The thing you're improving may even still be the bottleneck afterwards.
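One way to find out where the bottleneck actually is, before changing anything, is to time the loading phase and the processing phase separately. The sketch below is only an assumption about how that might look: `read_all`, `crunch`, and `profile_phases` are hypothetical names, and `crunch` is a stand-in for whatever the real computation is.

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def read_all(path):
    """IO phase: read the whole file into memory."""
    with open(path, "rb") as f:
        return f.read()

def crunch(data):
    """CPU phase: placeholder computation (sums the byte values)."""
    return sum(data)

def profile_phases(path):
    """Report how long the IO phase and the CPU phase each took."""
    data, io_seconds = timed(read_all, path)
    total, cpu_seconds = timed(crunch, data)
    return {"result": total, "io_seconds": io_seconds, "cpu_seconds": cpu_seconds}
```

If `io_seconds` turns out to be a small fraction of `cpu_seconds`, preloading everything into RAM is unlikely to change much. Note that a second run over the same file may be served from the operating system's file cache, so the first measurement is usually the more representative one for IO.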

I think about these things as generally fitting into one of three levels:

  1. Eager. When you need something from disk or from a network, or the result of a calculation, you go and get it or do it. This is the simplest to program and the easiest to test and debug, but the worst for performance. That is fine so long as this aspect isn't the bottleneck;
  2. Lazy. Once you've done a particular read or calculation, don't do it again for some period of time, which may be anything from a few milliseconds to forever. This can add a lot of complexity to your program, but if the read or calculation is expensive it can reap enormous benefits; and
  3. Over-eager. This is much like a combination of the previous two. Results are cached, but instead of doing the read or calculation only when it is requested, there is a certain amount of preemptive activity to anticipate what you might want. For example, if you read 10K from a file, there is a reasonably high likelihood that you will later want the next 10K block. Rather than delaying execution, you fetch it in advance just in case it's requested.
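The three levels can be sketched side by side on a block reader. This is a minimal illustration using the answer's own terminology; the `BlockReader` class, its method names, and the in-memory byte string standing in for a file are all assumptions made for the example. The `fetches` counter stands in for the expensive operation (a disk or network read).

```python
BLOCK = 10 * 1024  # 10K blocks, as in the example above

class BlockReader:
    """Hypothetical reader that counts how often the slow source is hit."""

    def __init__(self, data: bytes):
        self._data = data
        self._cache = {}
        self.fetches = 0

    def _fetch(self, index: int) -> bytes:
        # Stand-in for an expensive read from disk or network.
        self.fetches += 1
        start = index * BLOCK
        return self._data[start:start + BLOCK]

    def read_eager(self, index: int) -> bytes:
        # Level 1: go and get it every time it is needed.
        return self._fetch(index)

    def read_lazy(self, index: int) -> bytes:
        # Level 2: fetch once, then reuse the cached result.
        if index not in self._cache:
            self._cache[index] = self._fetch(index)
        return self._cache[index]

    def read_over_eager(self, index: int) -> bytes:
        # Level 3: cache, and also prefetch the next block just in case.
        block = self.read_lazy(index)
        if (index + 1) * BLOCK < len(self._data):
            self.read_lazy(index + 1)
        return block
```

Reading block 0 twice costs two fetches with `read_eager` and one with `read_lazy`; a single `read_over_eager(0)` costs two fetches up front, but a later request for block 1 is then free. The extra bookkeeping in the second and third methods is the complexity the answer warns about.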

The lesson to take from this is the (somewhat overused and often misquoted) line from Donald Knuth that "premature optimization is the root of all evil." Lazy and over-eager solutions add a huge amount of complexity, so there is no point in building them for something that won't yield a useful benefit.

Programmers often make the mistake of creating some highly (allegedly) optimized version of something before determining whether they need it and whether it will be useful.

My own take on this is: don't solve a problem until you have a problem.
