Writing data in chunks while processing: is there a convergence value due to hardware constraints?

Posted 2024-08-17 02:10:04

I'm processing data from a hard disk from one large file (processing is fast and not a lot of overhead) and then have to write the results back (hundreds of thousands of files).

I started by writing the results straight to files, one at a time, which was the slowest option. It gets a lot faster if I build a vector holding a certain number of files and write them all at once, then go back to processing while the hard disk is busy writing everything I poured into it (that at least seems to be what happens).

My question is: can I somehow estimate, from the hardware constraints, a convergence value for the amount of data I should write at once? To me it looks like a hard disk buffer effect. The disk has a 16MB buffer and I get these values (all for ~100000 files):

Buffer size      time (minutes)
------------------------------
no Buffer        ~ 8:30
 1 MB            ~ 6:15
10 MB            ~ 5:45
50 MB            ~ 7:00

Or is this just a coincidence?

I would also be interested in experience or rules of thumb on how to optimize write performance in general, for example whether larger hard disk blocks help.

Edit:

Hardware is a pretty standard consumer drive (I'm a student, not a data center): WD 3.5" 1TB/7200/16MB/USB2, HFS+ journalled, OS is MacOS 10.5. I'll soon give it a try on Ext3/Linux and on an internal disk rather than an external one.

Comments (4)

初相遇 2024-08-24 02:10:04

Can I somehow estimate a convergence value for the amount of data that I should write from the hardware constraints?

Not in the long term. The problem is that your write performance is going to depend heavily on at least four things:

  • Which filesystem you're using

  • What disk-scheduling algorithm the kernel is using

  • The hardware characteristics of your disk

  • The hardware interconnect you're using

For example, USB is slower than IDE, which is slower than SATA. It wouldn't surprise me if XFS were much faster than ext2 for writing many small files. And kernels change all the time. So there are just too many factors here to make simple predictions easy.

If I were you I'd take these two steps:

  • Split my program into multiple threads (or even processes) and use one thread to deliver system calls open, write, and close to the OS as quickly as possible. Bonus points if you can make the number of threads a run-time parameter.

  • Instead of trying to estimate performance from hardware characteristics, write a program that tries a bunch of alternatives and finds the fastest one for your particular combination of hardware and software on that day. Save the fastest alternative in a file or even compile it into your code. This strategy was pioneered by Matteo Frigo for FFTW and it is remarkably effective.

Then when you change your disk, your interconnect, your kernel, or your CPU, you can just re-run the configuration program and presto! Your code will be optimized for best performance.
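As a rough illustration of the "try a bunch of alternatives" step, here is a minimal Python sketch (not the answerer's code). It times several batch sizes against a scratch directory and reports the fastest. The names `write_batched`, `fastest_batch_size`, and `sample_files` are hypothetical, and `sample_files` merely stands in for the real processing output. For a meaningful result, `base_dir` should point at the disk actually being written to, and the run should be large enough that the OS write cache does not hide the differences.

```python
import os
import tempfile
import time


def flush(out_dir, batch):
    """Write every (name, data) pair in the batch to its own file."""
    for name, data in batch:
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)


def write_batched(out_dir, files, batch_bytes):
    """Collect results in memory and flush whenever the batch reaches batch_bytes.

    batch_bytes == 0 flushes after every file, i.e. the "no buffer" case.
    """
    pending, pending_size = [], 0
    for name, data in files:
        pending.append((name, data))
        pending_size += len(data)
        if pending_size >= batch_bytes:
            flush(out_dir, pending)
            pending, pending_size = [], 0
    flush(out_dir, pending)  # write whatever is left over


def fastest_batch_size(candidates, make_files, base_dir=None):
    """Time each candidate in a fresh temp directory; return (best, timings)."""
    timings = {}
    for batch_bytes in candidates:
        with tempfile.TemporaryDirectory(dir=base_dir) as out_dir:
            start = time.perf_counter()
            write_batched(out_dir, make_files(), batch_bytes)
            timings[batch_bytes] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings


def sample_files(count=200, size=4096):
    """Hypothetical stand-in for the real processing output."""
    return ((f"out_{i:06d}.bin", bytes(size)) for i in range(count))


if __name__ == "__main__":
    mb = 1024 * 1024
    best, timings = fastest_batch_size([0, 1 * mb, 10 * mb, 50 * mb], sample_files)
    print("fastest batch size:", best)
    print(timings)
```

The winning value could then be saved to a small config file and re-measured whenever the disk, interconnect, or kernel changes.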

天暗了我发光 2024-08-24 02:10:04

The important thing here is to get as many outstanding writes as possible, so the OS can optimize hard disk access. This means using async I/O, or using a task pool to actually write the new files to disk.

That being said, you should also look at optimizing your read access. Operating systems (at least Windows) are already really good at helping write access via buffering "under the hood", but if you're reading serially there isn't much they can do to help. If you use async I/O or (again) a task pool to process/read multiple parts of the file at once, you'll probably see increased performance.
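A task pool for the writes might look like the following Python sketch, which uses the standard `concurrent.futures.ThreadPoolExecutor`. The helper names `write_file` and `write_all` and the default of 4 workers are assumptions for illustration, not something the answer specifies. Note that this version holds every pending result in memory until it is written, so a real run over hundreds of thousands of files would probably submit in bounded chunks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_file(path, data):
    """Write one result file and return its path."""
    with open(path, "wb") as f:
        f.write(data)
    return path


def write_all(out_dir, files, workers=4):
    """Hand every (name, data) pair to a thread pool so several writes are outstanding at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_file, os.path.join(out_dir, name), data)
            for name, data in files
        ]
        # result() re-raises any I/O error that occurred in a worker thread
        return [future.result() for future in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        written = write_all(out_dir, ((f"part_{i}.bin", bytes(1024)) for i in range(100)))
        print(len(written), "files written")
```

Making `workers` a run-time parameter keeps it easy to measure which thread count works best on a given disk.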

Saygoodbye 2024-08-24 02:10:04

Parsing XML should be doable at practically disk read speed, tens of MB/s. Your SAX implementation might not be doing that.

You might want to use some dirty tricks. Writing hundreds of thousands of files is not going to be efficient with the normal API.

Test this by first writing sequentially to a single file instead of 100,000 files. Compare the performance. If the difference is interesting, read on.

If you really understand the file system you're writing to, you can make sure you're writing a contiguous block that you later split into multiple files in the directory structure.

You want smaller blocks in this case, not larger ones, as your files are going to be small. All free space in a block is going to be zeroed.

[edit] Do you really have an external need for those 100K files? A single file with an index could be sufficient.
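The "single file with an index" idea can be sketched in Python as below. The layout (one binary data file plus a JSON index mapping each name to its offset and length) and the names `write_packed` and `read_record` are assumptions chosen for illustration; the answer does not prescribe a format.

```python
import json
import os
import tempfile


def write_packed(data_path, index_path, records):
    """Append every record to one data file; store name -> [offset, length] in a JSON index."""
    index = {}
    with open(data_path, "wb") as data_file:
        for name, payload in records:
            index[name] = [data_file.tell(), len(payload)]
            data_file.write(payload)
    with open(index_path, "w", encoding="utf-8") as index_file:
        json.dump(index, index_file)
    return index


def read_record(data_path, index_path, name):
    """Look up one record by name and read only its bytes from the data file."""
    with open(index_path, encoding="utf-8") as index_file:
        offset, length = json.load(index_file)[name]
    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        return data_file.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        data_path = os.path.join(out_dir, "results.bin")
        index_path = os.path.join(out_dir, "results.idx")
        write_packed(data_path, index_path, [("a.xml", b"<a/>"), ("b.xml", b"<b/>")])
        print(read_record(data_path, index_path, "b.xml"))
```

Because all payloads go into one sequentially written file, the disk sees one long write instead of one create/write/close cycle per result.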

空城之時有危險 2024-08-24 02:10:04

Expanding on Norman's answer: if your files are all going into one filesystem, use only one helper thread.

Communication between the read thread and the write helper(s) uses a double buffer of two std::vector objects per helper: one buffer is owned by the write thread and the other by the read thread. The read thread fills its buffer up to a specified limit and then blocks. The write thread times each write with gettimeofday or similar and adjusts the limit: if writing went faster than last time, increase the limit by X%; if it went slower, decrease it by X%. X can be small.
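The scheme above could be sketched roughly as follows. This is a Python approximation under several assumptions, not the answerer's C++ design: Python lists stand in for the two std::vector buffers, `time.perf_counter` replaces gettimeofday, and the names `adjust_limit` and `AdaptiveWriter`, the 5% step, and the 64 KiB floor are all hypothetical.

```python
import os
import queue
import tempfile
import threading
import time


def adjust_limit(limit, speed, last_speed, step=0.05, floor=64 * 1024):
    """Grow the limit by `step` if the last batch was written faster, shrink it otherwise."""
    if last_speed is None:  # first batch: nothing to compare against yet
        return limit
    factor = 1 + step if speed > last_speed else 1 - step
    return max(floor, int(limit * factor))


class AdaptiveWriter:
    """One helper thread; the reader fills one batch while the helper writes the other."""

    def __init__(self, out_dir, initial_limit=1024 * 1024):
        self.out_dir = out_dir
        self.limit = initial_limit  # bytes per batch, tuned by the helper thread
        self._handoff = queue.Queue()
        self._batch, self._size = [], 0
        self._last_speed = None
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def add(self, name, data):
        """Called by the reader thread for every finished result."""
        self._batch.append((name, data))
        self._size += len(data)
        if self._size >= self.limit:
            self._swap()

    def _swap(self):
        self._handoff.join()  # block until the helper has finished the previous batch
        self._handoff.put((self._batch, self._size))
        self._batch, self._size = [], 0

    def close(self):
        if self._batch:
            self._swap()
        self._handoff.put(None)  # sentinel: tell the helper to stop
        self._thread.join()

    def _run(self):
        while True:
            item = self._handoff.get()
            if item is None:
                return
            batch, size = item
            start = time.perf_counter()
            for name, data in batch:
                with open(os.path.join(self.out_dir, name), "wb") as f:
                    f.write(data)
            speed = size / max(time.perf_counter() - start, 1e-9)  # bytes per second
            self.limit = adjust_limit(self.limit, speed, self._last_speed)
            self._last_speed = speed
            self._handoff.task_done()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        writer = AdaptiveWriter(out_dir)
        for i in range(1000):
            writer.add(f"out_{i:06d}.bin", bytes(4096))
        writer.close()
        print("final limit:", writer.limit)
```

Timings of individual batches are noisy, so a small X keeps the limit from oscillating wildly; averaging over a few batches before adjusting would be a natural refinement.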
