What is the fastest way to read a 10 GB file from disk?

Posted 2024-08-03 20:19:30

We need to read and count the different types of messages and run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the language doesn't really matter.

I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that using memory mapping is inherently limited by the 32-bit architecture.

We tried using multiple processes, which works faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multithreading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.

The memory-mapping part is not very stable: sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe from page faults or something related to virtual memory usage. In any case, mmap cannot scale beyond 4 GB on a 32-bit architecture.

We tried Perl's IPC::Mmap and Sys::Mmap. We looked into MapReduce as well, but the problem is really I/O bound; the processing itself is sufficiently fast.

So we decided to try to optimize the basic I/O by tuning buffer size, type, etc.
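
A rough sketch of the kind of read loop we are tuning (the 1 MB buffer size and the FIX tag-35 regex are just placeholders we vary):

    use strict;
    use warnings;

    # Baseline: one sequential pass with sysread and a configurable buffer size.
    my $file     = shift @ARGV or die "usage: $0 <logfile>\n";
    my $buf_size = 1024 * 1024;        # tune: 64 KB, 256 KB, 1 MB, ...

    open my $fh, '<:raw', $file or die "open $file: $!";

    my (%count, $chunk);
    my $tail = '';
    while (my $n = sysread $fh, $chunk, $buf_size) {
        $chunk = $tail . $chunk;           # re-attach the partial line carried over
        $chunk =~ s/([^\n]*)\z//;          # strip the new partial trailing line
        $tail  = defined $1 ? $1 : '';
        # Placeholder statistic: count FIX message types (tag 35).
        $count{$1}++ while $chunk =~ /\x0135=([^\x01]+)\x01/g;
    }
    $count{$1}++ while $tail =~ /\x0135=([^\x01]+)\x01/g;
    close $fh;

    printf "35=%-4s %d\n", $_, $count{$_} for sort keys %count;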

Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?

Comments (13)

や莫失莫忘 2024-08-10 20:19:30

Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you prove that you can do more I/O than a single CPU can keep up with, don't waste your time on anything more. Anyway, you should ask: why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be work worth an order of magnitude more. Then you could put the pieces on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).

Measure, don't assume. Don't forget to flush the caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
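
For example, a minimal single-threaded sketch (the tag-35 regex is a placeholder for whatever statistics you actually collect):

    use strict;
    use warnings;

    # Plain buffered Perl I/O, one thread, one pass over the file.
    my $file = shift @ARGV or die "usage: $0 <logfile>\n";
    open my $fh, '<', $file or die "open $file: $!";

    my %count;
    while (my $line = <$fh>) {
        # Placeholder: tally FIX message types (tag 35).
        $count{$1}++ if $line =~ /\x0135=([^\x01]+)\x01/;
    }
    close $fh;

    printf "35=%-4s %d\n", $_, $count{$_} for sort keys %count;

To flush the page cache between timing runs on Linux, something like sync; echo 3 > /proc/sys/vm/drop_caches (run as root) should do it, so you measure the disk rather than RAM.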

小清晰的声音 2024-08-10 20:19:30

This all depends on what kind of preprocessing you can do, and when. On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time there isn't really any other load on the machines.

Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over Unix sockets, though, with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
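
In Perl that pipeline is just a process handle; a sketch, assuming plain gzip files and the stock zcat rather than our custom one:

    use strict;
    use warnings;

    # Read the gzipped log through a zcat pipe: the disk only moves the
    # compressed bytes, and decompression costs CPU instead.
    my $gz = shift @ARGV or die "usage: $0 <logfile.gz>\n";
    open my $fh, '-|', 'zcat', $gz or die "cannot start zcat: $!";

    my $lines = 0;
    while (my $line = <$fh>) {
        $lines++;                      # real per-message processing goes here
    }
    close $fh or die "zcat exited with status $?";

    print "$lines lines\n";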

权谋诡计 2024-08-10 20:19:30

Perhaps you've already read this forum thread, but if not:

http://www.perlmonks.org/?node_id=512221

It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.

Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.

Best of luck.

素手挽清风 2024-08-10 20:19:30

I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.

PS: the fastest read of any file is a linear read. cat file > /dev/null gives the speed at which the file can be read.

冰火雁神 2024-08-10 20:19:30

Have you thought of streaming the file and filtering any interesting results out to a secondary file? (Repeat until you have a file of manageable size.)
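
For example, a throwaway filter pass along these lines (the 35=8 pattern is just a stand-in for whatever counts as interesting):

    use strict;
    use warnings;

    # One streaming pass: copy only the interesting records to a smaller file.
    my ($in_file, $out_file) = @ARGV;
    die "usage: $0 <big log> <filtered log>\n" unless defined $out_file;

    open my $in,  '<', $in_file  or die "open $in_file: $!";
    open my $out, '>', $out_file or die "open $out_file: $!";

    while (my $line = <$in>) {
        # Placeholder filter: keep execution reports (FIX 35=8), for example.
        print {$out} $line if $line =~ /\x0135=8\x01/;
    }

    close $in;
    close $out or die "close $out_file: $!";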

遇见了你 2024-08-10 20:19:30

Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.

Realize that manipulating a 10 GB file, transferring it across the (even local) network, exploring complicated solutions, etc. all take time.
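
A sketch of that approach, assuming DBI with DBD::SQLite and a made-up two-column schema; any decent database will do:

    use strict;
    use warnings;
    use DBI;

    # Parse once, load into SQLite, then query as often as you like.
    my $file = shift @ARGV or die "usage: $0 <logfile>\n";
    my $dbh  = DBI->connect('dbi:SQLite:dbname=fixlog.db', '', '',
                            { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS messages (msg_type TEXT, line TEXT)');
    my $ins = $dbh->prepare('INSERT INTO messages (msg_type, line) VALUES (?, ?)');

    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        my ($type) = $line =~ /\x0135=([^\x01]+)\x01/;   # placeholder parse
        $ins->execute(defined $type ? $type : 'unknown', $line);
    }
    close $fh;
    $dbh->commit;                 # a single big transaction keeps the load fast

    # Example query: how many messages of each type.
    my $rows = $dbh->selectall_arrayref(
        'SELECT msg_type, COUNT(*) FROM messages GROUP BY msg_type');
    printf "%-8s %d\n", @$_ for @$rows;

    $dbh->disconnect;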

走走停停 2024-08-10 20:19:30

Since you said platform and language don't matter...

If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is with overlapped, non-OS-buffered, aligned sequential reads. You can probably get to a few GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, one or more readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there is any memory copying going on in the thread (both in kernel and user mode) dealing with the I/O, then obviously the larger the buffer to copy, the more time is wasted on copying rather than on doing the I/O, so the optimal buffer size depends on the firmware and the driver. On Windows, good values to try are multiples of 32 KB for disk I/O. Windows file buffering, memory mapping and all that stuff add overhead; they only help if you read the same data multiple times or access it randomly (or both). So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpys. If you use C#, there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.

Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it is possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency-sensitive, going beyond two cores probably adds latency. This is why drivers can push gigabytes per second while enterprise software ends up stuck at megabytes per second by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes per second on a two-core consumer CPU, if written like you were back in the 80s writing a game. The most famous example I have heard of that approaches its entire business logic in this manner is the LMAX forex exchange, which published some of its ring-buffer-based code, said to be inspired by network card drivers.

Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at the readfile source from winimage, unless you want to dig into SDK/driver samples. It may need some source code fixes to calculate performance correctly at SSD speeds. Experiment with buffer sizes as well.
In my experience, the switches /h (multi-threaded) and /o (overlapped, completion-port I/O) with an optimal buffer size (try 32, 64, 128 KB, etc.) and no Windows file buffering give the best performance when reading from an SSD (cold data) while simultaneously processing (use /a for Adler processing, as otherwise it's too CPU-bound).

树深时见影 2024-08-10 20:19:30

I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically n worker threads started at incrementing offsets of the file (0, chunk_size, 2 x chunk_size, 3 x chunk_size ... (n-1) x chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else was designing the whole thing (the workers weren't the only part of it), but that's roughly how we did it.
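
A rough sketch of that layout, using forked workers rather than threads (which the question found faster in Perl anyway); it assumes a Perl built with large-file support so seek/tell work past 2 GB, and the line counting stands in for the real per-chunk work:

    use strict;
    use warnings;

    # n workers, each owning one byte range; a line belongs to the worker whose
    # range contains its first byte, so nothing is counted twice or skipped.
    my $file    = shift @ARGV or die "usage: $0 <logfile>\n";
    my $workers = 4;
    my $size    = -s $file or die "cannot stat $file\n";
    my $chunk   = int(($size + $workers - 1) / $workers);

    my @pids;
    for my $i (0 .. $workers - 1) {
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                              # child
            my ($start, $end) = ($i * $chunk, ($i + 1) * $chunk);
            open my $fh, '<', $file or die "open $file: $!";
            if ($start > 0) {
                seek $fh, $start - 1, 0 or die "seek: $!";
                <$fh>;                    # finish the line split by the boundary
            }
            my $lines = 0;
            while (tell($fh) < $end) {
                my $line = <$fh>;
                last unless defined $line;
                $lines++;                 # real per-message work goes here
            }
            print "worker $i: $lines lines\n";
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;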

Hope it helps

蛮可爱 2024-08-10 20:19:30

Basically you need to "divide and conquer": if you have a network of computers, copy the 10 GB file to as many client PCs as possible and get each client PC to read a different offset range of the file. For an added bonus, get each PC to implement multithreading in addition to the distributed reading.

茶花眉 2024-08-10 20:19:30

I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.

愚人国度 2024-08-10 20:19:30

Hmmm, but what's wrong with the read() command in C? It usually has a 2 GB limit, so just call it 5 times in sequence. That should be fairly fast.

捂风挽笑 2024-08-10 20:19:30

If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.

If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).

思慕 2024-08-10 20:19:30

It's not stated in the problem whether sequence really matters or not. So: divide the file into equal parts, say 1 GB each, and since you are using multiple CPUs, multiple threads won't be a problem; read each part using a separate thread, and use RAM of capacity > 10 GB so that all the contents end up in RAM, read by multiple threads.
