I have some gigantic (several gigabyte) ASCII text files that I need to read in line-by-line, convert certain columns to floating point, and do a few simple operations on these numbers. It's pretty straightforward stuff, except that I'm thinking that there has to be a way to speed it up a whole bunch. The program never uses the equivalent of 100% of a CPU core because it spends so much time waiting on I/O. At the same time, it spends enough time doing computations instead of I/O that it only does ~8-10 MB/sec of raw disk I/O. I've seen my hard drive do a lot better than that.
Would it likely help to do the I/O and processing in separate threads? If so, what's an efficient way of implementing this? An important issue is what to do with memory allocation for holding each line so that I don't bottleneck on that.
Edit: I'm using the D programming language (the version 2 standard library, mostly the higher-level functions) for most of this right now. The buffer size used by std.stdio.File is 16 KB.
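For concreteness, here is a minimal sketch of the kind of loop described above, in D2; the file name and the column index are invented for illustration:

    import std.stdio, std.conv, std.array;

    void main()
    {
        double sum = 0;
        foreach (line; File("data.txt").byLine())  // hypothetical file name
        {
            auto fields = line.split();    // split on whitespace
            sum += to!double(fields[2]);   // convert, e.g., the third column
        }
        writeln("sum = ", sum);
    }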
4 Answers
If you're not hitting 100% CPU then you're I/O bound, and won't see much (if any) improvement from multithreading - you'll just have several threads sitting waiting for I/O. Indeed, if they are accessing different parts of the file, you could introduce disk seeking and make things much worse.
Look first at the simpler things: Can you increase the amount of buffer RAM available for the I/O? (E.g. in C++, the standard I/O buffers for FILE objects are tiny, around 4 kB; setting a larger buffer, say 64 kB, can make a massive difference to the throughput.)
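Since the question mentions D, it's worth noting that std.stdio.File exposes setvbuf, so enlarging the buffer might look like the following sketch (the file name and the 64 KB size are just examples, not tuned values):

    import std.stdio;

    void main()
    {
        auto f = File("data.txt");  // hypothetical file name
        f.setvbuf(64 * 1024);       // replace the default stdio buffer with a 64 KB one
        foreach (line; f.byLine())
        {
            // ... same line-by-line processing as before ...
        }
    }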
Can you use larger buffer sizes in your I/O requests? E.g. read 64 KB of raw data into a large buffer and then process that yourself, rather than reading one line or one byte at a time.
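In D, reading raw blocks rather than lines could be done with File.byChunk (or rawRead); a sketch, again with an invented file name:

    import std.stdio;

    void main()
    {
        auto f = File("data.txt");                     // hypothetical file name
        foreach (ubyte[] chunk; f.byChunk(64 * 1024))  // 64 KB raw reads
        {
            // scan the chunk for newlines and parse fields in place; note that
            // a line can straddle two chunks, so a real version must carry the
            // partial tail over to the next iteration
        }
    }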
Are you outputting any data? By caching this in RAM instead of writing it immediately back to disk you can limit your IO to purely reading the input file, and help things go much faster.
You may find that once you are loading large buffers of data that you start to become CPU bound, at which point you can think about multithreading - one thread to read the data and other thread(s) to process it.
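One possible shape for that split in D, using std.concurrency message passing; the file name, column index, and empty-string end marker are all invented for the sketch, and error handling is omitted:

    import std.stdio, std.concurrency, std.conv, std.array;

    // processing thread: consumes lines until the empty-string sentinel arrives
    void worker()
    {
        double sum = 0;
        for (;;)
        {
            auto line = receiveOnly!string();
            if (line.length == 0)
                break;                          // end-of-input marker
            sum += to!double(line.split()[2]);  // assumed column layout
        }
        writeln("sum = ", sum);
    }

    void main()
    {
        auto tid = spawn(&worker);
        // assumes the data itself contains no blank lines, since an empty
        // string is used as the end marker
        foreach (line; File("data.txt").byLine())  // hypothetical file name
            tid.send(line.idup);                   // idup: messages must be immutable
        tid.send("");                              // signal end of input
    }

Sending one message per line has real overhead of its own; batching many lines per message would amortize it.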
Normally the OS will try to read ahead, and you should get speeds near the hard disk's limit if you are not CPU bound.
The moment you are CPU bound, you should start looking at more efficient parsing of the data.
If you've got enough RAM, you could read the whole file into a string, tokenize it on line delimiters, and process the tokens however you want.
In Java you would use a StringBuilder object to read the file contents into it. You'd also want to launch the JVM with a sufficient memory limit (2 GB in this example) using something like "java -Xmx2g YourProgram", where YourProgram is a placeholder for your main class.
If you don't want to read the whole file into a string, you could iteratively read it in batches and process the batches.
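The same whole-file idea translated to D (the language the question is about) might look like this, assuming the file really does fit in RAM:

    import std.file : readText;
    import std.algorithm : splitter;

    void main()
    {
        auto text = readText("data.txt");    // hypothetical file; loads everything at once
        foreach (line; text.splitter('\n'))  // lazily tokenize on the line delimiter
        {
            // process each line; no disk I/O happens inside this loop
        }
    }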
In fact, depending on the details of your file format, you could probably use CSVReader, an open-source Java package (project page), to read your file into memory via its readAll() method; you'll end up with a List<String[]> you can go to town on :).
First of all, I would take the program you've got, and get stackshots of it. That will tell for certain how much time is spent in I/O, and how much in CPU.
Then, if I/O is dominant, I would make sure I'm reading buffers as large as possible, to minimize disk head motions.
Then, if I'm seeing I/O waiting on CPU, followed by CPU waiting on I/O, I would try to do asynchronous I/O, so that one buffer could be loading while the CPU runs on the other. (Or you could do that with a reader thread, reading into alternate buffers.)
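If a newer D standard library is available, std.parallelism.asyncBuf implements exactly this double-buffered read-ahead: a background thread fills the next buffer while the foreach body consumes the current one. A sketch, with the usual invented file name and column index:

    import std.stdio, std.parallelism, std.conv, std.array;

    void main()
    {
        // byLineCopy is used because byLine recycles its internal buffer,
        // which must not be shared across threads
        auto lines = taskPool.asyncBuf(File("data.txt").byLineCopy);
        double sum = 0;
        foreach (line; lines)
            sum += to!double(line.split()[2]);  // assumed column layout
        writeln("sum = ", sum);
    }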
If I/O is not dominant and CPU is dominant, then I would see what the stackshots tell me about the CPU activity. If an inordinate percentage of time is being spent in the de-formatting of floating-point numbers, and if the numbers are of fairly simple format, I would consider parsing them myself, because I can take advantage of the simpler format.
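As an illustration of that last point, a hand-rolled converter for a simple fixed format (optional sign, digits, optional fraction; no exponents, no locale handling) can skip most of what a general-purpose routine does. A sketch in D; the format is an assumption, and the result is not guaranteed to round identically to the library conversion:

    // parses numbers like "-12.375"; assumes well-formed input with '.' as
    // the decimal separator and no exponent notation
    double parseSimpleDouble(const(char)[] s)
    {
        size_t i = 0;
        bool neg = s[0] == '-';
        if (neg || s[0] == '+')
            i++;
        double value = 0;
        for (; i < s.length && s[i] != '.'; i++)
            value = value * 10 + (s[i] - '0');
        if (i < s.length)  // s[i] is the decimal point
        {
            double scale = 0.1;
            for (i++; i < s.length; i++)
            {
                value += (s[i] - '0') * scale;
                scale /= 10;
            }
        }
        return neg ? -value : value;
    }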
Does that help?