How to get the line count of a large file cheaply in Python
How do I get the line count of a large file in the most memory- and time-efficient way?
def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1
Answers (30)
One line, faster than the OP's for loop (although not the fastest) and very concise. You can also boost the speed (and robustness) by using rbU mode and wrapping it in a with block to close the file. Note: the U in rbU mode has been deprecated since Python 3.3, so we should use rb instead of rbU (and U was removed entirely in Python 3.11).
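The two snippets this answer refers to are probably along these lines (a minimal sketch; the file name is a placeholder):

    num_lines = sum(1 for _ in open("myfile.txt"))

and, with the with block and binary mode:

    with open("myfile.txt", "rb") as f:
        num_lines = sum(1 for _ in f)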
You can't get any better than that. After all, any solution will have to read the entire file, figure out how many \n characters you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, but it looks like you have that covered.

[Edit May 2023] As commented in many other answers, in Python 3 there are better alternatives. The for loop is not the most efficient. For example, using mmap or buffers is more efficient.
I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times and calculated the average run time for a 1.2-million-line text file.

Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

Here are my results:

Numbers for Python 2.6:

So the buffer read strategy seems to be the fastest for Windows/Python 2.6.

Here is the code:
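The code listing itself is not preserved above; a sketch of the four functions as described (written here for Python 3, while the original benchmark used Python 2.5/2.6) might look like this:

    import mmap

    def opcount(filename):
        # The OP's function: enumerate the lines.
        with open(filename) as f:
            i = -1
            for i, _ in enumerate(f):
                pass
        return i + 1

    def simplecount(filename):
        # Simple iteration over the lines in the file.
        lines = 0
        with open(filename) as f:
            for _ in f:
                lines += 1
        return lines

    def mapcount(filename):
        # readline over a read-only memory map (mmap requires a non-empty file).
        with open(filename, "rb") as f:
            buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            lines = 0
            while buf.readline():
                lines += 1
            buf.close()
            return lines

    def bufcount(filename):
        # Buffered read: count newline bytes in 1 MiB chunks.
        lines = 0
        buf_size = 1024 * 1024
        with open(filename, "rb") as f:
            buf = f.read(buf_size)
            while buf:
                lines += buf.count(b"\n")
                buf = f.read(buf_size)
        return lines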
All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered. Using a separate generator function, it runs a smidge faster. The same thing can be done entirely with inline generator expressions using itertools, but it ends up looking pretty weird:
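A sketch of the three variants described above, assuming illustrative function names (a plain loop over raw reads, a version with a separate generator function, and the inline itertools version):

    from itertools import repeat, takewhile

    def rawcount(filename):
        # Plain loop over raw 1 MiB reads, counting newline bytes.
        with open(filename, "rb") as f:
            lines = 0
            read_f = f.raw.read
            buf = read_f(1024 * 1024)
            while buf:
                lines += buf.count(b"\n")
                buf = read_f(1024 * 1024)
            return lines

    def _make_gen(reader):
        # Yield successive raw buffers until EOF.
        buf = reader(1024 * 1024)
        while buf:
            yield buf
            buf = reader(1024 * 1024)

    def rawgencount(filename):
        # Same idea, but using a separate generator function.
        with open(filename, "rb") as f:
            return sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))

    def rawincount(filename):
        # Fully inline version using itertools; admittedly weird looking.
        with open(filename, "rb") as f:
            bufgen = takewhile(lambda x: x,
                               (f.raw.read(1024 * 1024) for _ in repeat(None)))
            return sum(buf.count(b"\n") for buf in bufgen)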
Here are my timings:
You could execute a subprocess and run wc -l filename.
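For example (a sketch; this assumes a Unix-like system with wc available on the PATH):

    import subprocess

    def file_len(filename):
        # Ask wc(1) to count the lines and parse its output.
        out = subprocess.run(["wc", "-l", filename],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.split()[0])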
After a perfplot analysis, one has to recommend the buffered read solution. It's fast and memory-efficient; most other solutions are about 20 times slower.

Code to reproduce the plot:
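The original plotting script is only sketched here: the setup file, the two kernels shown, and the n_range are placeholders rather than the full benchmarked set.

    import perfplot

    def setup(n):
        # Write a throwaway file with n lines and hand its name to the kernels.
        fname = "lines.txt"
        with open(fname, "w") as f:
            f.write("some line of text\n" * n)
        return fname

    def sum_one(fname):
        with open(fname) as f:
            return sum(1 for _ in f)

    def buf_count_newlines(fname):
        # The buffered-read approach being recommended.
        count = 0
        with open(fname, "rb") as f:
            while True:
                buf = f.read(2 ** 20)
                if not buf:
                    break
                count += buf.count(b"\n")
        return count

    perfplot.show(
        setup=setup,
        kernels=[sum_one, buf_count_newlines],
        n_range=[2 ** k for k in range(5, 23)],
        xlabel="number of lines",
    )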
A one-line Bash solution similar to this answer, using the modern subprocess.check_output function:
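A sketch of that kind of one-liner (the file name is a placeholder; wc must be available):

    import subprocess

    num_lines = int(subprocess.check_output(["wc", "-l", "myfile.txt"]).split()[0])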
Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. My test improves counting a 20-million-line file from 26 seconds to 7 seconds using an 8-core Windows 64-bit server. Note: not using memory mapping makes things much slower.
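The original program is fairly long; a condensed sketch of the idea, where each worker memory-maps the file and counts newlines in its own byte range (the file name and chunking are illustrative):

    import mmap
    import multiprocessing as mp
    import os

    def _count_range(args):
        # Worker: count b'\n' within [start, end) of the memory-mapped file.
        filename, start, end = args
        with open(filename, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0
            pos = start
            while pos < end:
                block_end = min(pos + 2 ** 20, end)
                total += mm[pos:block_end].count(b"\n")
                pos = block_end
            return total

    def parallel_line_count(filename, workers=None):
        workers = workers or mp.cpu_count()
        size = os.path.getsize(filename)
        if size == 0:
            return 0
        chunk = -(-size // workers)  # ceiling division
        ranges = [(filename, i * chunk, min((i + 1) * chunk, size))
                  for i in range(workers)]
        with mp.Pool(workers) as pool:
            return sum(pool.map(_count_range, ranges))

    if __name__ == "__main__":
        print(parallel_line_count("big.txt"))  # placeholder file name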
I would use Python's file object method readlines. This opens the file, creates a list of the lines in the file, counts the length of the list, saves that to a variable, and closes the file again:
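A sketch (the file name is a placeholder):

    with open("myfile.txt") as f:
        line_count = len(f.readlines())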
This is the fastest thing I have found using pure Python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer. I found the answer here: Why is reading lines from stdin much slower in C++ than Python? and tweaked it just a tiny bit. It's a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.
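A sketch of that approach, with the buffer size exposed as a parameter (2**16 being the sweet spot reported above):

    def count_lines(path, buffer=2 ** 16):
        # Read binary chunks of `buffer` bytes and count the newline bytes.
        lines = 0
        with open(path, "rb") as f:
            chunk = f.read(buffer)
            while chunk:
                lines += chunk.count(b"\n")
                chunk = f.read(buffer)
        return lines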
Here is what I use, and it seems pretty clean:
This is marginally faster than using pure Python, but at the cost of memory usage: subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.
One-line solution:
My snippet:
Output:
Kyle's answer is probably best. An alternative for this is:
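For reference, Kyle's one-liner and one plausible alternative of that kind (illustrative only, not necessarily the exact snippets being compared):

    num_lines = sum(1 for line in open("my_file.txt"))
    num_lines = len(open("my_file.txt").read().splitlines())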
Here is a comparison of the performance of both:
I got a small (4-8%) improvement with this version, which reuses a constant buffer, so it should avoid any memory or GC overhead:
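A sketch of such a reusable-buffer version using readinto, counting only the bytes actually read on each pass:

    def buffered_count(filename, buf_size=1024 * 1024):
        # Allocate one buffer up front and refill it in place with readinto().
        buf = bytearray(buf_size)
        lines = 0
        with open(filename, "rb") as f:
            while True:
                n = f.readinto(buf)
                if not n:
                    break
                # Only count within the n bytes actually read on this pass.
                lines += buf[:n].count(b"\n")
        return lines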
You can play around with the buffer size and maybe see a little improvement.
As for me, this variant will be the fastest, for two reasons: buffering is faster than reading line by line, and string.count is also very fast.
This code is shorter and clearer. It's probably the best way:
Just to complete the methods in the previous answers, I tried a variant with the fileinput module:
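The fileinput variant might look like this (a sketch):

    import fileinput

    def count_lines_fileinput(filename):
        # fileinput adds per-line bookkeeping (filename/line-number tracking),
        # which is likely why it scales so poorly here.
        count = 0
        for _ in fileinput.input(files=(filename,)):
            count += 1
        fileinput.close()
        return count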
And passed a 60-million-line file to all the methods stated in the previous answers:
It's a bit of a surprise to me that fileinput is that bad and scales far worse than all the other methods...
I have modified the buffer case so that empty files and a final line without a trailing \n are also counted:
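A sketch of such a modified buffer version; the point is the handling of the final chunk:

    def bufcount(filename, buf_size=1024 * 1024):
        lines = 0
        last = b"\n"  # so an empty file yields 0
        with open(filename, "rb") as f:
            buf = f.read(buf_size)
            while buf:
                lines += buf.count(b"\n")
                last = buf[-1:]
                buf = f.read(buf_size)
        if last != b"\n":
            lines += 1  # count a final line that has no trailing newline
        return lines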
There are already so many answers with great timing comparisons, but I believe they are just looking at the number of lines to measure performance (e.g., the great graph from Nico Schlömer).

To be accurate while measuring performance, we should also look at the length of the lines and the size of the file, not just the number of lines.

First of all, the function of the OP (with a for) and the function sum(1 for line in f) are not performing that well... Good contenders use mmap or a buffer.

To summarize, based on my analysis (Python 3.9 on Windows with an SSD):

For big files with relatively short lines (within 100 characters): use the function with a buffer, buf_count_newlines_gen.
For files with potentially longer lines (up to 2000 characters), disregarding the number of lines: use the function with mmap, count_nb_lines_mmap.
For short code with very good performance (especially for files up to a medium size):

Here is a summary of the different metrics (average time with timeit over 7 runs with 10 loops each) for count_nb_lines_blocks, count_nb_lines_mmap, buf_count_newlines_gen, and itercount:

Note: I have also compared count_nb_lines_mmap and buf_count_newlines_gen on an 8 GB file with 9.7 million lines of more than 800 characters. We got an average of 5.39 seconds for buf_count_newlines_gen vs. 4.2 seconds for count_nb_lines_mmap, so the latter function seems indeed better for files with longer lines.

Here is the code I have used:
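Sketches of the two main contenders as described (the exact benchmarked implementations may differ in detail):

    import mmap

    def buf_count_newlines_gen(fname):
        # Buffered read: sum newline counts over successive 1 MiB raw chunks.
        def _make_gen(reader):
            while True:
                b = reader(2 ** 20)
                if not b:
                    break
                yield b

        with open(fname, "rb") as f:
            return sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))

    def count_nb_lines_mmap(fname):
        # Memory-map the file and count b'\n' without reading it chunk by chunk.
        with open(fname, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lines = 0
            pos = mm.find(b"\n")
            while pos != -1:
                lines += 1
                pos = mm.find(b"\n", pos + 1)
            return lines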
If one wants to get the line count cheaply in Python on Linux, I recommend this method:
file_path can be either an absolute or a relative path. Hope this may help.
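A sketch of this kind of method (an assumption: shelling out to wc via os.popen, which is what makes it Linux-specific):

    import os
    import shlex

    def count_lines(file_path):
        # Works with absolute or relative paths; relies on wc from coreutils.
        out = os.popen("wc -l " + shlex.quote(file_path)).read()
        return int(out.split()[0])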
This is a meta-comment on some of the other answers.

The line-reading and buffered \n-counting techniques won't return the same answer for every file, because some text files have no newline at the end of the last line. You can work around this by checking the last byte of the last nonempty buffer and adding 1 if it's not b'\n'.

In Python 3, opening the file in text mode and in binary mode can yield different results, because text mode by default recognizes CR, LF, and CRLF as line endings (converting them all to '\n'), while in binary mode only LF and CRLF will be counted if you count b'\n'. This applies whether you read by lines or into a fixed-size buffer. Classic Mac OS used CR as a line ending; I don't know how common those files are these days.

The buffer-reading approach uses a bounded amount of RAM independent of file size, while the line-reading approach could read the entire file into RAM at once in the worst case (especially if the file uses CR line endings). In the worst case it may use substantially more RAM than the file size, because of the overhead from dynamic resizing of the line buffer and (if you opened in text mode) Unicode decoding and storage.

You can improve the memory usage, and probably the speed, of the buffered approach by pre-allocating a bytearray and using readinto instead of read. One of the existing answers (with few votes) does this, but it's buggy (it double-counts some bytes).

The top buffer-reading answer uses a large buffer (1 MiB). Using a smaller buffer can actually be faster because of OS readahead. If you read 32K or 64K at a time, the OS will probably start reading the next 32K/64K into the cache before you ask for it, and each trip to the kernel will return almost immediately. If you read 1 MiB at a time, the OS is unlikely to speculatively read a whole megabyte. It may preread a smaller amount, but you will still spend a significant amount of time sitting in the kernel waiting for the disk to return the rest of the data.
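A sketch combining several of these points: a pre-allocated bytearray filled with readinto, a 64 KiB buffer, and a correction for a missing final newline:

    def count_lines(path, buf_size=64 * 1024):
        buf = bytearray(buf_size)
        lines = 0
        last_byte = b"\n"  # an empty file counts as 0 lines
        with open(path, "rb") as f:
            while True:
                n = f.readinto(buf)
                if not n:
                    break
                lines += buf[:n].count(b"\n")  # count only the bytes actually read
                last_byte = bytes(buf[n - 1:n])
        if last_byte != b"\n":
            lines += 1  # the last line had no trailing newline
        return lines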
There are a lot of answers already, but unfortunately most of them are just tiny economies on a barely optimizable problem...

I worked on several projects where line counting was the core function of the software, and working as fast as possible with a huge number of files was of paramount importance.

The main bottleneck with line counting is I/O access, as you need to read each line in order to detect the line return character; there is simply no way around that. The second potential bottleneck is memory management: the more you load at once, the faster you can process, but this bottleneck is negligible compared to the first.

Hence, there are three major ways to reduce the processing time of a line count function, apart from tiny optimizations such as disabling GC collection and other micro-managing tricks:

Hardware solution: the major and most obvious way is non-programmatic: buy a very fast SSD/flash hard drive. By far, this is how you can get the biggest speed boosts.

Data preprocessing and line parallelization: this applies if you generate the files you process, can modify how they are generated, or can acceptably preprocess them. First, convert the line returns to Unix style (\n), as this saves one character compared to Windows (not a big save, but an easy gain). Secondly, and most importantly, you can potentially write lines of fixed length. If you need variable length, you can pad smaller lines as long as the length variability is not that big. This way, you can calculate the number of lines instantly from the total file size, which is much faster to access (see the sketch after this list). Also, with fixed-length lines, not only can you generally pre-allocate memory, which speeds up processing, but you can also process lines in parallel! Of course, parallelization works better with a flash/SSD disk, which has much faster random-access I/O than HDDs. Often, the best solution to a problem is to preprocess it so that it better fits your end purpose.

Disk parallelization + hardware solution: if you can buy multiple hard disks (and if possible SSD flash disks), then you can even go beyond the speed of one disk by leveraging parallelization: store your files in a balanced way (easiest is to balance by total size) among the disks, and then read from all of those disks in parallel. You can then expect a multiplier boost in proportion to the number of disks you have. If buying multiple disks is not an option, then parallelization likely won't help (except if your disk has multiple reading heads like some professional-grade disks, but even then the disk's internal cache memory and PCB circuitry will likely be a bottleneck and prevent you from fully using all heads in parallel; in addition, you would have to devise specific code for the particular hard drive, because you need to know the exact cluster mapping to store your files on clusters under different heads and read them back with different heads afterwards). Indeed, it's commonly known that sequential reading is almost always faster than random reading, and parallelization on a single disk will have performance closer to random reading than to sequential reading (you can test your hard drive's speed in both respects using CrystalDiskMark, for example).
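As a tiny illustration of the fixed-length-record idea from the second point (the record length is a made-up example value):

    import os

    RECORD_LEN = 128  # bytes per line, including the trailing '\n' (hypothetical)

    def count_fixed_length_lines(path):
        # With fixed-length lines, the count is just file size / record size:
        # no read of the file contents is needed at all.
        return os.path.getsize(path) // RECORD_LEN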
If none of those are an option, then you can only rely on micromanaging tricks to improve the speed of your line counting function by a few percent, but don't expect anything really significant. Rather, you can expect that the time you spend tweaking will be disproportionate compared to the returns in speed improvement you'll see.
Using Numba

We can use Numba to JIT (just-in-time) compile our function to machine code. def numbacountparallel(fname) runs 2.8x faster than def file_len(fname) from the question (see the sketch below the notes).

Notes:
The OS had already cached the file in memory before the benchmarks were run, as I didn't see much disk activity on my PC.
Reading the file for the first time would be much slower, making the time advantage of using Numba insignificant.
The JIT compilation takes extra time the first time the function is called.
This would be useful if we were doing more than just counting lines.
Cython is another option.
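The original numbacountparallel is not shown above; a rough sketch of the idea is to read the file's bytes into a NumPy array and count newline bytes in a parallel Numba loop:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def _count_newlines(data):
        # data is a 1-D uint8 array holding the file's raw bytes.
        n = 0
        for i in prange(data.shape[0]):
            if data[i] == 10:  # ord('\n')
                n += 1
        return n

    def numbacountparallel(fname):
        data = np.fromfile(fname, dtype=np.uint8)
        return _count_newlines(data)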
Conclusion
As counting lines will be I/O bound, use file_len(fname) from the question unless you want to do more than just count lines.
Time in seconds for 100 calls to each function
Simple methods:
Method 1
Output:
Method 2
Output:
Method 3
An alternative for big files is using xreadlines():
For Python 3, please see: What substitutes xreadlines() in Python 3?
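For reference, the Python 2 pattern looked like this, and in Python 3 the file object itself is already a lazy iterator:

    # Python 2 (xreadlines() was removed in Python 3):
    #   count = sum(1 for _ in open("myfile.txt").xreadlines())

    # Python 3: iterate the file object directly
    count = sum(1 for _ in open("myfile.txt"))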
The result of opening a file is an iterator, which can be converted to a sequence, which has a length. This is more concise than your explicit loop, and it avoids enumerate:
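A sketch (the file name is a placeholder):

    with open("myfile.txt") as f:
        line_count = len(list(f))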
This could work: