Efficiently prepend text to a very large text file in Python
I have to prepend some arbitrary text to an existing, but very large (2-10 GB range) text file. With the file being so large, I'm trying to avoid reading the entire file into memory. But am I being too conservative with line-by-line iteration? Would moving to a readlines(sizehint) approach give me much of a performance advantage over my current approach?
The delete-and-move at the end is less than ideal, but as far as I know there's no way to do this sort of manipulation on linear data in place. But I'm not so well versed in Python -- maybe there's something unique to Python I can exploit to do this better?
import os
import shutil

def prependToFile(f, text):
    f_temp = generateTempFileName(f)
    inFile = open(f, 'r')
    outFile = open(f_temp, 'w')
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    for line in inFile:
        outFile.write(line)
    inFile.close()
    outFile.close()
    os.remove(f)
    shutil.move(f_temp, f)
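For reference, the readlines(sizehint) variant in question would look something like this (a sketch only; generateTempFileName is the same unspecified helper used above, and the 1 MB sizehint is an arbitrary illustrative choice):

def prependToFileReadlines(f, text, sizehint=1024 * 1024):
    f_temp = generateTempFileName(f)
    inFile = open(f, 'r')
    outFile = open(f_temp, 'w')
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    while True:
        # readlines(sizehint) returns complete lines totalling roughly
        # sizehint bytes per call, so the I/O happens in batches.
        lines = inFile.readlines(sizehint)
        if not lines:
            break
        outFile.writelines(lines)
    inFile.close()
    outFile.close()
    os.remove(f)
    shutil.move(f_temp, f)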
4 Answers
If this is on Windows NTFS, you can insert into the middle of a file. (Or so I'm told; I'm not a Windows developer.)
If this is on a POSIX (Linux or Unix) system, you should use "cat" as someone else said. cat is wickedly efficient, using every trick in the book to get optimal performance (e.g., it avoids copying buffers).
However, if you must do it in Python, the code you presented could be improved by using shutil.copyfileobj() (which takes two file handles) and tempfile.NamedTemporaryFile (which creates a file that is automatically deleted on close unless delete=False is passed):
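A minimal sketch of that approach, assuming the temp file is created in the target file's directory so the final move is a cheap rename:

import os
import shutil
import tempfile

def prependToFile(f, text):
    # delete=False keeps the temp file around after close so it can be
    # moved into place (see the note below).
    outFile = tempfile.NamedTemporaryFile(
        mode='w', dir=os.path.dirname(f) or '.', delete=False)
    outFile.write('# START\n')
    outFile.write('%s\n' % str(text))
    outFile.write('# END\n\n')
    inFile = open(f, 'r')
    # copyfileobj streams between the two handles in fixed-size chunks
    # (16 KB by default) rather than line by line.
    shutil.copyfileobj(inFile, outFile)
    inFile.close()
    outFile.close()
    shutil.move(outFile.name, f)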
I think the os.remove(f) isn't needed, as shutil.move() will replace f. However, you should test that. Also, the delete=False may not be needed, but it's safe to leave in.
What you want to do is read the file in large (anywhere from 64k to several MB) blocks and write the blocks out. In other words, instead of individual lines, use huge blocks. That way you do the fewest I/Os possible and hopefully your process is I/O-bound instead of CPU-bound.
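A sketch of that block-based approach (the 1 MB block size and the '.tmp' suffix are arbitrary illustrative choices):

import shutil

def prependToFileBlocks(f, text, block_size=1024 * 1024):
    f_temp = f + '.tmp'  # arbitrary temp-file name for illustration
    inFile = open(f, 'rb')
    outFile = open(f_temp, 'wb')
    outFile.write(('# START\n%s\n# END\n\n' % str(text)).encode())
    while True:
        # Copy in large fixed-size blocks instead of individual lines.
        block = inFile.read(block_size)
        if not block:  # EOF
            break
        outFile.write(block)
    inFile.close()
    outFile.close()
    shutil.move(f_temp, f)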
You can use tools better suited to the job:
os.system("cat file1 file2 > file3")
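Adapted to the prepend case in the question, that might look like the following sketch (header.txt, big.txt, and big.txt.new are placeholder names): write the new text to a small header file, then let cat do the heavy copying:

import os

text = 'some arbitrary text'  # placeholder
outFile = open('header.txt', 'w')
outFile.write('# START\n%s\n# END\n\n' % str(text))
outFile.close()
os.system("cat header.txt big.txt > big.txt.new")
os.rename('big.txt.new', 'big.txt')
os.remove('header.txt')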
To be honest, I would recommend you just write this in C instead if you're worried about execution time. Doing system calls from Python can be quite slow, and since you'll have to do a lot of them whether you do the line-by-line or raw block read approach, that will really drag things down.