Most efficient way to iterate over a large file (10GB+) in Python
I'm working on a Python script that goes through two files - one containing a list of UUIDs, the other containing a large number of log entries, with each line containing one of the UUIDs from the first file. The purpose of the program is to build a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So, long story short: count how many times each UUID appears in the log file.
At the moment, I have a dictionary populated with the UUIDs as keys and 'hits' as the values. Another loop then iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list; if it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading the file in chunks rather than line by line would improve performance by reducing disk I/O time, but the difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
5 Answers
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say. Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
This will be pretty much optimally efficient.
A couple of comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
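A rough sketch of that recipe - the uuid() extractor below is an assumption (it treats the first whitespace-separated field as the UUID), and 'logfile.txt' is just an example path:

import collections
import itertools

def uuid(line):
    # assumed format: the UUID is the first field on each log line
    return line.split()[0]

with open('logfile.txt') as logHandle:
    # itertools.imap keeps this lazy on Python 2; on Python 3 the built-in map already is
    uidHits = collections.Counter(itertools.imap(uuid, logHandle))

If you only care about the UUIDs from file1, you can afterwards drop any keys that aren't in that list.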
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In Python 2.x it'll look something like the sketch below.
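Something along these lines - extract_uuid() and the file names 'uuids.txt' / 'logfile.txt' are placeholders for whatever your actual log format and paths are:

from collections import defaultdict

def extract_uuid(line):
    # placeholder: pull the UUID out of a log line; adjust to your log format
    return line.split()[0]

def uuids_in(handle):
    # generator: yields one UUID per log line without holding the whole file in memory
    for line in handle:
        yield extract_uuid(line)

wanted = set(line.strip() for line in open('uuids.txt'))  # the UUIDs from file1
uidHits = defaultdict(int)

with open('logfile.txt') as logHandle:
    for uid in uuids_in(logHandle):
        if uid in wanted:
            uidHits[uid] += 1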
It sounds like this doesn't actually have to be a Python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than Python can.
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure whether you'll see a performance gain, since I haven't processed 10GB of data with it yet, but you might explore this framework.
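If you do want to try it, a counting job would look roughly like the sketch below. Caveat: the Server attributes (datasource, mapfn, reducefn) and run_server(password=...) are recalled from mincemeat.py's own word-count example and should be checked against its documentation; the chunk file names and UUID extraction are placeholders:

import glob
import mincemeat

def mapfn(key, value):
    # value is one chunk of log text; assume the UUID is the first field of each line
    for line in value.splitlines():
        yield line.split()[0], 1

def reducefn(key, values):
    return sum(values)

s = mincemeat.Server()
# one datasource entry per pre-split log chunk (splitting the 10GB file is up to you)
s.datasource = dict(enumerate(open(name).read() for name in glob.glob('log_chunk_*')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password='changeme')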
Try measuring where most time is spent, using a profiler: http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of uuids isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)" check. If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more. In any case, you can eliminate the lineCount variable and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
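For instance, a minimal profiling harness - main() here is just a stand-in for wherever the counting loop actually lives:

import cProfile
import pstats

def main():
    pass  # stand-in: call the existing UUID-counting loop from here

cProfile.run('main()', 'uuid_count.prof')  # run the loop under the profiler
# show the ten functions with the largest cumulative time
pstats.Stats('uuid_count.prof').sort_stats('cumulative').print_stats(10)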