Most efficient way to iterate over a large file (10GB+) in Python
I'm working on a Python script that goes through two files - one containing a list of UUIDs, the other containing a large number of log entries, with each line containing one of the UUIDs from the first file. The purpose of the program is to build a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So, long story short: count how many times each UUID appears in the log file.
At the moment, I have a dictionary populated with the UUIDs as keys and 'hits' as the values. Another loop then iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list; if it matches, it increments the value.
for i, logLine in enumerate(logHandle): #start matching UUID entries in log file to UUID from rulebase
    if logFunc.progress(lineCount, logSize): #check progress
        print logFunc.progress(lineCount, logSize) #print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: #for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 #if matched, increment the relevant value in the uidHits list
            break #as we've already found the match, don't process the rest
    lineCount += 1
It works as it should - but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading the file in chunks rather than line by line would improve performance by reducing disk I/O time, but the difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
5 Answers
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say. Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
This will be pretty much optimally efficient.
A couple of comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
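A rough sketch of that recipe - the uuid() extractor below is an assumption (it treats the first whitespace-separated field as the UUID), and 'logfile.txt' is just an example path:

import collections
import itertools

def uuid(line):
    # assumed format: the UUID is the first field on each log line
    return line.split()[0]

with open('logfile.txt') as logHandle:
    # itertools.imap keeps this lazy on Python 2; on Python 3 the built-in map already is
    uidHits = collections.Counter(itertools.imap(uuid, logHandle))

If you only care about the UUIDs from file1, you can afterwards drop any keys that aren't in that list.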
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In Python 2.x it'll look something like the sketch below.
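Something along these lines - extract_uuid() and the file names 'uuids.txt' / 'logfile.txt' are placeholders for whatever your actual log format and paths are:

from collections import defaultdict

def extract_uuid(line):
    # placeholder: pull the UUID out of a log line; adjust to your log format
    return line.split()[0]

def uuids_in(handle):
    # generator: yields one UUID per log line without holding the whole file in memory
    for line in handle:
        yield extract_uuid(line)

wanted = set(line.strip() for line in open('uuids.txt'))  # the UUIDs from file1
uidHits = defaultdict(int)

with open('logfile.txt') as logHandle:
    for uid in uuids_in(logHandle):
        if uid in wanted:
            uidHits[uid] += 1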
It sounds like this doesn't actually have to be a Python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than Python can.
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure whether you'll see a performance gain, since I haven't processed 10GB of data with it yet, but you might explore this framework.
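If you do want to try it, a counting job would look roughly like the sketch below. Caveat: the Server attributes (datasource, mapfn, reducefn) and run_server(password=...) are recalled from mincemeat.py's own word-count example and should be checked against its documentation; the chunk file names and UUID extraction are placeholders:

import glob
import mincemeat

def mapfn(key, value):
    # value is one chunk of log text; assume the UUID is the first field of each line
    for line in value.splitlines():
        yield line.split()[0], 1

def reducefn(key, values):
    return sum(values)

s = mincemeat.Server()
# one datasource entry per pre-split log chunk (splitting the 10GB file is up to you)
s.datasource = dict(enumerate(open(name).read() for name in glob.glob('log_chunk_*')))
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password='changeme')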
Try measuring where most time is spent, using a profiler: http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data: if the list of uuids isn't very long, you may find, for example, that a large proportion of time is spent on the "if logFunc.progress(lineCount, logSize)" check. If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more. In any case, you can eliminate the lineCount variable and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
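For instance, a minimal profiling harness - main() here is just a stand-in for wherever the counting loop actually lives:

import cProfile
import pstats

def main():
    pass  # stand-in: call the existing UUID-counting loop from here

cProfile.run('main()', 'uuid_count.prof')  # run the loop under the profiler
# show the ten functions with the largest cumulative time
pstats.Stats('uuid_count.prof').sort_stats('cumulative').print_stats(10)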