Summing frequencies in a file with Python

Posted on 2024-11-05 11:53:30

I have a large file (950MB) that contains words and frequencies as follows, one per line:

word1 54
word2 1
word3 12
word4 3
word1 99
word4 147
word1 4
word2 6
etc...

I need to sum the frequencies for the words, e.g. word1 = 54 + 99 + 4 = 157, and output this to a list/file.
What is the most efficient way of doing this in Python?

What I tried to do was create a list with each line as a tuple in that list and then sum from there, but this crashed my laptop...

3 Answers

长不大的小祸害 2024-11-12 11:53:30

Try this:

from collections import defaultdict

d = defaultdict(int)

with open('file') as fh:
    for line in fh:
        word, count = line.split()
        d[word] += int(count)  # split() gives strings, so convert the count before summing
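
To produce the list/file output the question asks for, a minimal follow-up sketch could look like the following; the output filename counts_out.txt and the alphabetical ordering are assumptions, not part of the original answer:

# Write the summed counts to a file, one "word total" pair per line
# (counts_out.txt is an assumed name).
with open('counts_out.txt', 'w') as out:
    for word in sorted(d):
        out.write('%s %d\n' % (word, d[word]))
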
你是年少的欢喜 2024-11-12 11:53:30

You don't have to read the whole file into memory. You could also split the file into multiple smaller files, process each file separately and merge the results/frequencies.
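
A minimal sketch of that split-and-merge idea, reading the file in fixed-size chunks of lines and merging the per-chunk counts with collections.Counter; the chunk size and the helper name chunk_counts are illustrative, not from the original answer:

from collections import Counter

def chunk_counts(path, lines_per_chunk=1000000):
    # Yield a Counter of word -> summed frequency for each chunk of lines.
    partial = Counter()
    with open(path) as fh:
        for i, line in enumerate(fh, 1):
            word, count = line.split()
            partial[word] += int(count)
            if i % lines_per_chunk == 0:
                yield partial
                partial = Counter()
    if partial:
        yield partial

# Merge the per-chunk results into one overall count.
total = Counter()
for partial in chunk_counts('myfile.txt'):
    total.update(partial)  # Counter.update adds counts rather than replacing them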

街道布景 2024-11-12 11:53:30

950MB shouldn't be too much for most modern machines to keep in memory. I've done this plenty of times in Python programs, and my machine has 4GB of physical memory. I can imagine doing the same with less memory too.

You definitely don't want to waste memory if you can avoid it though. A previous post mentioned processing the file line by line and accumulating a result, which is the right way to do it.

If you avoid reading the whole file into memory at once, you only have to worry about how much memory your accumulated result is taking, not the file itself. It is possible to process files much larger than the one you mentioned, provided the result you keep in memory doesn't grow too large. If it does, you'll want to start saving partial results to files themselves, but it doesn't sound like this problem requires that.

Here's perhaps the simplest solution to your problem:

# Accumulate the totals line by line so the whole file never has to sit in memory.
result = {}
with open('myfile.txt') as f:
    for line in f:
        word, count = line.split()
        result[word] = int(count) + result.get(word, 0)

# Print each word with its summed frequency.
print('\n'.join('%s %d' % (word, total) for word, total in result.items()))

If you're on Linux or another UNIX-like OS, use top to keep an eye on memory usage while the program runs.
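
If you would rather check from inside the script, a small sketch using the standard-library resource module (Unix-only) can report peak memory usage; this is an addition to the answer, not something the original poster suggested:

import resource

# Peak resident set size of this process; the value is in kilobytes on Linux
# and in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak memory usage: %d' % peak)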
