Summing word frequencies from a file in Python
I have a large file (950MB) that contains words and frequencies as follows, one per line:
word1 54
word2 1
word3 12
word4 3
word1 99
word4 147
word1 4
word2 6
etc...
I need to sum the frequencies for each word, e.g. word1 = 54 + 99 + 4 = 157, and output the totals to a list/file.
What is the most efficient way of doing this in Python?
I tried building a list with one tuple per line and summing from there, but that crashed my laptop...
Comments (3)
Try this:
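A minimal sketch of the line-by-line approach, assuming each line holds exactly one word and one integer separated by whitespace. The function name `sum_frequencies` and the choice to skip malformed lines are illustrative assumptions, not part of the question.

```python
from collections import defaultdict


def sum_frequencies(lines):
    """Sum the count for each word across an iterable of 'word count' lines."""
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: silently skip blank or malformed lines
        word, count = parts
        totals[word] += int(count)
    return dict(totals)


# The sample rows from the question.
sample = [
    "word1 54", "word2 1", "word3 12", "word4 3",
    "word1 99", "word4 147", "word1 4", "word2 6",
]
print(sum_frequencies(sample))
```

Because the function accepts any iterable, you can pass an open file object directly (for example `with open("frequencies.txt") as f: totals = sum_frequencies(f)`, where the file name is hypothetical). Python then reads one line at a time, so only the per-word totals are held in memory.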
You don't have to read the whole file into memory. You could also split the file into multiple smaller files, process each file separately and merge the results/frequencies.
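The split-and-merge idea could look roughly like this. It is only a sketch: the chunks here are in-memory lists standing in for the smaller files, and `count_chunk` and `merge_counts` are hypothetical helper names.

```python
from collections import Counter


def count_chunk(lines):
    """Sum frequencies within one chunk of 'word count' lines."""
    partial = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # assumption: ignore malformed lines
            partial[parts[0]] += int(parts[1])
    return partial


def merge_counts(partials):
    """Add the per-chunk totals together into one Counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total


# Two lists standing in for two smaller files.
chunk_a = ["word1 54", "word2 1", "word3 12", "word4 3"]
chunk_b = ["word1 99", "word4 147", "word1 4", "word2 6"]
merged = merge_counts([count_chunk(chunk_a), count_chunk(chunk_b)])
print(merged.most_common())
```

With real files, each chunk would be an open file object, and the partial counters could be produced one at a time so that only one chunk's totals plus the running total are in memory at once.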
950MB shouldn't be too much for most modern machines to keep in memory. I've done this plenty of times in Python programs, and my machine has 4GB of physical memory. I can imagine doing the same with less memory too.
You definitely don't want to waste memory if you can avoid it though. A previous post mentioned processing the file line by line and accumulating a result, which is the right way to do it.
If you avoid reading the whole file into memory at once, you only have to worry about how much memory your accumulated result takes, not the file itself. That lets you process files much larger than the one you mentioned, provided the result you keep in memory doesn't grow too large. If it does, you'll want to start saving partial results to files, but it doesn't sound like this problem requires that.
Here's perhaps the simplest solution to your problem:
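Something along these lines should do it. This is a sketch under a few assumptions: every line is a word followed by an integer, anything else is skipped, and the function name `sum_file` and its two path parameters are made up for illustration.

```python
def sum_file(in_path, out_path):
    """Read 'word count' lines from in_path and write per-word totals to out_path."""
    totals = {}
    with open(in_path) as src:
        for line in src:  # iterating the file reads one line at a time
            parts = line.split()
            if len(parts) != 2:
                continue  # assumption: skip blank or malformed lines
            word, count = parts
            totals[word] = totals.get(word, 0) + int(count)

    with open(out_path, "w") as dst:
        for word in sorted(totals):
            dst.write(f"{word} {totals[word]}\n")
    return totals
```

You would call it as `sum_file("frequencies.txt", "totals.txt")` (both file names are placeholders). Memory use depends on the number of distinct words, not on the size of the input file.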
If you're on Linux or another UNIX-like OS, use `top` to keep an eye on memory usage while the program runs.
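If you'd rather measure from inside the program, the standard library's `tracemalloc` module is one option. Note that it is not the same measurement as `top`: it reports memory allocated by Python objects, not the process's resident size. The helper name `peak_memory_of` and the toy workload below are assumptions for illustration.

```python
import tracemalloc


def peak_memory_of(func, *args):
    """Run func(*args); return (result, peak bytes traced by tracemalloc during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak


# Toy workload: build a list of strings and see how much memory it needed.
result, peak = peak_memory_of(lambda n: [str(i) for i in range(n)], 100_000)
print(f"peak traced memory: {peak / 1024:.0f} KiB")
```

Tracing slows the program down noticeably, so it is better suited to a test run on a slice of the file than to the full 950MB job.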