Python - Most efficient way to find how often each possible pair of words occurs in the same line of a text file?

Posted on 2024-09-25 14:42:01

This particular problem is easy to solve, but I'm not so sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!

What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?

For instance, if the text contained only the following two lines:

"This is the white baseball."
"These guys have white baseball bats."

You would end up collecting the following stats:
(this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.

For the entry (baseball, white: 2), the value would be 2, since this pair of words occurs in the same line a total of 2 times.

Ideally, the stats should be placed in a dictionary whose keys are tuples with the two words in alphabetical order (i.e., you wouldn't want separate entries for "this, is" and "is, this"). We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.
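
To make the key convention concrete, here is a minimal sketch of an order-independent key. The helper name `pair_key` is hypothetical and not part of the question; lowercasing is an assumption taken from the lowercase pairs in the example stats.

```python
def pair_key(a, b):
    """Return an order-independent dictionary key for two words.

    Hypothetical helper: lowercases both words and sorts them so that
    ("this", "is") and ("is", "this") map to the same tuple.
    """
    return tuple(sorted((a.lower(), b.lower())))


stats = {}
for a, b in [("this", "is"), ("is", "This")]:
    key = pair_key(a, b)
    stats[key] = stats.get(key, 0) + 1

# Both orderings end up under the single key ("is", "this").
print(stats)
```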

Comments (1)

卸妝后依然美 2024-10-02 14:42:01
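from collections import defaultdict
import itertools as it
import re

pairs = defaultdict(int)

# `lines` is any iterable of strings, e.g. an open file object.
for line in lines:
    # Lowercase, de-duplicate and sort the words so that each pair has a
    # single alphabetized key and is counted at most once per line.
    words = sorted(set(re.findall(r'\w+', line.lower())))
    for pair in it.combinations(words, 2):
        pairs[pair] += 1

# Flatten the dictionary into (word1, word2, count) tuples.
resultList = [pair + (occurrences,) for pair, occurrences in pairs.items()]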
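
As a self-contained variant of the same idea, the sketch below wraps the loop in a function and swaps `defaultdict` for `collections.Counter`. The function name `count_line_pairs` is hypothetical, and two choices are assumptions rather than anything stated in the thread: words are lowercased, and a pair is counted at most once per line even if a word repeats on that line.

```python
from collections import Counter
from itertools import combinations
import re

WORD_RE = re.compile(r"\w+")


def count_line_pairs(lines):
    """Count how many lines each unordered word pair appears in.

    `lines` can be any iterable of strings, so an open file object works
    and the file is streamed one line at a time.
    """
    counts = Counter()
    for line in lines:
        # Distinct lowercase words, sorted so every pair tuple is alphabetized.
        words = sorted(set(WORD_RE.findall(line.lower())))
        # combinations() on a sorted list yields tuples already in sorted order.
        counts.update(combinations(words, 2))
    return counts


sample = [
    "This is the white baseball.",
    "These guys have white baseball bats.",
]
counts = count_line_pairs(sample)
print(counts[("baseball", "white")])  # 2: the pair occurs on both lines
print(counts[("is", "this")])         # 1
```

For a real file, the function could be called as `count_line_pairs(f)` inside a `with open(path, encoding="utf-8") as f:` block, where `path` is whatever file is being analyzed. The work per line grows with the square of the number of distinct words on that line, and memory grows with the number of distinct pairs in the whole file, so very long lines or very large vocabularies are the cases to watch.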