Python - Most efficient way to find how often each possible pair of words occurs in the same line of a text file?
This particular problem is easy to solve, but I'm not so sure that the solution I'd arrive at would be computationally efficient. So I'm asking the experts!
What would be the best way to go through a large file, collecting stats (for the entire file) on how often two words occur in the same line?
For instance, if the text contained only the following two lines:
"This is the white baseball."
"These guys have white baseball bats."
You would end up collecting the following stats:
(this, is: 1), (this, the: 1), (this, white: 1), (this, baseball: 1), (is, the: 1), (is, white: 1), (is, baseball: 1) ... and so forth.
For the entry (baseball, white: 2), the value would be 2, since this pair of words occurs in the same line a total of 2 times.
Ideally, the stats should be placed in a dictionary whose keys are tuples with the two words in alphabetical order (i.e., you wouldn't want separate entries for "this, is" and "is, this"). We don't care about order here: we just want to find how often each possible pair of words occurs in the same line throughout the text.
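For reference, here is a minimal sketch of the kind of counting I have in mind, not necessarily the most efficient approach. It uses `collections.Counter` and `itertools.combinations` from the standard library. The function name `count_pairs`, the regex tokenizer, and the choice to count a pair at most once per line (even if a word repeats within that line) are my own assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def count_pairs(lines):
    """Count, for each unordered word pair, how many lines contain both words."""
    counts = Counter()
    for line in lines:
        # Assumption: lowercase, keep letters/apostrophes only, and deduplicate
        # so that a pair is counted at most once per line.
        words = sorted(set(re.findall(r"[a-z']+", line.lower())))
        # combinations() over a sorted list yields each pair exactly once,
        # with the two words already in alphabetical order.
        counts.update(combinations(words, 2))
    return counts


text = [
    "This is the white baseball.",
    "These guys have white baseball bats.",
]
stats = count_pairs(text)
print(stats[("baseball", "white")])  # 2
print(stats[("is", "this")])         # 1
```

For a real file, the same function can be fed the file object directly, e.g. `with open(path) as f: stats = count_pairs(f)`, so only one line is held in memory at a time. The work per line is still quadratic in the number of distinct words on that line, which seems unavoidable if every pair has to be recorded.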