Algorithm to find the most common words

Posted on 2025-01-28 13:16:52


Comments (1)

自由如风 2025-02-04 13:16:52
  • Parse the synonyms file into a dict that maps the synonyms to their main word.
  • Replace every word by its main synonym, if it has one, or keep it if it doesn't. If syn_dict is the dict of synonyms and word is a word, then syn_dict.get(word, word) will return word's main synonym if it's in the dict, or return word otherwise.
  • Use collections.Counter to count all the words.
  • Use method .most_common of Counter to extract the 100 most frequent words.
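The lookup-with-fallback step can be sketched in isolation. This is a minimal illustration, assuming a tiny hand-written `syn_dict` whose entries are made up for the example:

```python
# Hypothetical, hand-written mapping from synonym to main word.
syn_dict = {'comfy': 'computer', 'housing': 'house'}

# dict.get(key, default) returns the default when the key is absent,
# so passing the word itself as the default leaves unknown words unchanged.
normalized = [syn_dict.get(word, word) for word in ['comfy', 'housing', 'the']]
print(normalized)
# ['computer', 'house', 'the']
```

Because the main word itself is not a key in the dict, it also falls through the default and stays as it is.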

Emulating the text files with StringIO for reproducibility:

from io import StringIO
from collections import Counter

textfile = StringIO('''The house is comfy and complete with computers
It has been housing a computer for hours''')
synonymfile = StringIO('''computer, comfy, complete, computers
house, hours, how, housing''')

syn_dict = {}
for line in synonymfile:
    word, *synonyms = map(str.strip, line.split(','))
    for s in synonyms:
        syn_dict[s] = word

print(syn_dict)
# {'comfy': 'computer', 'complete': 'computer', 'computers': 'computer',
#  'hours': 'house', 'how': 'house', 'housing': 'house'}

c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print(c.most_common(100))
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]
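Note that `line.lower().split()` splits on whitespace only, so a token such as `computers.` keeps its trailing period and would not match the synonym `computers`. If the real text contains punctuation, one possible variation is to tokenize with a regular expression instead. The sketch below is an assumption, not part of the original answer: the sample text and the pattern `[a-z']+` are illustrative and would need adjusting for digits, hyphens or non-ASCII letters.

```python
import re
from collections import Counter

# Hypothetical sample text containing punctuation and mixed case.
text = 'The computer is comfy. Comfy, complete computers!'
syn_dict = {'comfy': 'computer', 'complete': 'computer', 'computers': 'computer'}

# Keep runs of lowercase letters and apostrophes; everything else separates tokens.
tokens = re.findall(r"[a-z']+", text.lower())

# Same counting step as before, applied to the cleaned tokens.
counts = Counter(syn_dict.get(token, token) for token in tokens)
print(counts.most_common(1))
# [('computer', 5)]
```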

With actual files it should look like this:

from collections import Counter

with open('file2.txt', 'r') as synonymfile:
    syn_dict = {}
    for line in synonymfile:
        word, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = word

with open('file1.txt', 'r') as textfile:
    c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print(c.most_common(100))
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]
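Since both versions share the same logic, it could be wrapped in a single function that accepts any iterable of lines. This is only a sketch: the function name `most_common_words` and its parameters are hypothetical, chosen for illustration.

```python
from collections import Counter
from io import StringIO


def most_common_words(text_lines, synonym_lines, n=100):
    """Count words in text_lines after mapping synonyms to their main word."""
    syn_dict = {}
    for line in synonym_lines:
        # First field is the main word, the remaining fields are its synonyms.
        main, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = main
    counts = Counter(
        syn_dict.get(word, word)
        for line in text_lines
        for word in line.lower().split()
    )
    return counts.most_common(n)


# Works with open file objects, StringIO objects, or plain lists of strings.
text = StringIO('The house is comfy\nIt has been housing a computer')
synonyms = StringIO('computer, comfy\nhouse, housing')
top = most_common_words(text, synonyms, n=2)
print(top)  # 'house' and 'computer' are each counted twice
```

With real files, the same call would take the two file objects opened with `open('file1.txt')` and `open('file2.txt')`.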