Algorithm to find the most common words

Posted on 2025-01-28 13:16:52

自由如风 2025-02-04 13:16:52

Parse the synonyms file into a dict that maps the synonyms to their main word.
Replace every word with its main synonym if it has one, or keep it if it doesn't. If syn_dict is the dict of synonyms and word is a word, then syn_dict.get(word, word) returns word's main synonym if word is in the dict, and word itself otherwise.
Use collections.Counter to count all the words.
Use method .most_common of Counter to extract the 100 most frequent words.
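
To make the lookup-with-fallback step concrete, here is a minimal sketch on a toy mapping. The words and the helper name canonical are made up for illustration and do not come from the files discussed here:

```python
from collections import Counter

# Hypothetical toy mapping (synonym -> main word), for illustration only.
syn_dict = {"cars": "car", "automobile": "car"}

def canonical(word):
    """Return the main synonym if one is known, else the word itself."""
    return syn_dict.get(word, word)

words = ["cars", "automobile", "car", "bike", "bike"]

# Count each word under its main synonym.
counts = Counter(canonical(w) for w in words)

# most_common(n) returns the n most frequent (word, count) pairs.
top_one = counts.most_common(1)
print(top_one)  # [('car', 3)]
```

Note that "car" needs no entry of its own in the mapping: it is absent from the dict, so the fallback argument returns it unchanged.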

Emulating the text files with StringIO for reproducibility:

from io import StringIO
from collections import Counter

textfile = StringIO('''The house is comfy and complete with computers
It has been housing a computer for hours''')
synonymfile = StringIO('''computer, comfy, complete, computers
house, hours, how, housing''')

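# Map every synonym to its main word.
syn_dict = {}
for line in synonymfile:
    # The first entry on each line is the main word; the rest are its synonyms.
    word, *synonyms = map(str.strip, line.split(','))
    for s in synonyms:
        syn_dict[s] = word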

print(syn_dict)
# {'comfy': 'computer', 'complete': 'computer', 'computers': 'computer',
#  'hours': 'house', 'how': 'house', 'housing': 'house'}

# Count each lowercased word under its main word (or under itself if it has none).
c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print(c.most_common(100))
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]

With actual files it should look like this:

from collections import Counter

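# Build the synonym -> main word mapping from the synonyms file.
with open('file2.txt', 'r') as synonymfile:
    syn_dict = {}
    for line in synonymfile:
        # The first entry on each line is the main word; the rest are its synonyms.
        word, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = word

# Count the words of the text file after mapping each one to its main word.
with open('file1.txt', 'r') as textfile:
    c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print(c.most_common(100))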
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]
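
One caveat: line.lower().split() only splits on whitespace, so a token such as "computers." keeps its trailing period and will not match the synonym dict. If the real text contains punctuation, something along these lines might help. This is only a sketch: the function names load_synonyms and count_words, the regular expression, and the punctuated sample text are my own assumptions, not part of the original files.

```python
import re
from collections import Counter
from io import StringIO

# Assumption: a "word" is a run of ASCII letters, digits or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def load_synonyms(lines):
    """Build a synonym -> main word dict from comma-separated lines."""
    syn_dict = {}
    for line in lines:
        word, *synonyms = (part.strip().lower() for part in line.split(","))
        for s in synonyms:
            if s:  # ignore empty entries, e.g. from a trailing comma
                syn_dict[s] = word
    return syn_dict

def count_words(lines, syn_dict):
    """Count words after stripping punctuation and mapping synonyms."""
    return Counter(
        syn_dict.get(word, word)
        for line in lines
        for word in WORD_RE.findall(line.lower())
    )

# Same sample data as above, with punctuation added to the text.
synonyms = load_synonyms(StringIO(
    "computer, comfy, complete, computers\n"
    "house, hours, how, housing\n"
))
counts = count_words(StringIO(
    "The house is comfy, and complete with computers.\n"
    "It has been housing a computer for hours!\n"
), synonyms)

top_two = counts.most_common(2)
print(top_two)  # [('computer', 4), ('house', 3)]
```

With real files, the two StringIO objects would be replaced by the open file handles, exactly as in the version above.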

About the author: 动听の歌
