Parse the synonyms file into a dict that maps the synonyms to their main word.
Replace every word with its main synonym if it has one; otherwise keep it unchanged. If syn_dict is the dict of synonyms and word is a word, then syn_dict.get(word, word) returns word's main synonym if word is in the dict, and word itself otherwise.
Use collections.Counter to count all the words.
Use Counter's .most_common method to extract the 100 most frequent words.
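The fallback behaviour of dict.get and the ordering of .most_common can be checked in isolation. The following is a minimal sketch; the two-entry mapping and the word list are made up for illustration and are not taken from a real synonyms file:

```python
from collections import Counter

# Hypothetical mapping from synonym to main word, for illustration only.
syn_dict = {'comfy': 'computer', 'housing': 'house'}

# A known synonym is replaced by its main word...
print(syn_dict.get('comfy', 'comfy'))  # computer
# ...while a word without an entry falls back to itself.
print(syn_dict.get('the', 'the'))      # the

# Counter tallies the normalized words; most_common(n) returns the n
# most frequent (word, count) pairs, highest count first.
words = ['comfy', 'computer', 'housing', 'the']
c = Counter(syn_dict.get(w, w) for w in words)
print(c.most_common(1))  # [('computer', 2)]
```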
Emulating the text files with StringIO for reproducibility:
from io import StringIO
from collections import Counter

textfile = StringIO('''The house is comfy and complete with computers
It has been housing a computer for hours''')

synonymfile = StringIO('''computer, comfy, complete, computers
house, hours, how, housing''')

syn_dict = {}
for line in synonymfile:
    word, *synonyms = map(str.strip, line.split(','))
    for s in synonyms:
        syn_dict[s] = word

print(syn_dict)
# {'comfy': 'computer', 'complete': 'computer', 'computers': 'computer',
#  'hours': 'house', 'how': 'house', 'housing': 'house'}

c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())
print(c.most_common(100))
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]
With actual files it should look like this:
from collections import Counter

with open('file2.txt', 'r') as synonymfile:
    syn_dict = {}
    for line in synonymfile:
        word, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = word

with open('file1.txt', 'r') as textfile:
    c = Counter(syn_dict.get(word, word) for line in textfile for word in line.lower().split())

print(c.most_common(100))
# [('computer', 4), ('house', 3), ('the', 1), ('is', 1),
#  ('and', 1), ('with', 1), ('it', 1), ('has', 1),
#  ('been', 1), ('a', 1), ('for', 1)]
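If the same logic is needed for both in-memory text and real files, it could be wrapped in two small functions that accept any iterable of lines. This is only a sketch: the names build_syn_dict and most_common_words are invented here and are not part of the answer above.

```python
from collections import Counter
from io import StringIO


def build_syn_dict(synonym_lines):
    """Map each synonym to the first (main) word on its line."""
    syn_dict = {}
    for line in synonym_lines:
        word, *synonyms = map(str.strip, line.split(','))
        for s in synonyms:
            syn_dict[s] = word
    return syn_dict


def most_common_words(text_lines, syn_dict, n=100):
    """Count words after synonym replacement; return the n most frequent."""
    counts = Counter(
        syn_dict.get(word, word)
        for line in text_lines
        for word in line.lower().split()
    )
    return counts.most_common(n)


# Works with any iterable of lines: an open file or a StringIO.
synonyms = StringIO('computer, comfy, complete, computers\n'
                    'house, hours, how, housing')
text = StringIO('The house is comfy and complete with computers\n'
                'It has been housing a computer for hours')
top = most_common_words(text, build_syn_dict(synonyms), n=2)
print(top)  # [('computer', 4), ('house', 3)]
```

With real files, the StringIO objects would be replaced by the file objects returned by open('file2.txt') and open('file1.txt'). Note that str.split() leaves punctuation attached to words, so 'house,' and 'house' would be counted separately.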