Finding the top 10 words by syllable count
I am trying to write a job that takes a text file, counts the number of syllables in each word, and ultimately returns the 10 words with the most syllables. I'm able to get all of the word/syllable pairs sorted in descending order; however, I am struggling to figure out how to return only the top 10 words. Here's my code so far:
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWordSyllables(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.word_splitter_mapper,
                   reducer=self.sorting_word_syllables),
            MRStep(reducer=self.reducer_word_sorted),
            MRStep(reducer=self.get_top_10_reducer)
        ]

    def word_splitter_mapper(self, _, line):
        #for word in line.split():
        for word in WORD_RE.findall(line):
            yield(word.lower(), None)

    def sorting_word_syllables(self, word, count):
        count = 0
        vowels = 'aeiouy'
        word = word.lower().strip()
        if word in vowels:
            count +=1
        for index in range(1,len(word)):
            if word[index] in vowels and word[index-1] not in vowels:
                count +=1
        if word.endswith('e'):
            count -= 1
        if word.endswith('le'):
            count+=1
        if count == 0:
            count +=1
        yield None, (int(count), word)

    def reducer_word_sorted(self, _, syllables_counts):
        for count, word in sorted(syllables_counts, reverse=True):
            yield (int(count), word)

    def get_top_10_reducer(self, count, word):
        self.aList = []
        for value in list(range(count)):
            self.aList.append(value)
        self.bList = []
        for i in range(10):
            self.bList.append(max(self.aList))
            self.aList.remove(max(self.aList))
        for i in range(10):
            yield self.bList[i]

if __name__ == '__main__':
    import time
    start = time.time()
    MRMostUsedWordSyllables.run()
    end = time.time()
    print(end - start)
I know my issue is with the "get_top_10_reducer" function. I keep getting ValueError: max() arg is an empty sequence.
Comments (1)
According to the error, one of your reducers has returned 0 for the count. Do you have an empty line in your input, for example? You should filter this data out as early as possible.

Overall, I think you need to remove reducer_word_sorted. There is no guarantee this returns sorted data. Instead, I think it regroups all data based on the numeric count key, then emits in a non-deterministic order to the next step.

That being said, your top 10 reducer is never using the value of the word parameter, which should actually be a list itself, grouped by each count key emitted by the previous reducer.

With reducer_word_sorted removed, sorting_word_syllables returns None for its key... This is fine because you then have all split words in a giant list, so define a regular function and use that within the reducer.
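A minimal sketch of that idea, under a few assumptions. The names count_syllables, word_syllable_pairs and top_n_reducer are hypothetical, and the syllable heuristic only loosely follows the question's rules (it checks the first letter for a vowel, which is a guess at the intent). The functions are written without mrjob so they run on their own; top_n_reducer has the (key, values) shape of a reducer, so in a job it could be a method that receives every (count, word) pair under the single None key.

```python
import heapq
import re

WORD_RE = re.compile(r"[\w']+")
VOWELS = "aeiouy"


def count_syllables(word):
    """Rough syllable heuristic; returns 0 only for an empty string."""
    word = word.lower().strip()
    if not word:
        return 0
    count = 0
    # Assumption: a leading vowel starts a syllable.
    if word[0] in VOWELS:
        count += 1
    # Each vowel that follows a non-vowel starts a new syllable.
    for i in range(1, len(word)):
        if word[i] in VOWELS and word[i - 1] not in VOWELS:
            count += 1
    # A trailing 'e' is usually silent, except in a trailing 'le'.
    if word.endswith("e"):
        count -= 1
    if word.endswith("le"):
        count += 1
    return max(count, 1)


def word_syllable_pairs(lines):
    """Yield one (count, word) pair per distinct word; blank lines yield nothing."""
    seen = set()
    for line in lines:
        for word in WORD_RE.findall(line):
            word = word.lower()
            if word not in seen:
                seen.add(word)
                yield count_syllables(word), word


def top_n_reducer(_, count_word_pairs, n=10):
    """Reducer-shaped generator: emit the n pairs with the largest counts."""
    # heapq.nlargest returns fewer than n items when the input is short,
    # so it never runs into max() on an empty sequence.
    for count, word in heapq.nlargest(n, count_word_pairs):
        yield count, word


if __name__ == "__main__":
    sample = ["the banana", "", "apple the"]
    for count, word in top_n_reducer(None, word_syllable_pairs(sample)):
        print(count, word)
```

Because every pair shares one key, a single reducer call sees the whole list, and selecting the top entries there avoids relying on the order in which a previous step emitted its output.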