Finding the top ten words by syllable count

Posted on 2025-01-12 12:20:26

I am trying to write a job that takes in a text file, counts the number of syllables in each word, and ultimately returns the top 10 words with the most syllables. I can get all of the word/syllable pairs sorted in descending order, but I am struggling to figure out how to return only the top 10 words. Here's my code so far:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re
WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWordSyllables(MRJob):
    
    def steps(self):
        return [
            MRStep(mapper=self.word_splitter_mapper,
                   reducer=self.sorting_word_syllables),
            MRStep(reducer=self.reducer_word_sorted),
            MRStep(reducer=self.get_top_10_reducer)
        ]
    
    def word_splitter_mapper(self, _, line):
        #for word in line.split():
        for word in WORD_RE.findall(line):
            yield(word.lower(), None)
        
    def sorting_word_syllables(self, word, count):
        count = 0
        vowels = 'aeiouy'
        word = word.lower().strip()
        if word in vowels:
            count +=1
        for index in range(1,len(word)):
            if word[index] in vowels and word[index-1] not in vowels:
                count +=1
        if word.endswith('e'):
            count -= 1
        if word.endswith('le'):
            count+=1
        if count == 0:
            count +=1
        yield None, (int(count), word)
    
    
    
    def reducer_word_sorted(self, _, syllables_counts):
        for count, word in sorted(syllables_counts, reverse=True):
            yield (int(count), word)
            
    def get_top_10_reducer(self, count, word):
        self.aList = []
        for value in list(range(count)):
            self.aList.append(value)
        self.bList = []
        for i in range(10):
            self.bList.append(max(self.aList))
            self.aList.remove(max(self.aList))
        for i in range(10):
            yield self.bList[i]


if __name__ == '__main__':
   import time
   start = time.time()
   MRMostUsedWordSyllables.run()
   end = time.time()
   print(end - start)

I know my issue is with the "get_top_10_reducer" function. I keep getting ValueError: max() arg is an empty sequence.


Comments (1)

命硬 2025-01-19 12:20:26

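According to the error, max() was called on an empty aList. That list is built from range(count), so it holds only count elements: it is empty from the start if a count of 0 ever reaches the reducer, and it runs out before the ten-iteration loop finishes whenever count is below 10. Do you have an empty line in your input, for example? You should filter such data out as early as possible.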
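
To make the failure concrete, here is a minimal sketch in plain Python (no mrjob) of the loop inside get_top_10_reducer. The helper name take_ten_largest is made up for illustration; it copies the loop from the question and takes the count key as its only argument.

```python
def take_ten_largest(count):
    """Replay the question's loop for a single count key (illustrative helper)."""
    a_list = list(range(count))      # holds exactly `count` integers
    b_list = []
    for _ in range(10):              # always tries to take ten items
        b_list.append(max(a_list))   # raises ValueError once a_list is empty
        a_list.remove(max(a_list))
    return b_list


for count in (12, 3):
    try:
        print(count, take_ten_largest(count))
    except ValueError as exc:
        print(count, "failed:", exc)
```

With a count key of 12 the loop finishes. With a count key of 3 the list runs dry on the fourth pass and max() raises the ValueError from the question.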


Overall, I think you need to remove reducer_word_sorted. There is no guarantee that its output reaches the next step in sorted order. Instead, I think the framework regroups everything it emits by the numeric count key, then passes those groups to the next step in a non-deterministic order.
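
As a rough illustration of that regrouping, the sketch below imitates the shuffle between two steps in plain Python. group_by_key is a hypothetical stand-in, not an mrjob API, and the sample pairs are invented; a real runner may deliver the groups in a different order.

```python
from itertools import groupby


def group_by_key(pairs):
    """Imitate the shuffle between steps: collect all values under each key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


# Pairs as reducer_word_sorted would emit them: (count, word)
emitted = [(3, "banana"), (1, "the"), (3, "syllable"), (2, "table")]
groups = group_by_key(emitted)

for count, words in groups:
    # The next reducer is called once per distinct count, not once overall
    print(count, words)
```

The point of the sketch is that the next reducer runs once per distinct count and sees only the words in that group, so no single call ever sees the whole sorted list.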

That being said, your top-10 reducer never uses its word parameter. That parameter is actually an iterable of values, with one group for each count key emitted by the previous reducer.

With reducer_word_sorted removed, sorting_word_syllables returns None for its key. This is fine, because all of the split words then arrive in one giant list. So define a regular function:

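def get_syllable_count_pair(word):
  # syllables() is not defined in this answer; it stands for the counting
  # logic from sorting_word_syllables moved into a regular function
  return (syllables(word), word)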
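
The answer leaves syllables() undefined. One possible version, assuming it simply wraps the heuristic from sorting_word_syllables in the question, is sketched below. It makes one deliberate change, which is an assumption about the intent: the question tests `word in vowels` (a substring test against 'aeiouy'), while this sketch checks whether the first letter is a vowel.

```python
VOWELS = "aeiouy"


def syllables(word):
    """Estimate the syllable count with the question's vowel-group heuristic."""
    word = word.lower().strip()
    if not word:
        return 0                      # nothing to count in an empty string
    count = 1 if word[0] in VOWELS else 0
    for i in range(1, len(word)):
        # Count each vowel that starts a new vowel group
        if word[i] in VOWELS and word[i - 1] not in VOWELS:
            count += 1
    if word.endswith("e"):
        count -= 1                    # treat a trailing 'e' as silent
    if word.endswith("le"):
        count += 1                    # but '-le' forms its own syllable
    return max(count, 1)              # every non-empty word gets at least one


def get_syllable_count_pair(word):
    return (syllables(word), word)


print(get_syllable_count_pair("banana"))
print(get_syllable_count_pair("table"))
```

This is only a heuristic; it will miscount plenty of English words, just as the original does.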

Then use it within the reducer:

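def get_top_10_reducer(self, count, word):
  # The key is None for every record; `word` is the iterable of all words
  # (this assumes the previous step yields (None, word))
  assert count is None  # added as a guard
  with_counts = [get_syllable_count_pair(w) for w in word]
  # Sort the (count, word) pairs by syllable count, largest first
  sorted_counts = sorted(with_counts, reverse=True, key=lambda x: x[0])
  # Slice off the first ten; each pair goes out as key=count, value=word
  for t in sorted_counts[:10]:
    yield t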
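
If the word list is large, sorting everything just to keep ten items does more work than needed. As an alternative to the sort-and-slice above (a swapped-in technique, not part of the original answer), the standard library's heapq.nlargest selects the top entries directly. The function name top_n and the sample pairs are made up for illustration.

```python
import heapq


def top_n(count_word_pairs, n=10):
    """Return the n pairs with the highest syllable count, largest first."""
    return heapq.nlargest(n, count_word_pairs, key=lambda pair: pair[0])


pairs = [(1, "the"), (3, "banana"), (2, "table"), (5, "university")]
for count, word in top_n(pairs, n=2):
    print(count, word)
```

Inside the reducer, the same call could replace the sorted(...)[:10] lines, with the pairs built by get_syllable_count_pair as its input.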