How to get the longest word in mrjob

Posted on 2025-01-24 21:16:35

I'm trying to find the longest word(s) in a text file for each starting letter, a through z.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word[0].lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, _, word_count_pairs):
        longest_word = ''
        for word in word_count_pairs:
            if len(word) > len (longest_word):
                longest_word = word
        yield max(longest_word)

if __name__ == '__main__':
    MRWordFreqCount.run()

The output should look something like this, but I'm stuck here:

"r" ["recommendations", "representations"]

"s" ["superciliousness"]

ㄟ。诗瑗 2025-01-31 21:16:35

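Your mapper currently emits only the first character of each word (paired with a count of 1), so the word itself never reaches the reducer.

Your combiner then counts how many words start with that letter. That's not going to help find the longest whole word, because the reducer receives counts rather than words.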

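Part of the problem is that max() returns only one value, so it won't help find several longest words that all have the same length. It also isn't limited to numbers: called on a string, as in your reducer, it returns the largest character rather than the longest word.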
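To make the max() point concrete, here is a small sketch of how it behaves on strings and on a list of words. The sample words are taken from the expected output in the question; the variable names are only illustrative.

```python
words = ["recommendations", "red", "representations"]

# max() on a single string compares its characters, not word lengths.
largest_char = max("recommendations")

# With key=len, max() returns only the first word that has the maximum length.
first_longest = max(words, key=len)

# To keep every tie, compute the maximum length first, then filter on it.
max_len = max(len(w) for w in words)
all_longest = [w for w in words if len(w) == max_len]

print(largest_char)
print(first_longest)
print(all_longest)
```

The last two lines of the computation are the pattern a reducer needs when several words share the maximum length.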

If you don't care about the leading letters, then MapReduce isn't really beneficial, since you would need to force all words into a single reducer, as in the example below. This approach is also not recommended for very large files, because the reducer holds every word in memory.

def mapper(self, _, line):
    for word in WORD_RE.findall(line):
        # one shared key sends every word to the same reducer
        yield None, word

def reducer(self, _, words):
    lst = list(words)  # copy the one-shot iterator into an in-memory list
    max_len = max(len(w) for w in lst)
    max_words = [w for w in lst if len(w) == max_len]  # keep all ties
    yield None, max_words

The alternative to the above is to find the longest words per letter first; then, if you want the overall maximum, pass that output to a second MapReduce job.
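A minimal sketch of the per-letter strategy follows, written as plain Python functions so it can be tried without a cluster. The names map_line, longest, run_locally and overall_longest are hypothetical helpers, not part of mrjob, and run_locally only imitates the grouping that the framework does between the map and reduce steps. Lowercasing the words and removing duplicates are assumptions too. Words that start with a digit, underscore or apostrophe get their own key under this regex.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")


def map_line(line):
    """Mapper body: emit (first letter, whole word) for each word in the line."""
    for word in WORD_RE.findall(line):
        word = word.lower()
        yield word[0], word


def longest(words):
    """Reducer body: return every distinct word that ties for the max length."""
    best_len, best = 0, set()
    for w in words:
        if len(w) > best_len:
            best_len, best = len(w), {w}   # new maximum, discard shorter words
        elif len(w) == best_len:
            best.add(w)                    # tie, keep it as well
    return sorted(best)


def run_locally(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for letter, word in map_line(line):
            groups[letter].append(word)
    return {letter: longest(ws) for letter, ws in sorted(groups.items())}


def overall_longest(per_letter):
    """Second stage: reduce the per-letter winners to the overall longest."""
    return longest(w for ws in per_letter.values() for w in ws)


sample = ["recommendations and representations", "superciliousness some red"]
per_letter = run_locally(sample)
overall = overall_longest(per_letter)
print(per_letter)
print(overall)
```

In an actual MRJob subclass, the mapper would yield the pairs from map_line(line) and the reducer would yield letter, longest(words). A combiner could reuse longest too, but it would have to yield each surviving word separately under the same letter, so that its output has the same shape as the mapper's.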
