How to get the longest word in mrjob
I'm trying to find the longest word in a text file for each starting letter, a through z.
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")

    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield word[0].lower(), 1

        def combiner(self, word, counts):
            yield word, sum(counts)

        def reducer(self, _, word_count_pairs):
            longest_word = ''
            for word in word_count_pairs:
                if len(word) > len(longest_word):
                    longest_word = word
            yield max(longest_word)

    if __name__ == '__main__':
        MRWordFreqCount.run()
The output should be something like this, but I'm getting stuck here:
"r" ["recommendations", "representations"]
"s" ["superciliousness"]
Comments (1)
Your mapper currently outputs only the first character of each word, paired with a count of 1, so the word itself never reaches the reducer.
Your combiner then counts how many words start with that letter... That's not going to help find the longest whole word.
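Part of the problem is max(): it returns only one value, so it won't help find longest words that are all the same length. It isn't limited to numbers, either: called on a string, as in max(longest_word), it returns that string's highest-sorting character rather than a word.
If you don't care about the leading letters, then MapReduce isn't really beneficial, since you would need to force all words into one reducer, for example by emitting every word under the same key. That approach is also not recommended for very large files.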
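To make the point about max() concrete, here is a small plain-Python sketch (no mrjob involved). The helper name `longest_words` and the sample word list are assumptions made up for illustration; the idea is simply to find the maximum length first and then keep every word that reaches it.

```python
def longest_words(words):
    """Return every distinct word that ties for the maximum length, sorted."""
    unique = set(words)
    if not unique:
        return []
    max_len = max(len(w) for w in unique)
    return sorted(w for w in unique if len(w) == max_len)


# Hypothetical sample: two 15-letter words tie for longest.
words = ["recommendations", "rain", "representations", "river"]

print(max("rain"))           # a single character, not a word
print(max(words, key=len))   # one word only, even though two tie
print(longest_words(words))  # both tied words
```

The same helper works unchanged inside a reducer, because a reducer's values arrive as an ordinary iterable.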
The alternative strategy is to find the longest words per letter first; then, if you also want the overall longest word, pass that output to a second MapReduce job.