How do I get the input file name when running a MapReduce job on EC2?

Posted 2024-12-14 09:02:45


I am learning Elastic MapReduce and started with the Word Splitter example from the Amazon tutorial section (code shown below). The example produces a word count over all the words in all the input documents provided.

But I want word counts by file name, i.e. the count of a word in one particular document. Since the Python word count code takes its input from stdin, how do I tell which input line came from which document?

Thanks.

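#!/usr/bin/python

import sys
import re

# Matches a word: a letter followed by any number of letters or digits.
WORD_PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")

def main(argv):
  # Iterating over stdin stops cleanly at end of input, so no exception
  # handling is needed to detect EOF.
  for line in sys.stdin:
    for word in WORD_PATTERN.findall(line):
      # The "LongValueSum:" prefix asks Hadoop's aggregate reducer to sum the values.
      print("LongValueSum:" + word.lower() + "\t" + "1")

if __name__ == "__main__":
  main(sys.argv)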


感情废物 2024-12-21 09:02:45

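In the typical WordCount example, the name of the file the mapper is processing is ignored, because the job output is a consolidated word count across all input files rather than a per-file count. To get per-file counts, the mapper has to use the input file name. Hadoop Streaming exposes job configuration parameters to the mapper as environment variables with the dots replaced by underscores, so a Python mapper can read the map.input.file parameter as os.environ["map_input_file"] (newer Hadoop releases name it mapreduce_map_input_file). The full list of task execution environment variables is in the Hadoop documentation.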

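Instead of emitting just the key/value pair <word, 1>, the mapper should also include the name of the input file being processed. For example, it can emit <input.txt, <word, 1>>, where input.txt is the key and <word, 1> is the value.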
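A minimal sketch of such a mapper, assuming it runs under Hadoop Streaming. The helper names `input_file_name` and `map_lines`, the `"unknown"` fallback, and the tab-separated `file<TAB>word<TAB>1` record layout are illustrative choices, not part of the Amazon sample. Hadoop Streaming treats everything up to the first tab as the key by default, so the file name becomes the key here.

```python
#!/usr/bin/env python
import os
import re
import sys

# Same word pattern as the sample mapper: a letter, then letters or digits.
WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def input_file_name(environ=os.environ):
    """Return the base name of the file this map task is reading.

    Hadoop Streaming exports job parameters as environment variables with
    dots replaced by underscores. Both the newer and the older variable
    names are checked; "unknown" is a hypothetical fallback for local runs.
    """
    path = (environ.get("mapreduce_map_input_file")
            or environ.get("map_input_file")
            or "unknown")
    return os.path.basename(path)


def map_lines(lines, file_name):
    """Yield one 'file<TAB>word<TAB>1' record per word occurrence."""
    for line in lines:
        for word in WORD_PATTERN.findall(line):
            yield "%s\t%s\t1" % (file_name, word.lower())


if __name__ == "__main__":
    name = input_file_name()
    for record in map_lines(sys.stdin, name):
        print(record)
```

To try it outside Hadoop, set the variable by hand, for example `map_input_file=input.txt python mapper.py < input.txt` (the script name `mapper.py` is assumed).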

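With the file name as the key, all the word counts for a particular file are processed by a single reducer, which must then aggregate the counts for that file.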
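The aggregation step could look roughly like the following streaming reducer. It assumes the mapper emits tab-separated records of the form `file<TAB>word<TAB>count` and that Hadoop delivers them grouped by file name. The function names `reduce_lines` and `format_counts` are hypothetical, and this is a sketch rather than the aggregate reducer used by the Amazon sample.

```python
#!/usr/bin/env python
import sys
from collections import Counter


def format_counts(file_name, counts):
    """Format the totals of one file as 'file<TAB>word<TAB>total' lines."""
    return ["%s\t%s\t%d" % (file_name, word, counts[word])
            for word in sorted(counts)]


def reduce_lines(lines):
    """Sum the counts per word for each file.

    Records for one file are assumed to arrive together (Hadoop sorts by
    key), but the words within a file may arrive in any order, so the
    totals are kept in a Counter until the file name changes.
    """
    current_file = None
    counts = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records
        file_name, word, count = parts
        if current_file is not None and file_name != current_file:
            for record in format_counts(current_file, counts):
                yield record
            counts = Counter()
        current_file = file_name
        counts[word] += int(count)
    # Emit the totals of the last file.
    if current_file is not None:
        for record in format_counts(current_file, counts):
            yield record


if __name__ == "__main__":
    for record in reduce_lines(sys.stdin):
        print(record)
```

Because the output records have the same layout as the input records, the same script could in principle also be passed as the combiner, though that is worth verifying on the Hadoop version in use.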

As usual, a combiner helps reduce the network traffic between the mappers and the reducers and lets the job finish faster.

See Data-Intensive Text Processing with MapReduce for more text-processing algorithms.