Hadoop Streaming and Amazon EMR

Published 2024-09-29 04:26:48


I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count for a bunch of text files. To get a handle on Hadoop streaming and on Amazon's EMR, I also used a very simplified data set. Each text file had only one line of text in it (the line could contain an arbitrarily large number of words).

The mapper is an R script that splits the line into words and spits them back to the stream.

cat(wordList[i],"\t1\n")

I decided to use the LongValueSum Aggregate reducer for adding the counts together, so I had to prefix my mapper output with LongValueSum:

cat("LongValueSum:",wordList[i],"\t1\n")

and specify the reducer to be "aggregate"
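For illustration, here is a shell stand-in for that R mapper (the R version above is the real one; this pipeline just makes the expected output shape concrete — one `LongValueSum:<word>\t1` line per word):

```shell
# Stand-in for the R mapper: split one line of input into words and emit
# "LongValueSum:<word>\t1" per word, the shape the aggregate reducer expects.
echo "foo bar foo" \
  | tr -s ' ' '\n' \
  | awk '{ printf "LongValueSum:%s\t1\n", $0 }'
```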

The questions I have now are the following:

  1. The intermediate stage between the mapper and the reducer just sorts the stream; it does not really combine by key. Am I right? I ask because if I do not use "LongValueSum" as a prefix on the words output by the mapper, at the reducer I just receive the stream sorted by key, but not aggregated. That is, I receive pairs ordered by K, as opposed to (K, list(values)), at the reducer. Do I need to specify a combiner in my command?
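That sorting-but-not-combining behaviour can be sketched outside Hadoop; the pipeline below is only a stand-in for the shuffle (a plain sort), followed by the kind of summing a combiner or reducer would have to do:

```shell
# Stand-in for the shuffle: sort only, duplicate keys survive as separate
# lines. Both "foo" records reach the reducer individually.
printf 'foo\t1\nbar\t1\nfoo\t1\n' | sort

# The summing itself has to happen in a combiner/reducer, e.g.:
printf 'foo\t1\nbar\t1\nfoo\t1\n' | sort \
  | awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }'
```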

  2. How are the other aggregate reducers used? I see a lot of other reducers/aggregates/combiners available at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

How are these combiners and reducers specified in an Amazon EMR setup?
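For context, a hedged sketch of how a streaming step is typically wired up (the jar path and S3 URIs below are illustrative placeholders, not EMR-specific gospel; EMR ultimately passes step arguments like these through to the streaming jar):

```shell
# Illustrative only: jar location and S3 paths are placeholders.
# "aggregate" as the -reducer selects the built-in aggregate framework,
# which interprets the LongValueSum: prefixes emitted by the mapper.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -input   s3://my-bucket/input \
  -output  s3://my-bucket/output \
  -mapper  mapper.R \
  -reducer aggregate \
  -file    mapper.R
```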

I believe an issue of this kind has been filed and fixed in Hadoop streaming for a combiner, but I am not sure which version Amazon EMR is hosting, or in which version that fix is available.

  3. What about custom input formats and record readers and writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
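On the custom-format question, Hadoop streaming does take fully qualified Java class names directly; a sketch with hypothetical class and jar names (com.example.* and myformats.jar are placeholders):

```shell
# Hypothetical placeholders: myformats.jar and the com.example.* classes.
# Custom formats are referenced by fully qualified class name, with the
# containing jar shipped to the cluster via -libjars.
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -libjars myformats.jar \
  -inputformat  com.example.MyInputFormat \
  -outputformat com.example.MyOutputFormat \
  -input   s3://my-bucket/input \
  -output  s3://my-bucket/output \
  -mapper  mapper.R \
  -reducer aggregate \
  -file    mapper.R
```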


Comments (1)


The intermediate stage between mapper and reducer just sorts the stream. It does not really combine by the keys. Am I right?

The aggregate reducer in streaming does implement the relevant combiner interfaces, so Hadoop will use it if it sees fit [1].

That is, I just receive pairs ordered by K, as opposed to (K, list(values)), at the reducer.

With the streaming interface you always receive (K, V) pairs one at a time; you'll never receive (K, list(values)).
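A minimal sketch (not EMR-specific) of what that means for a streaming reducer: it reads sorted `K\tV` lines one by one and has to notice key boundaries itself:

```shell
# Simulated reducer input: already sorted by key, one (K, V) line each.
# The awk script keeps a running total and flushes it when the key changes,
# which is exactly the grouping a streaming reducer must do by hand.
printf 'bar\t1\nfoo\t1\nfoo\t1\n' \
  | awk -F'\t' '
      $1 != prev { if (NR > 1) print prev "\t" total; prev = $1; total = 0 }
      { total += $2 }
      END { if (NR > 0) print prev "\t" total }'
```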

How are other aggregate reducers used?

Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each.

I believe an issue of this kind has been filed and fixed

What issue are you thinking of?

not sure what version Amazon EMR is hosting

EMR is based on Hadoop 0.20.2

Is it sufficient to specify the java class name for each of these options?

Do you mean in the context of streaming? or the aggregate framework?

[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
