Hadoop Streaming and Amazon EMR
I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count on a bunch of text files. To get a handle on Hadoop streaming and on Amazon's EMR, I used a very simplified data set as well. Each text file contained only one line of text (the line could contain an arbitrarily large number of words).
The mapper is an R script that splits the line into words and spits them back to the stream:
cat(wordList[i],"\t1\n")
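For comparison, the same mapper logic can be sketched as a streaming mapper in Python (purely illustrative; the actual mapper here is the R script above):

```python
#!/usr/bin/env python
# Sketch of a Hadoop streaming word-count mapper: read lines from stdin
# and emit one "word<TAB>1" record per word. Illustrative only -- the
# original mapper in this question is an R script.
import sys

def map_words(lines):
    records = []
    for line in lines:
        for word in line.split():
            records.append("%s\t1" % word)
    return records

if __name__ == "__main__":
    for record in map_words(sys.stdin):
        print(record)
```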
I decided to use the LongValueSum Aggregate reducer for adding the counts together, so I had to prefix my mapper output with "LongValueSum:":
cat("LongValueSum:",wordList[i],"\t1\n")
and specify the reducer to be "aggregate"
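For reference, with the plain Hadoop streaming jar this combination is typically wired up on the command line roughly as follows (the jar path, input/output locations, and script name here are assumptions; on EMR the equivalent arguments go into the streaming step definition):

```shell
# Sketch of a Hadoop streaming invocation using the built-in "aggregate"
# reducer. The jar location varies by distribution/version, and the
# input/output paths and mapper script name are placeholders.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/me/wordcount/input \
    -output /user/me/wordcount/output \
    -mapper mapper.R \
    -reducer aggregate \
    -file mapper.R
```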
The questions I have now are the following:

1. The intermediate stage between the mapper and reducer just sorts the stream; it does not really combine by key. Am I right? I ask because if I do not use "LongValueSum:" as a prefix to the words output by the mapper, at the reducer I just receive the stream sorted by key, but not aggregated. That is, I receive pairs ordered by K, as opposed to (K, list(values)), at the reducer. Do I need to specify a combiner in my command?

2. How are other aggregate reducers used? I see a lot of other reducers/aggregators/combiners available at http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

3. How are these combiners and reducers specified in an Amazon EMR setup? I believe an issue of this kind has been filed and fixed in Hadoop streaming for a combiner, but I am not sure which version Amazon EMR is hosting, and in which version that fix is available.

4. How about custom input formats and record readers and writers? There are a bunch of libraries written in Java. Is it sufficient to specify the Java class name for each of these options?
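To illustrate the first question: a streaming reducer sees one sorted (key, value) line at a time on stdin, so without the "aggregate" reducer it has to do the grouping itself. A minimal sketch of such a manual word-count reducer (in Python, purely illustrative) might look like this:

```python
# Sketch of a streaming-style reducer: consumes "word<TAB>count" lines
# already sorted by key, and sums the counts per word itself, since
# streaming delivers flat sorted K,V pairs rather than (K, list(values)).
import sys

def reduce_sorted(lines):
    totals = []
    current_key, current_sum = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                totals.append((current_key, current_sum))
            current_key, current_sum = key, 0
        current_sum += int(value)
    if current_key is not None:
        totals.append((current_key, current_sum))
    return totals

if __name__ == "__main__":
    for word, count in reduce_sorted(sys.stdin):
        print("%s\t%d" % (word, count))
```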
Comments (1)
The "aggregate" reducer in streaming does implement the relevant combiner interfaces, so Hadoop will use it if it sees fit [1]. With the streaming interface you always receive (K, V) pairs; you will never receive (K, list(values)).

Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each.

What issue are you thinking of?

EMR is based on Hadoop 0.20.2.

Do you mean in the context of streaming, or the aggregate framework?

[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html