How would you suggest doing the "join"? Using Hadoop Streaming?

Posted 2024-10-02 10:23:22


I have two files, in the following formats:

field1, field2, field3
field4, field1, field5

A different field number indicates a different meaning.

I want to join the two files using Hadoop Streaming, based on the mutual field (field1 in the example above), so the output will be field1, field2, field3, field4, field5 (other orderings are fine as long as all the fields are present).


Comments (1)

爱要勇敢去追 2024-10-09 10:23:22


Hadoop has a library class called KeyFieldBasedPartitioner: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/KeyFieldBasedPartitioner.html

Passing it as the partitioner when you launch your streaming job lets you break your mapper output into key/value pairs, hash the keys so that matching keys go to the same reducer, and sort the records, values included: http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html#More+Usage+Examples

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D mapreduce.map.output.key.field.separator=. \
-D mapreduce.partition.keypartitioner.options=-k1,2 \
-D mapreduce.job.reduces=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner 

Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are explained at http://hadoop.apache.org/mapreduce/docs/r0.21.0/streaming.html#Customizing+How+Lines+are+Split+into+Key%2FValue+Pairs; essentially, they tell the framework how the fields your mapper outputs define the key/value pairs.
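As a rough illustration, here is a simplified Python stand-in (not the framework's actual code) for how those two settings split each mapper output line:

```python
# Illustrative stand-in for how Hadoop Streaming splits a mapper output
# line, given stream.map.output.field.separator="." and
# stream.num.map.output.key.fields=4 (simplified; not the real framework code).
def split_key_value(line, sep=".", num_key_fields=4):
    parts = line.split(sep)
    key = sep.join(parts[:num_key_fields])    # everything up to the 4th separator
    value = sep.join(parts[num_key_fields:])  # the remainder of the line
    return key, value

print(split_key_value("11.12.1.2.hello"))  # ('11.12.1.2', 'hello')
```

With fewer than four separators on a line, the whole line becomes the key and the value is empty, which matches the streaming documentation's description.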

The map output keys of the above MapReduce job normally have four fields separated by ".". However, the MapReduce framework will partition the map outputs by the first two fields of the keys using the -D mapreduce.partition.keypartitioner.options=-k1,2 option. Here, -D mapreduce.map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.
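A small in-memory sketch of that partition-versus-sort split, using made-up key values (the real framework hashes the primary-key prefix across reducers rather than building a dict):

```python
# Simplified model of primary/secondary keys: the framework sorts on the
# full four-field key but partitions on the first two fields only.
keys = ["1.2.9.9", "1.3.0.0", "1.2.3.4"]

keys.sort()  # sorting uses the whole key (primary + secondary)

partitions = {}
for k in keys:
    primary = tuple(k.split(".")[:2])  # partitioning uses the primary key only
    partitions.setdefault(primary, []).append(k)

print(partitions[("1", "2")])  # ['1.2.3.4', '1.2.9.9'] -- same reducer, in sorted order
```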

Doing a join is then as simple as outputting the fields from your mapper and, at configuration launch, setting the options for the fields that form the key; the reducer will receive all of your values grouped by key and can join them appropriately. If you want to take data from multiple sources, just keep adding more -input arguments on the command line. If the inputs have different record layouts, your mapper can detect which source a record came from and emit a standard output format.
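For the two formats in the question, such a mapper/reducer pair might look like the sketch below. The "a"/"b" source tags and the comma/tab separators are assumptions for illustration; a real streaming mapper could detect its source from the input file name, which the framework exposes through an environment variable such as map_input_file.

```python
# Hedged sketch of a streaming join for the two record formats in the
# question. "a" tags records from the first file (field1, field2, field3)
# and "b" tags records from the second (field4, field1, field5); the tags
# and separators are illustrative assumptions.
def map_line(line, source):
    fields = [f.strip() for f in line.split(",")]
    if source == "a":                      # field1 is the first column
        key, rest = fields[0], fields[1:]
    else:                                  # field1 is the second column
        key, rest = fields[1], [fields[0], fields[2]]
    return "%s\t%s\t%s" % (key, source, ",".join(rest))

def reduce_group(key, tagged):
    """tagged: [(source, rest), ...] for one key, as the reducer sees it."""
    left = [r for s, r in tagged if s == "a"]
    right = [r for s, r in tagged if s == "b"]
    for la in left:                        # inner join: emit every pairing
        for rb in right:
            yield "%s,%s,%s" % (key, la, rb)

print(map_line("field1, field2, field3", "a"))
# field1	a	field2,field3
print(list(reduce_group("field1", [("a", "field2,field3"), ("b", "field4,field5")])))
# ['field1,field2,field3,field4,field5']
```

In a real streaming job the two functions would live in separate mapper and reducer scripts reading stdin line by line, with the framework's sort guaranteeing that all tagged records for one key arrive at the reducer together.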
