MapReduce value list ordering problem
As we know, Hadoop groups values by key and sends them to the same reduce task.
Suppose I have the following lines in a file on HDFS:
line1
line2
line3
....
linen
In the map task I emit the filename and the line.
In the reducer I receive the values in a different order, for example key => { line3, line1, line2, ... }.
Now I have the following problem: I want to get this value list in the same order as the lines lie in the file,
i.e. key => { line1, line2, ... linen }.
Is there any way of doing this?
1 Answer
If you are using TextInputFormat, you get a <LongWritable, Text> pair as mapper input. The LongWritable part (the key) is the position of the line in the file (not the line number, but, I think, the byte offset from the start of the file). You can use that part to keep track of which line came first. For example, the mapper can output <Filename, TextPair(Position, Line)> instead of <Filename, Line> as you are doing now. Then you can sort the values the reducer receives by the first part of the pair (the position), and you should get back the lines in the same order.
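Below is a minimal sketch of that approach, assuming Hadoop's newer org.apache.hadoop.mapreduce API. Instead of a custom TextPair writable, it packs the byte offset and the line into a single tab-separated Text value; the class names (OrderedLines, OrderedLinesMapper, OrderedLinesReducer) are illustrative, not from the original post.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class OrderedLines {

        // Mapper: TextInputFormat hands us <byte offset, line>.
        // Emit <filename, "offset\tline"> so the reducer can restore file order.
        public static class OrderedLinesMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
                context.write(new Text(fileName), new Text(offset.get() + "\t" + line));
            }
        }

        // Reducer: buffer the (offset, line) pairs for one file, sort by offset,
        // then emit the lines in their original order.
        public static class OrderedLinesReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text fileName, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                List<String[]> pairs = new ArrayList<>();
                for (Text v : values) {
                    pairs.add(v.toString().split("\t", 2)); // [offset, line]
                }
                pairs.sort((a, b) -> Long.compare(Long.parseLong(a[0]), Long.parseLong(b[0])));
                for (String[] p : pairs) {
                    context.write(fileName, new Text(p[1])); // lines back in file order
                }
            }
        }
    }

Note that this buffers all of a file's lines in the reducer before sorting, which is fine for moderate file sizes. For very large files you would typically push the ordering into the framework with a secondary sort (a composite key plus a custom partitioner and grouping comparator) so the values already arrive at the reducer sorted by offset.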