How do I eliminate duplicate filenames in Hadoop MapReduce?
I want to eliminate duplicate filenames in the output of my Hadoop MapReduce inverted-index program. For example, the output is currently like
things : doc1,doc1,doc1,doc2
but I want it to be
things : doc1,doc2
Answers (2)
Well, you want to remove duplicates that were mapped, i.e. you want to reduce the intermediate value list to an output list with no duplicates. My best bet would be to simply convert the Iterator<Text> in the reduce() method to a Java Set and iterate over that instead.
Unfortunately I do not know of any better (more concise) way of converting an Iterator to a Set. This should have a smaller time complexity than orange's solution, but a higher memory consumption.
@Edit: a bit shorter: contains() should be (just like add()) constant time, so it should be O(n) now.
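The shorter, single-pass variant can be sketched like this (same assumed helper shape as above; Set.add() returns false for a duplicate, so the membership check and the insert collapse into one O(1) call):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class DedupJoinShort {
    // Single pass over the values: add() returns false when the
    // element was already seen, and both the check and the insert
    // are O(1), so the whole loop is O(n).
    static String joinUnique(Iterator<String> values) {
        Set<String> seen = new HashSet<>();
        StringBuilder sb = new StringBuilder();
        while (values.hasNext()) {
            String doc = values.next();
            if (seen.add(doc)) {        // false if doc was already seen
                if (sb.length() > 0) {
                    sb.append(",");
                }
                sb.append(doc);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinUnique(
                Arrays.asList("doc1", "doc1", "doc1", "doc2").iterator())); // prints doc1,doc2
    }
}
```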
To do this with the minimal amount of code change, just add an if-statement that checks whether the thing you are about to append is already in toReturn before appending it.
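A sketch of that minimal change, keeping the answer's toReturn string (the surrounding loop and the value type are assumptions, since the original snippet is not shown; the comma padding guards against one document name matching inside another):

```java
import java.util.Arrays;
import java.util.Iterator;

public class ContainsCheck {
    // Minimal-change version: scan the output string built so far
    // before appending. Correct, but each contains() walks the whole
    // string, so the loop is O(n^2) overall.
    static String joinUnique(Iterator<String> values) {
        String toReturn = "";
        while (values.hasNext()) {
            String doc = values.next();
            // Pad with commas so "doc1" cannot falsely match inside "doc12".
            if (!("," + toReturn + ",").contains("," + doc + ",")) {
                if (!toReturn.isEmpty()) {
                    toReturn += ",";
                }
                toReturn += doc;
            }
        }
        return toReturn;
    }

    public static void main(String[] args) {
        System.out.println(joinUnique(
                Arrays.asList("doc1", "doc1", "doc1", "doc2").iterator())); // prints doc1,doc2
    }
}
```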
The above solution is a bit slow, because it has to traverse the entire string every time to see whether that string is already there. Likely the best way to do this is to use a HashSet to collect the items, then combine the values in the HashSet into the final output string.
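That suggestion can be sketched like so (assuming Java 8's String.join is available; a LinkedHashSet is used instead of a plain HashSet so the output additionally keeps first-seen order):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class HashSetJoin {
    // Collect into a Set first, then build the output string once.
    // Set lookups are O(1), so this avoids rescanning the string.
    static String joinUnique(Iterator<String> values) {
        Set<String> docs = new LinkedHashSet<>();
        while (values.hasNext()) {
            docs.add(values.next());
        }
        return String.join(",", docs);
    }

    public static void main(String[] args) {
        System.out.println(joinUnique(
                Arrays.asList("doc1", "doc1", "doc1", "doc2").iterator())); // prints doc1,doc2
    }
}
```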