hadoop: supporting multiple outputs for a MapReduce job
It seems this is supported in Hadoop (reference), but I don't know how to use it.

I want to:

a.) Map - read a huge XML file, load the relevant data, and pass it on to reduce
b.) Reduce - write two .sql files for different tables

I am choosing map/reduce because I have to do this for over 100k (maybe many more) XML files residing on disk. Any better suggestions are welcome.

Any resources/tutorials explaining how to use this would be appreciated. I am using Python and would like to learn how to achieve this using streaming.

Thank you
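
For concreteness, a minimal sketch of what such a streaming mapper might look like, assuming Hadoop streaming's StreamXmlRecordReader is used to hand the mapper one XML record per input line; the record tag and the element names ("record", "name", "value") are placeholders, not from the original question:

    #!/usr/bin/env python
    # Hypothetical mapper: assumes StreamXmlRecordReader delivers one
    # <record>...</record> fragment per stdin line. Element names are
    # placeholders for the real schema.
    import sys
    import xml.etree.ElementTree as ET

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            record = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip fragments that are not well-formed on their own
        # Emit tab-separated key/value pairs for the reducer to consume.
        print("%s\t%s" % (record.findtext("name", ""), record.findtext("value", "")))

    # A possible submission command (paths are illustrative):
    #   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    #     -input /data/xml -output /data/out \
    #     -mapper mapper.py -reducer reducer.py \
    #     -file mapper.py -file reducer.py \
    #     -inputreader "StreamXmlRecord,begin=<record>,end=</record>"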
Comments (1)
It might not be an elegant solution, but you could create two templates to convert the output of the reduce tasks into the required format once the job is complete. Much of this can be automated with a shell script that looks for the reduce outputs and applies the templates to them. With the shell script, though, the conversion runs sequentially and does not take advantage of the n machines in the cluster.
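
A sketch of that post-processing step, written in Python here for consistency with the rest of the thread rather than as a shell script; the output directory, the tab-separated record layout, and the two INSERT templates are all assumptions:

    #!/usr/bin/env python
    # Hypothetical post-job conversion: walk the reduce outputs and render
    # each record through one template per target table. Assumes the job
    # output was copied locally (e.g. with "hadoop fs -get") and that each
    # line holds "value_for_table_a<TAB>value_for_table_b".
    import glob
    import os

    OUTPUT_DIR = "job_output"  # placeholder path
    TEMPLATE_A = "INSERT INTO table_a VALUES ('%s');\n"
    TEMPLATE_B = "INSERT INTO table_b VALUES ('%s');\n"

    with open("table_a.sql", "w") as sql_a, open("table_b.sql", "w") as sql_b:
        # Reduce outputs land in part-00000, part-00001, ...
        for part in sorted(glob.glob(os.path.join(OUTPUT_DIR, "part-*"))):
            with open(part) as f:
                for line in f:
                    value_a, value_b = line.rstrip("\n").split("\t", 1)
                    sql_a.write(TEMPLATE_A % value_a)
                    sql_b.write(TEMPLATE_B % value_b)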
Alternatively, in the reduce tasks you could write both output formats into a single file with some delimiter, and split them on that delimiter later. In this approach, since the conversion happens in the reduce, it is spread across all the nodes in the cluster.
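
A sketch of this delimiter approach: the reducer tags every line with the table it belongs to, and a small follow-up script splits the combined output on that tag. The tag strings, field layout, and SQL statements below are illustrative only:

    #!/usr/bin/env python
    # Hypothetical reducer: prefix each output line with a tag naming its
    # target table, so both formats can share one output file.
    import sys

    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        print("TABLE_A|INSERT INTO table_a VALUES ('%s');" % key)
        print("TABLE_B|INSERT INTO table_b VALUES ('%s');" % value)

Splitting the combined output afterwards is then a single pass over the part files:

    #!/usr/bin/env python
    # Split the tagged reduce output into the two .sql files, using the
    # "TABLE_A|" / "TABLE_B|" prefix as the delimiter.
    import glob

    with open("table_a.sql", "w") as a, open("table_b.sql", "w") as b:
        for part in sorted(glob.glob("job_output/part-*")):  # placeholder path
            with open(part) as f:
                for line in f:
                    tag, _, stmt = line.partition("|")
                    (a if tag == "TABLE_A" else b).write(stmt)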