How can I make an external reference table or database available to a Hadoop MapReduce job?
I am analyzing a large number of files in a Hadoop MapReduce job; the input files are in .txt format. Both my mapper and my reducer are written in Python.
However, my mapper module requires access to the contents of an external csv-file, which is basically just a large table to look up reference values for a transformation that the mapper is performing.
Up until now, I just had the mapper load the file into memory from a local directory to make it available as a Python variable. Since the file is quite large, though (several thousand rows and columns), it takes a relatively long time to be loaded (about 10 seconds, too long for my purposes). The problem is that Hadoop seems to re-execute the mapper-script for every new input-file or it splits large input files into smaller ones, causing my csv-file to be unnecessarily loaded into memory again and again each time a new input-file is processed.
Is there a way to have Hadoop load the file only once and somehow make it "globally" available? Upon googling names like Hive, Pig, sqlite were popping up, but I never saw any examples to check if these are actually useful for this purpose.
Basically, I would just need some kind of database or dictionary that can be accessed quickly while running my Hadoop job. The format of my reference table doesn't have to be CSV; I am pretty flexible in transforming that data into different formats.
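For reference, this is roughly what my mapper currently does; the file name, paths, and field layout below are just illustrative:

```python
#!/usr/bin/env python
# mapper.py -- current approach (illustrative names): the reference table is
# re-read from a local path every time Hadoop starts a new mapper task.
import csv
import sys

# Loading several thousand rows x columns here takes roughly 10 seconds
# per mapper task, which is the part I want to avoid repeating.
lookup = {}
with open('/local/path/reference.csv') as f:
    for row in csv.reader(f):
        lookup[row[0]] = row[1:]          # key -> reference values

for line in sys.stdin:
    key = line.strip().split('\t')[0]
    ref = lookup.get(key)
    if ref is not None:
        # emit the transformed record
        print('%s\t%s' % (key, ','.join(ref)))
```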
2 Answers
Yes, look into the -files option to your hadoop streaming command line. That will take a file you have loaded into HDFS, cache one copy of it locally on each tasktracker node, and create a softlink to it in each mapper and reducer task's CWD.
There is also the -archives option if you have jars that you want to bundle with your job.
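As a rough sketch (the jar path, HDFS paths, and file names below are placeholders, not taken from your setup), the job could be launched with -files, and the mapper would then open the cached file by its bare name from the working directory:

```python
# Illustrative sketch only -- the jar path, HDFS paths, and file names are
# placeholders. The streaming job might be launched like:
#
#   hadoop jar hadoop-streaming.jar \
#       -files hdfs:///user/me/reference.csv,mapper.py,reducer.py \
#       -input /user/me/input -output /user/me/output \
#       -mapper mapper.py -reducer reducer.py
#
# -files ships reference.csv to the distributed cache once per node and
# symlinks it into each task's working directory, so the mapper can simply
# open it by name instead of reading it from an absolute local path:
import csv

lookup = {}
with open('reference.csv') as f:   # softlink in the task's CWD
    for row in csv.reader(f):
        lookup[row[0]] = row[1:]
```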
You should probably take a look at Sqoop. It imports your data from a database into HDFS so that you can process it with MapReduce.
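A hypothetical import could look like the following; the connection string, credentials, table name, and target directory are all placeholders:

```sh
# Illustrative only -- connection string, credentials, table name and
# target directory are placeholders.
sqoop import \
    --connect jdbc:mysql://dbhost/refdb \
    --username me -P \
    --table reference_table \
    --target-dir /user/me/reference
```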