How do I get the filename / file contents as key/value input for MAP when running a Hadoop MapReduce job?
I am creating a program to analyze PDF, DOC and DOCX files. These files are stored in HDFS.
When I start my MapReduce job, I want the map function to receive the filename as the key and the binary contents as the value. I then want to create a stream reader that I can pass to the PDF parser library. How can I make the key/value pair for the map phase be filename/file contents?
I am using Hadoop 0.20.2
This is older code that starts a job:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PdfReader.class);
    conf.setJobName("pdfreader");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
I know there are other InputFormat types, but is there one that does exactly what I want? I find the documentation quite vague. If one is available, what should the Map function's input types be?
Thanks in advance!
3 Answers
The solution is to create your own FileInputFormat class that does this.
You have access to the name of the input file from the FileSplit that this FileInputFormat receives (getPath).
Be sure to override isSplitable in your FileInputFormat so that it always returns false.
You will also need a custom RecordReader that returns the entire file as a single "record" value.
Be careful when handling files that are too big: you will effectively load the entire file into RAM, and the default setting for a task tracker is to have only 200 MB of RAM available.
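A minimal sketch of what this could look like with the old mapred API that the question's driver uses (untested; the class name and structure are illustrative, not code from this answer):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Never split: each file is handled by exactly one map task.
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        // Emits exactly one record: (filename, raw file bytes).
        public boolean next(Text key, BytesWritable value) throws IOException {
            if (processed) {
                return false;
            }
            Path path = split.getPath();
            key.set(path.getName());
            // Caution: this buffers the whole file in memory (see the RAM warning above).
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = path.getFileSystem(job);
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        public Text createKey() { return new Text(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() throws IOException { }
    }
}

The mapper would then be declared as Mapper<Text, BytesWritable, ...>, and the driver would call conf.setInputFormat(WholeFileInputFormat.class) instead of using TextInputFormat.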
As an alternative to your approach, maybe add the binary files to HDFS directly. Then, create an input file that contains the DFS paths of all the binary files. This could be done dynamically using Hadoop's FileSystem class. Lastly, create a mapper that processes the input by opening input streams, again using FileSystem.
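A rough sketch of such a mapper (hypothetical names; it assumes the job input is a plain text file with one HDFS path per line, so TextInputFormat delivers each path as a value):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PathListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf job;

    public void configure(JobConf job) {
        this.job = job; // keep the config so we can open the FileSystem later
    }

    public void map(LongWritable offset, Text pathLine,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        Path path = new Path(pathLine.toString().trim());
        FileSystem fs = path.getFileSystem(job);
        FSDataInputStream in = fs.open(path); // the stream to hand to the PDF parser
        try {
            // ... parse 'in' with the PDF/DOC library here ...
            // output.collect(new Text(path.getName()), new Text(extractedText));
        } finally {
            in.close();
        }
    }
}

One trade-off of this scheme: the map tasks open the files over the network rather than reading local splits, so data locality is lost.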
You can use WholeFileInputFormat (https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3)
In the mapper, you can get the name of the file like this:
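The snippet itself did not survive here; with the new mapreduce API that the linked WholeFileInputFormat is written against, the usual idiom is to cast the input split (the NullWritable/BytesWritable key/value types below are an assumption about that class):

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PdfMapper extends Mapper<NullWritable, BytesWritable, Text, Text> {
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Recover the current input file's name from the split.
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        // ... parse value.getBytes() and write results keyed by fileName ...
    }
}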