How to get the file name/file contents as the key/value input for map when running a Hadoop MapReduce job?

Posted on 2024-11-02 09:39:11

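I am writing a program to analyze PDF, DOC, and DOCX files. These files are stored in HDFS.

When I start my MapReduce job, I want the map function to receive the file name as the key and the file's binary contents as the value. I then want to create a stream reader that I can pass to the PDF parser library. How can I make the key/value pair for the map phase be file name/file contents?

I am using Hadoop 0.20.2.

This is the older code that starts the job: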

public static void main(String[] args) throws Exception {
 JobConf conf = new JobConf(PdfReader.class);
 conf.setJobName("pdfreader");

 conf.setOutputKeyClass(Text.class);
 conf.setOutputValueClass(IntWritable.class);

 conf.setMapperClass(Map.class);
 conf.setReducerClass(Reduce.class);

 conf.setInputFormat(TextInputFormat.class);
 conf.setOutputFormat(TextOutputFormat.class);

 FileInputFormat.setInputPaths(conf, new Path(args[0]));
 FileOutputFormat.setOutputPath(conf, new Path(args[1]));

 JobClient.runJob(conf);
}

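I know there are other input format types, but is there one that does exactly what I want? I find the documentation quite vague. If one is available, what should the input types of the map function look like?

Thanks in advance!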

我只土不豪 2024-11-09 09:39:11

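The solution is to create your own FileInputFormat subclass that does this.
You can get the name of the input file from the FileSplit that this FileInputFormat receives (getPath).
Be sure to override isSplitable in your FileInputFormat so that it always returns false.

You will also need a custom RecordReader that returns the entire file as a single "record" value.

Be careful with files that are too big. You will effectively load the entire file into RAM, and by default each task's child JVM has only about 200 MB of heap available (mapred.child.java.opts defaults to -Xmx200m).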
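
As a rough illustration of the two classes described above, here is a minimal sketch against the old org.apache.hadoop.mapred API that the question's JobConf code uses. The class names WholeFileInputFormat and WholeFileRecordReader are illustrative and not part of Hadoop 0.20.2, and using the full path (rather than only the file name) as the key is an assumption made here to avoid collisions between directories.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical input format: one record per file, key = path, value = bytes.
class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split, so each map call sees a complete file
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

// Hypothetical record reader that emits the whole file as a single record.
class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
        this.split = split;
        this.job = job;
    }

    @Override
    public Text createKey() { return new Text(); }

    @Override
    public BytesWritable createValue() { return new BytesWritable(); }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (done) {
            return false; // the single record has already been returned
        }
        Path path = split.getPath();
        // Assumes the file fits in a Java array (< 2 GB) and in the task heap.
        byte[] contents = new byte[(int) split.getLength()];
        FileSystem fs = path.getFileSystem(job);
        FSDataInputStream in = fs.open(path);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        key.set(path.toString());
        value.set(contents, 0, contents.length);
        done = true;
        return true;
    }

    @Override
    public long getPos() { return done ? split.getLength() : 0; }

    @Override
    public float getProgress() { return done ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}
```

With something like this in place, the driver would call conf.setInputFormat(WholeFileInputFormat.class) instead of TextInputFormat, and the mapper would implement Mapper<Text, BytesWritable, Text, IntWritable>. In a real project the two classes would normally be public and live in their own files.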

眼前雾蒙蒙 2024-11-09 09:39:11

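As an alternative to your approach, you could add the binary files to HDFS directly. Then create an input file that lists the DFS paths of all the binary files; this can be done using Hadoop's FileSystem class. Finally, write a mapper that processes each line of that input by opening an input stream for the listed path, again using FileSystem.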
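
The mapper side of this approach could look roughly like the sketch below, written against the old org.apache.hadoop.mapred API and the TextInputFormat already configured in the question. The class name PathListMapper is hypothetical, and the byte-counting loop is only a stand-in for handing the stream to a PDF/DOC parser. It assumes the listing file contains one HDFS path per line.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: each input line is the path of one binary file.
class PathListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private JobConf job;

    @Override
    public void configure(JobConf job) {
        this.job = job; // kept so map() can resolve the file system
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String entry = line.toString().trim();
        if (entry.isEmpty()) {
            return; // skip blank lines in the listing file
        }
        Path path = new Path(entry);
        FileSystem fs = path.getFileSystem(job);
        FSDataInputStream in = fs.open(path);
        try {
            // Stand-in for the real work: a parser would consume `in` here.
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
                reporter.progress(); // tell the framework the task is alive
            }
            output.collect(new Text(path.getName()),
                    new IntWritable((int) Math.min(total, Integer.MAX_VALUE)));
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```

One trade-off to keep in mind: map tasks are scheduled near the blocks of the listing file, not near the binary files they later open, so this design gives up data locality in exchange for streaming each file instead of holding it in memory.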

淡水深流 2024-11-09 09:39:11

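You can use WholeFileInputFormat (https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3)

In the mapper, you can get the name of the file like this:

public void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    // FileSplit here is org.apache.hadoop.mapreduce.lib.input.FileSplit (new API).
    // With an unsplittable input format, the split is the whole file.
    Path filePath = ((FileSplit) context.getInputSplit()).getPath();
    String fileNameString = filePath.getName();

    // getBytes() returns the backing array, which may be longer than the data;
    // only the first getLength() bytes are valid, so copy exactly that many.
    byte[] fileContent = java.util.Arrays.copyOf(value.getBytes(), value.getLength());
}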
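
Since the question asks for a stream that can be passed to a parser library, the following is a small sketch of wrapping the record's bytes in an InputStream without copying them. The helper class ContentStreams and its methods are hypothetical names; the point it illustrates is that the array returned by BytesWritable.getBytes() can be padded, so the stream should be bounded by getLength().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for turning a (possibly padded) buffer into a stream.
class ContentStreams {

    // Wraps only the first validLength bytes of backing; no copy is made.
    static InputStream toStream(byte[] backing, int validLength) {
        return new ByteArrayInputStream(backing, 0, validLength);
    }

    // Drains a stream and returns how many bytes it produced.
    static int countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }
}
```

Inside the mapper this would be used as ContentStreams.toStream(value.getBytes(), value.getLength()), and the resulting stream handed to whatever parser is in use.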
