Parsing PDF files in Hadoop MapReduce
I have to parse PDF files that are stored in HDFS from a MapReduce program in Hadoop. I get each PDF file from HDFS as input splits, and it has to be parsed and sent to the Mapper class. To implement this InputFormat I went through this link. How can these input splits be parsed and converted into text format?
Comments (2)
Processing PDF files in Hadoop can be done by extending the FileInputFormat class. Call the subclass WholeFileInputFormat. In it, override the getRecordReader() method (createRecordReader() in the newer org.apache.hadoop.mapreduce API) and make isSplitable() return false so that a file is never divided. Each PDF is then received as an individual input split, and these splits can be parsed to extract the text. This link gives a clear example of how to extend FileInputFormat.
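The approach described above might look roughly like the following. This is a minimal sketch, not the code behind the link: it assumes the newer `org.apache.hadoop.mapreduce` API with Hadoop on the classpath, the name `WholeFileRecordReader` is made up for illustration, and it assumes each PDF is small enough to fit in a single in-memory byte array.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Never split a file: a PDF can only be parsed as a whole.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Hypothetical reader: emits exactly one record per file,
    // key = file path, value = raw file bytes.
    static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // the single record was already emitted
            }
            Path file = split.getPath();
            // Assumption: the file fits in one byte array (under 2 GB).
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(file.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // Nothing to release: the stream is closed in nextKeyValue().
        }
    }
}
```

With this in place, a job would register it with `job.setInputFormatClass(WholeFileInputFormat.class)`, and the mapper would receive one `(Text, BytesWritable)` pair per PDF. The bytes still have to be handed to a PDF library inside the mapper to get text out.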
It depends on your splits. I think (I could be wrong) that you'll need each PDF as a whole in order to parse it. There are Java libraries that do this, and Google knows where they are.
Given that, you'll need an approach in which the whole file is available at the moment you parse it. Assuming you want to do that in the mapper, you'd need a reader that hands whole files to the mapper. You could write your own reader to do this, or perhaps one already exists. You could, for example, build a reader that scans the directory of PDFs and passes the name of each file into the mapper as the key and its contents as the value.
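To make the "file name as key, contents as value" idea concrete, here is a small sketch that uses only the Java standard library. It reads from a local directory as a stand-in for HDFS, and the names `PdfDirectoryScanner` and `readWholeFiles` are invented for this illustration; a real Hadoop reader would go through the Hadoop `FileSystem` API instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

final class PdfDirectoryScanner {

    private PdfDirectoryScanner() {
    }

    /**
     * Returns one record per PDF found directly inside dir:
     * key = file name, value = the complete file contents.
     * The "*.pdf" glob may be case-sensitive depending on the platform.
     */
    static Map<String, byte[]> readWholeFiles(Path dir) throws IOException {
        Map<String, byte[]> records = new TreeMap<>();
        try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(dir, "*.pdf")) {
            for (Path pdf : pdfs) {
                if (Files.isRegularFile(pdf)) {
                    // Read the whole file at once so it can be parsed as a unit.
                    records.put(pdf.getFileName().toString(), Files.readAllBytes(pdf));
                }
            }
        }
        return records;
    }
}
```

Each value is the raw byte content of one PDF, which is the form a PDF parsing library would typically accept. Reading whole files this way only works when every file fits comfortably in memory.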