Reading a PDF upload stream one page at a time with Java

Posted on 2024-07-14 02:56:25

I am trying to read a PDF document in a J2EE application.

For a web application I have to store PDF documents on disk. To make searching easy I want to build a reverse index of the text inside each document, provided the document has been OCR'd.

With the PDFBox library it's possible to create a PDDocument object which contains an entire PDF file. However, to conserve memory and improve overall performance, I'd rather handle the document as a stream and read one page at a time into a buffer.

Is it possible to read a file stream containing a PDF page by page, or even one line at a time?
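
For illustration, here is a minimal sketch of the reverse (inverted) index described above, fed one page of text at a time so that only the current page's text needs to be held in memory. The class name `PageInvertedIndex` and its methods are hypothetical, and the page text is assumed to come from whatever extractor is in use (with PDFBox, for example, a `PDFTextStripper` limited to a single page via `setStartPage` and `setEndPage`). Tokenisation is deliberately naive.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class PageInvertedIndex {
    // term -> sorted set of 1-based page numbers on which the term occurs
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Indexes the text of one page; the caller may discard the text afterwards. */
    public void addPage(int pageNumber, String pageText) {
        // Naive tokenisation: split on anything that is not a letter or a digit.
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
        }
    }

    /** Returns the pages containing the term (case-insensitive), or an empty set. */
    public SortedSet<Integer> pagesContaining(String term) {
        SortedSet<Integer> pages = postings.get(term.toLowerCase(Locale.ROOT));
        return pages == null
                ? Collections.<Integer>emptySortedSet()
                : Collections.unmodifiableSortedSet(pages);
    }

    public static void main(String[] args) {
        PageInvertedIndex index = new PageInvertedIndex();
        // In a real application each string would be the extracted text of one page.
        index.addPage(1, "Invoice for PDF parsing");
        index.addPage(2, "Parsing, again: pdf!");
        System.out.println(index.pagesContaining("pdf"));
    }
}
```

The index itself still grows with the vocabulary of the document, so for large collections a dedicated search library such as Lucene is likely the better fit; the sketch only shows that the indexing step does not need the whole document's text at once.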

Comments (4)

夜还是长夜 2024-07-21 02:56:25

For a generic PDF document you have no way of knowing where one page ends and the next one starts, at least not with PDFBox.

If your concern is resource usage, I suggest you parse the PDF document into a COSDocument and extract the parsed objects with .getObjects(), which gives you a java.util.List. That should be easy to fit into whatever scarce resources you have.

Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox can build an in-memory representation of quite large PDF documents without much effort.

For parsing the PDF document from an InputStream, look at the COSDocument class.

For writing Lucene indexes, look at the LucenePDFDocument class.

For in-memory representations of COSDocuments, look at FDFDocument.

許願樹丅啲祈禱 2024-07-21 02:56:25

In the 2.0.* versions, open the PDF like this:

    PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering so that only temporary files are used (no main memory), with no size limit.

This was answered here.

把梦留给海 2024-07-21 02:56:25

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    //get the width and height for the doc at the default zoom 
    Rectangle rect = new Rectangle(0,0,
            (int)page.getBBox().getWidth(),
            (int)page.getBBox().getHeight());

    //generate the image
    Image img = page.getImage(
            rect.width, rect.height, //width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );

伏妖词 2024-07-21 02:56:25

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
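
As a rough illustration of what such a byte-by-byte scan could look like, the sketch below reads a stream one byte at a time and counts page dictionaries, identified by a `/Type` name followed directly by a `/Page` name. The class `PageMarkerScan` is hypothetical and this is a heuristic only: PDFs from version 1.5 onwards may keep page dictionaries inside compressed object streams, where the scan will not see them, and a page marker does not tell you where that page's content is stored in the file, since content streams are separate objects.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PageMarkerScan {

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE
    private static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // PDF delimiter characters, which also terminate a name token
    private static boolean isDelimiter(int b) {
        return "()<>[]{}/%".indexOf(b) >= 0;
    }

    /** Counts "/Type /Page" pairs (but not "/Type /Pages") in an uncompressed byte stream. */
    public static int countPageMarkers(InputStream in) throws IOException {
        int count = 0;
        StringBuilder name = null;      // name token being read, or null when outside a name
        boolean typeKeyPending = false; // previous token was /Type, only whitespace since
        while (true) {
            int b = in.read();
            if (name != null && (b == -1 || isWhitespace(b) || isDelimiter(b))) {
                // The current name token is complete.
                String token = name.toString();
                if (typeKeyPending && token.equals("Page")) {
                    count++;
                }
                typeKeyPending = token.equals("Type");
                name = null;
            }
            if (b == -1) {
                break;
            }
            if (b == '/') {
                name = new StringBuilder();
            } else if (name != null) {
                if (name.length() < 16) { // cap memory use on binary junk
                    name.append((char) b);
                }
            } else if (!isWhitespace(b)) {
                typeKeyPending = false; // some other token followed /Type
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<< /Type /Pages /Count 1 >> << /Type /Page /Parent 1 0 R >>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(countPageMarkers(new ByteArrayInputStream(sample)));
    }
}
```

For real documents a PDF library remains the safer route; the sketch only shows that the scan itself can run in constant memory.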
