Reading a PDF upload stream one page at a time with Java

Posted on 2024-07-14 02:56:25

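I am trying to read a PDF document in a J2EE application.

For a web application, I have to store PDF documents on disk. To make searching easy, I want to build an inverted index of the text inside each document, provided it has been OCRed.

With the PDFBox library, it is possible to create a PDDocument object that contains an entire PDF file. However, to conserve memory and improve overall performance, I would rather handle the document as a stream and read one page at a time into a buffer.

I wonder whether it is possible to read a file stream containing a PDF page by page, or even one line at a time.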

夜还是长夜 2024-07-21 02:56:25

For a given generic PDF document, at least with PDFBox, you have no way of knowing where one page ends and the next one begins.
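Part of the reason lies in the file format itself: a PDF reader locates the cross-reference table through a `startxref` entry near the end of the file, so it generally needs random access rather than a forward-only stream. The following is a minimal sketch in plain Java, without PDFBox, of reading that entry from the tail of a file. The class name `PdfTail`, its method names, and the 1024-byte tail window are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class PdfTail {

    // Hypothetical helper: reads up to the last 1024 bytes of the file
    // and returns the byte offset written after "startxref", or -1.
    static long findStartXref(RandomAccessFile raf) throws IOException {
        int tailSize = (int) Math.min(1024, raf.length());
        byte[] tail = new byte[tailSize];
        raf.seek(raf.length() - tailSize);
        raf.readFully(tail);
        // ISO-8859-1 maps every byte to one char, so offsets are preserved.
        return parseStartXref(new String(tail, StandardCharsets.ISO_8859_1));
    }

    // Extracts the decimal offset that follows the last "startxref" keyword.
    static long parseStartXref(String tail) {
        int idx = tail.lastIndexOf("startxref");
        if (idx < 0) {
            return -1;
        }
        String rest = tail.substring(idx + "startxref".length()).trim();
        int end = 0;
        while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
            end++;
        }
        return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
    }
}
```

Calling `PdfTail.findStartXref(new RandomAccessFile("test.pdf", "r"))` only yields the offset of the cross-reference data; finding an individual page still means following object references from there, which is why a sequential read of the upload stream does not map onto pages.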

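If your concern is resource usage, I suggest parsing the PDF document as a COSDocument and extracting the parsed objects from it with .getObjects(), which gives you a java.util.List. That should fit easily within whatever scarce resources you have.

Note that you can easily convert the parsed PDF document into a Lucene index through the PDFBox API.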
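To make concrete what such an index stores, here is a minimal sketch of a page-level inverted index in plain Java. It is neither Lucene nor the PDFBox API: the class `PageIndex` and its methods are hypothetical, and it assumes the text of each page has already been extracted by some other means.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class PageIndex {

    // Maps each lower-cased token to the sorted set of pages containing it.
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Splits the page text on anything that is not a letter or digit
    // and records the page number under every resulting token.
    void addPage(int pageNumber, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(pageNumber);
        }
    }

    // Case-insensitive lookup; returns an empty set for unknown words.
    SortedSet<Integer> pagesContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }
}
```

A real search library adds tokenisation rules, stemming, ranking and on-disk storage on top of this idea, so treat the sketch as an illustration of the data structure only.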

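Also, before venturing into the land of optimisation, make sure you really need it. PDFBox can build an in-memory representation of quite large PDF documents without much effort.

To parse a PDF document from an InputStream, look at the COSDocument class.

To write Lucene indexes, look at the LucenePDFDocument class.

For in-memory representations of COSDocuments, look at FDFDocument.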

許願樹丅啲祈禱 2024-07-21 02:56:25

In the 2.0.* versions of PDFBox, open the PDF like this:

    PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering to use only temporary files (no main memory), with no size restriction.

This was answered here.

把梦留给海 2024-07-21 02:56:25

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;
    import java.awt.Image;
    import java.awt.Rectangle;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // memory-map the file read-only instead of copying it onto the heap
    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    // get the width and height for the doc at the default zoom
    Rectangle rect = new Rectangle(0, 0,
            (int) page.getBBox().getWidth(),
            (int) page.getBBox().getHeight());

    // generate the image
    Image img = page.getImage(
            rect.width, rect.height, // width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );
