Extracting text from a PDF with Poppler (C++)

Posted 2024-08-30 15:01:03

I'm trying to get my way through Poppler and its (lack of) documentation.

What I want to do is a very simple thing: open a PDF file and read the text in it. I'm then going to process the text, but that doesn't really matter here.

So... I saw the poppler_page_get_text function, and it kind of works, but I have to specify a selection rectangle, which is not very handy. Isn't there just a very simple function that would output the PDF text in order (maybe line by line)?

Comments (2)

土豪我们做朋友吧 2024-09-06 15:01:03

You should be able to set the selection rectangle to the pageSize/MediaBox of the page and get all the text.

I say "should" because, before you start wondering why the output of poppler_page_get_text surprises you, you should be aware of how text gets laid out on a page. All graphics are laid out on a page using a program expressed in postfix notation. To render the page, this program is executed on a blank page.

Operations in the program can include changing colors, position, or the current transformation matrix, and drawing lines, Bézier curves, and so on. Text is laid out by a series of text operators that are always bracketed by BT (begin text) and ET (end text). How or where text is placed on a page is at the sole discretion of the software that generates the PDF. For example, for print drivers, the code responds to GDI calls for DrawString and translates them into text drawing operations.
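To make the postfix idea concrete, here is a minimal sketch (not how Poppler does it) that walks an uncompressed content-stream fragment and collects the strings shown with the Tj operator between BT and ET. The function name shown_strings is made up for this illustration, and the sketch assumes plain literal strings only: no escape sequences, no nested parentheses, no TJ arrays, no hex strings, and no font decoding.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True for the whitespace characters this sketch treats as token separators.
static bool is_space(char c)
{
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

// Hypothetical helper: returns the literal strings shown with Tj inside
// BT ... ET blocks. Assumes an uncompressed stream and simple "(...)" strings.
std::vector<std::string> shown_strings(const std::string &stream)
{
    std::vector<std::string> out;
    std::vector<std::string> operands; // postfix: operands come before the operator
    bool in_text = false;
    std::size_t i = 0;

    while (i < stream.size()) {
        const char c = stream[i];
        if (is_space(c)) {
            ++i;
            continue;
        }
        if (c == '(') {
            // Literal string operand; escapes and nesting are ignored here.
            const std::size_t close = stream.find(')', i + 1);
            if (close == std::string::npos)
                break; // unterminated string, stop scanning
            operands.push_back(stream.substr(i + 1, close - i - 1));
            i = close + 1;
            continue;
        }

        // Any other token runs up to the next whitespace or '('.
        std::size_t j = i;
        while (j < stream.size() && !is_space(stream[j]) && stream[j] != '(')
            ++j;
        const std::string token = stream.substr(i, j - i);
        i = j;

        const bool is_operator =
            (token[0] >= 'A' && token[0] <= 'Z') || (token[0] >= 'a' && token[0] <= 'z');
        if (!is_operator) {
            operands.push_back(token); // number or /Name: keep as an operand
            continue;
        }

        if (token == "BT") {
            in_text = true;
        } else if (token == "ET") {
            in_text = false;
        } else if (token == "Tj" && in_text && !operands.empty()) {
            out.push_back(operands.back()); // Tj shows its single string operand
        }
        operands.clear(); // every operator consumes its operands
    }
    return out;
}
```

Calling shown_strings on "BT /F1 12 Tf 72 712 Td (Hello) Tj ET" yields one entry, "Hello". Note that the result follows the order of the operators in the stream, which is not necessarily the reading order.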

If you are lucky, the text on the page is laid out in a sane order with sane font usage, but many programs that generate PDF aren't so kind. Psroff, for example, liked to place all the plain text first, then the italic text, then the bold text. Words may or may not be placed in reading order. Fonts may be re-encoded so that 'a' maps to '{' or whatever. Then you might have ligatures where multiple characters are replaced by single glyphs - the most common ones are ae, oe, fi, fl, and ffl.
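As a rough illustration of two of these problems, the sketch below regroups positioned text runs into lines and expands a few ligature code points. Everything in it is an assumption made for the example: the Fragment type, the function names, and the 2-unit line tolerance are hypothetical, the input is taken to be UTF-8, and fragments are assumed to carry a baseline position in PDF user space (y grows upward). It does nothing about re-encoded fonts, and it leaves ae and oe alone because æ and œ are ordinary letters in some languages.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A positioned run of text; hypothetical type for this sketch.
struct Fragment {
    double x;         // left edge
    double y;         // baseline (larger y is higher on the page)
    std::string text; // UTF-8
};

// Replaces the Unicode ligature code points U+FB00..U+FB04 with plain letters.
std::string expand_ligatures(std::string s)
{
    static const std::pair<const char *, const char *> table[] = {
        {"\xEF\xAC\x80", "ff"},  // U+FB00
        {"\xEF\xAC\x81", "fi"},  // U+FB01
        {"\xEF\xAC\x82", "fl"},  // U+FB02
        {"\xEF\xAC\x83", "ffi"}, // U+FB03
        {"\xEF\xAC\x84", "ffl"}, // U+FB04
    };
    for (const auto &entry : table) {
        const std::string from = entry.first;
        const std::string to = entry.second;
        for (std::size_t pos = s.find(from); pos != std::string::npos;
             pos = s.find(from, pos + to.size()))
            s.replace(pos, from.size(), to);
    }
    return s;
}

// Groups fragments whose baselines lie within line_tolerance of the first
// fragment of a line, orders each line left to right, and joins with spaces.
std::string in_reading_order(std::vector<Fragment> fragments, double line_tolerance = 2.0)
{
    // Top of the page first.
    std::stable_sort(fragments.begin(), fragments.end(),
                     [](const Fragment &a, const Fragment &b) { return a.y > b.y; });

    std::string out;
    std::size_t begin = 0;
    while (begin < fragments.size()) {
        // Find the end of the current line.
        std::size_t end = begin + 1;
        while (end < fragments.size() &&
               fragments[begin].y - fragments[end].y <= line_tolerance)
            ++end;

        // Within a line, order left to right.
        std::stable_sort(fragments.begin() + begin, fragments.begin() + end,
                         [](const Fragment &a, const Fragment &b) { return a.x < b.x; });

        for (std::size_t k = begin; k < end; ++k) {
            if (k > begin)
                out += ' ';
            out += expand_ligatures(fragments[k].text);
        }
        out += '\n';
        begin = end;
    }
    return out;
}
```

This only handles simple single-column pages; multi-column layouts, rotated text, and tables need more than a sort by position.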

With all of this in place, the process of extracting text is decidedly non-trivial, so don't be surprised if you see poor quality results from text extraction.

I used to work on the text extraction tools in Acrobat 1.0 and 2.0 - it's a real challenge to get right.

甜心 2024-09-06 15:01:03

Just for the record, I am using poppler right now with this little program:

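#include <iostream>
#include <memory>

#include "poppler-document.h"
#include "poppler-page.h"
using namespace std;

int main()
{
    // load_from_file returns a pointer the caller owns, or nullptr on failure.
    unique_ptr<poppler::document> doc(poppler::document::load_from_file("./CMI2APIDocV1.4.pdf"));
    if (!doc) {
        cerr << "could not open the PDF" << endl;
        return 1;
    }
    const int pagesNbr = doc->pages();
    cout << "page count: " << pagesNbr << endl;

    for (int i = 0; i < pagesNbr; ++i) {
        // create_page also hands ownership to the caller (nullptr on failure).
        unique_ptr<poppler::page> page(doc->create_page(i));
        if (page)
            cout << page->text().to_latin1() << endl; // Latin-1 only; other characters are lost
    }
}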

// g++ -I/usr/include/poppler/cpp/ -c poppler.cpp
// g++ -I/usr/include/poppler/cpp poppler.o  /usr/lib/x86_64-linux-gnu/libpoppler-cpp.a /usr/lib/x86_64-linux-gnu/libpoppler.a /usr/lib/x86_64-linux-gnu/liblcms2.so     /usr/lib/x86_64-linux-gnu/libfontconfig.a /usr/lib/x86_64-linux-gnu/libjpeg.a /usr/lib/x86_64-linux-gnu/libfreetype.a     /usr/lib/x86_64-linux-gnu/libexpat.a /usr/lib/x86_64-linux-gnu/libz.a

I am quite happy with the result so far, except for how tables and "spreadsheet" layouts come out in plain text, where a single cell sometimes spans multiple lines. (Does anyone know how to avoid that?)
