I can recommend PDF Hummus. It's a free, cross-platform C++ library for creating and editing PDFs: https://github.com/galkahana/PDF-Writer. You just need to build the library and link it and its dependencies into your code. You can do simple things, but you can also dive deep and create complex PDFs. The documentation is fairly good and is available on the project's GitHub wiki.
Just so you understand the scope of what you're getting into, "basic editing" of PDF content is nearly always non-trivial.
Page content in PDF is represented by short RPN programs that paint on the page. It's a small language similar to PostScript in semantics, but without looping structures or function definitions (so there is no halting problem). In a sane world, your text on the page is going to be represented by something like this:
BT /F1 12 Tf 72 720 Td (this is a text in a pdf document) Tj ET
which when translated into something more familiar, is this:
BeginText();
SetFont(F1, 12.0); // Font 1, 12.0 pt
TextMoveTo(72, 720);
ShowText("this is a text in a pdf document");
EndText();
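To make the "short RPN program" idea concrete, here is a minimal sketch of reading such a stream: operands are collected until an operator arrives, and the operator then consumes them. The names `tokenize`, `parse`, `describe` and `Op` are made up for this illustration and do not come from any PDF library. The sketch assumes a stream containing only names, numbers, operators and literal strings; arrays, hex strings, comments and inline images are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One operator together with the operands that preceded it in the stream.
struct Op {
    std::string name;                   // e.g. "Tf", "Td", "Tj"
    std::vector<std::string> operands;  // raw operand tokens, in stream order
};

// Split a content stream into raw tokens. Literal strings keep their
// parentheses and escapes; nested balanced parentheses are allowed.
std::vector<std::string> tokenize(const std::string& s) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[i]))) { ++i; continue; }
        std::string tok;
        if (s[i] == '(') {
            int depth = 0;
            do {
                if (s[i] == '\\' && i + 1 < s.size()) tok += s[i++];  // keep the backslash
                else if (s[i] == '(') ++depth;
                else if (s[i] == ')') --depth;
                tok += s[i++];  // the current character, or the escaped one
            } while (i < s.size() && depth > 0);
        } else {
            while (i < s.size() && s[i] != '(' &&
                   !std::isspace(static_cast<unsigned char>(s[i]))) {
                tok += s[i++];
            }
        }
        tokens.push_back(tok);
    }
    return tokens;
}

// Names, strings and numbers are operands; every other token is an operator.
bool isOperand(const std::string& t) {
    assert(!t.empty());  // tokenize() never emits empty tokens
    const unsigned char c = static_cast<unsigned char>(t[0]);
    return c == '/' || c == '(' || c == '-' || c == '+' || c == '.' || std::isdigit(c);
}

// RPN evaluation in miniature: push operands, let each operator take them all.
std::vector<Op> parse(const std::string& stream) {
    std::vector<Op> ops;
    std::vector<std::string> pending;
    for (const std::string& t : tokenize(stream)) {
        if (isOperand(t)) {
            pending.push_back(t);
        } else {
            ops.push_back({t, pending});
            pending.clear();
        }
    }
    return ops;
}

// Render an Op in call-like notation, e.g. "Td(72, 720)".
std::string describe(const Op& op) {
    std::string out = op.name + "(";
    for (std::size_t i = 0; i < op.operands.size(); ++i) {
        if (i > 0) out += ", ";
        out += op.operands[i];
    }
    return out + ")";
}
```

Feeding the one-line stream above to `parse` yields five operators (BT, Tf, Td, Tj, ET), each carrying the operands written before it, which is the same structure as the call sequence shown.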
So in this case, you have to transform this into something like this:
BeginText();
SetFont(F1, 12.0); // Font 1, 12.0 pt
TextMoveTo(72, 720);
ShowText("this is a ");
SetFont(F2, 12);
ShowText("text");
SetFont(F1, 12);
ShowText(" in a pdf document");
EndText();
which would become:
BT /F1 12 Tf 72 720 Td (this is a ) Tj /F2 12 Tf (text) Tj /F1 12 Tf
( in a pdf document) Tj ET
in the equivalent PDF. The problem is many-fold:
1. You have to extract out the page and all its resources (non-trivial)
2. You have to generate a new page, inserting new resources (you're adding a new font), embedding the font if allowable
3. Alter the content stream of the page to include your changed content.
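For the well-behaved single-Tj case shown earlier, the last item in this list can be sketched as follows. This is only an illustration under strong assumptions: the text arrives as one plain string, the alternate font (here `F2`) has already been added to the page's resources, the font uses a single-byte encoding so string bytes map directly to characters, and no kerning is involved. The function names `pdfLiteral` and `restyleWord` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Write a PDF literal string, escaping the characters that are special
// inside parentheses.
std::string pdfLiteral(const std::string& s) {
    std::string out = "(";
    for (char c : s) {
        if (c == '(' || c == ')' || c == '\\') out += '\\';
        out += c;
    }
    return out + ")";
}

// A stretch of text shown in a single font.
struct Run {
    std::string font;
    std::string text;
};

// Emit a text object that shows `text` at (x, y) in `baseFont`, with the
// first occurrence of `word` switched to `altFont`.
std::string restyleWord(const std::string& text, const std::string& word,
                        const std::string& baseFont, const std::string& altFont,
                        double size, double x, double y) {
    assert(size > 0.0);
    std::vector<Run> runs;
    const std::size_t pos = word.empty() ? std::string::npos : text.find(word);
    if (pos == std::string::npos) {
        runs.push_back({baseFont, text});  // nothing to restyle
    } else {
        if (pos > 0) runs.push_back({baseFont, text.substr(0, pos)});
        runs.push_back({altFont, word});
        const std::size_t end = pos + word.size();
        if (end < text.size()) runs.push_back({baseFont, text.substr(end)});
    }

    std::ostringstream os;
    os << "BT";
    std::string current;
    bool positioned = false;
    for (const Run& run : runs) {
        if (run.font != current) {
            os << " /" << run.font << ' ' << size << " Tf";
            current = run.font;
        }
        if (!positioned) {
            // Only the first run needs a position; each Tj advances the
            // text position past the glyphs it has just shown.
            os << ' ' << x << ' ' << y << " Td";
            positioned = true;
        }
        os << ' ' << pdfLiteral(run.text) << " Tj";
    }
    // The font setting outlives ET, so switch back if the alternate font
    // was the last one selected.
    if (current != baseFont) os << " /" << baseFont << ' ' << size << " Tf";
    os << " ET";
    return os.str();
}
```

Calling `restyleWord("this is a text in a pdf document", "text", "F1", "F2", 12, 72, 720)` reproduces the stream shown above, on a single line. It only works because the input was a single, unkerned string to begin with.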
And 3 is where you're going to get hung up, because there are an infinite number of ways to generate a page that has the content you describe, and even with a decent library you're going to have a hard time handling maybe 70% of them. Let me briefly describe why this is as bad as it sounds. There are PDF generation programs (I'm looking at you, troff) that lay all the plain text on a page first, then lay all the italic text, then all the bold text. I swear, I'm not making this up. Some programs want to lay text down very precisely, so if you're lucky, they'll use the TJ operator, which lays out text with specific kerning. If you're not lucky (which is most of the time), they'll instead lay out the text with a set of moves before every single glyph on the page. And what if your text is laid out on a curve or in an unusual orientation (maps, ads)? What about the cases where someone subtly changes the font size for a greater distinction between upper and lower case, or simulates small caps?
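A small sketch of why even finding the text is awkward: the same visible words can arrive as one Tj, as a kerned TJ array, or as one Tj per glyph. The hypothetical `extractText` below simply concatenates every literal string in the order it appears in the stream, ignoring operators and positions. It assumes literal strings only (no hex strings, no octal escapes, no parentheses inside comments) and a single-byte encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Concatenate the contents of every literal string "( ... )" in the stream,
// in stream order. Numbers, names and operators are skipped entirely.
std::string extractText(const std::string& stream) {
    std::string out;
    std::size_t i = 0;
    while (i < stream.size()) {
        if (stream[i] != '(') { ++i; continue; }
        ++i;            // skip the opening parenthesis
        int depth = 1;  // balanced parentheses may nest inside a string
        while (i < stream.size() && depth > 0) {
            const char c = stream[i++];
            if (c == '\\' && i < stream.size()) {
                const char e = stream[i++];
                // Only a few escapes are decoded; anything else is taken literally.
                out += (e == 'n') ? '\n' : (e == 'r') ? '\r' : (e == 't') ? '\t' : e;
            } else if (c == '(') {
                ++depth;
                out += c;
            } else if (c == ')') {
                if (--depth > 0) out += c;  // the final ')' is not part of the text
            } else {
                out += c;
            }
        }
        assert(depth >= 0);  // depth is only decremented while it is positive
    }
    return out;
}
```

With this, `(this is a text) Tj` and `[(th) -15 (is is a te) 20 (xt)] TJ` both come back as "this is a text", which covers the lucky cases. It falls apart on the troff-style ordering: a stream that paints `(this is a ) Tj`, then moves and paints `(in a pdf document) Tj`, then moves back and paints `(text ) Tj` in another font comes back as "this is a in a pdf documenttext ". Getting the reading order right means tracking the position set by every Td and every glyph advance, which this sketch makes no attempt to do.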
This is why, when I wrote the find text tool for Acrobat 1.0, it took me two months of sweat to handle as many of the edge cases as I could. This is not editing text - it's just trying to find a single word or phrase.
I'm not going to recommend a library for you - sorry - I gave xpdf a brief look and it's not clear whether it has PDF generation capabilities or is simply a consumer of PDF. PdfLib, which is a commercial product, appears to generate PDF, although it's not clear whether it can consume it, but you could certainly get both sides by gluing them together.
If it were me, I would use tools that I've developed and I'd still be a little shy of this task. My library is being used by Atalasoft, the company I work for, to generate PDFs from whole cloth and to do editing within a very limited domain (annotations, document metadata). The hardest part is that we do our very best to hide the complexity of PDF from our customers. In general, our customers want us to understand the spec instead of them and make the rest easy - but tasks like this (redaction is another one) are really hard to do without understanding the depth of the PDF specification. If you start entering the library world of PDF manipulation, you should start by reading the spec, especially chapter 8 (Graphics) and chapter 9 (Text), and you'll get a better understanding of what you're going to have to do with the library.
Although not a library in the traditional sense, Pdfedit has scriptable editing capabilities, but it requires Qt. PodoFo probably fits your requirements best. There's also PdfHummus.
But beware that if you're expecting to edit text in PDFs generated by tools outside your control, you'll probably face some issues. The problem is - and @plinth mentioned it - that there are many ways to generate text that looks similar on the page but is structured very differently in the content stream.
xpdf is a read-only PDF library. It can't write PDF, much less modify contents.
Have you looked at iText/iTextSharp for editing PDF files?