
How can I extract the text layer and the background layer from a PDF?

Posted 2024-12-03 15:25:35, 448 words, 1 view, 0 comments

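In my project I have to build a PDF viewer with HTML5/CSS3, and the application must let users add comments and annotations. In effect, I have to build something very similar to crocodoc.com.

At first I wanted to create images from the PDF and let users define an area and comment on that area. Unfortunately, the client also wants to navigate within the PDF and add annotations only on permitted parts, such as paragraphs or selected text.

The problem I now face is how to get at the text, and what the best way to do that is. If anyone has any clues about how I could achieve this, I would appreciate it.

I tried pdftohtml, but the output does not look like the original document, which is quite complex (document example). Even this one does not really reflect the output, but it is much better than pdftohtml.

I am open to any solution, with a preference for the command line under Linux.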


寂寞陪衬 2024-12-10 15:25:35

I've been down the same road as you, with even more complex tasks.

After trying out everything, I ended up using C# under Mono (so it runs on Linux) with iTextSharp.

Even with a very complete library such as iTextSharp, some tasks required a lot of trial and error :)

Extracting the text from a page is easy (check the snippet below); however, if you intend to keep the text coordinates, fonts and sizes, you will have more work to do.

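using iTextSharp.text.pdf;

int pdf_page = 5;        // 1-based number of the page to read
string page_text = "";

PdfReader reader = new PdfReader("path/to/pdf/file.pdf");
// Tokenise the raw content stream of the page.
PRTokeniser token = new PRTokeniser(reader.GetPageContent(pdf_page));
while(token.NextToken())
{
    if(token.TokenType == PRTokeniser.TokType.STRING)
    {
        // String operands hold the text that the page draws.
        page_text += token.StringValue;
    }
    else if(token.StringValue == "Tj")
    {
        // "Tj" is the show-text operator; separate consecutive runs with a space.
        page_text += " ";
    }
}
reader.Close();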

Do a Console.WriteLine(token.StringValue) on all tokens to see how paragraphs of text are structured in PDFs. This way you can detect coordinates, font, font size, etc.
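To give an idea of what that token dump tends to reveal, here is a minimal sketch in Python rather than iTextSharp. It walks a hand-written, simplified content stream and records the font, size and position in effect at each Tj. The sample stream, the regular expression and the name text_runs are assumptions made for illustration; real content streams also use TJ arrays, Tm matrices, escaped parentheses and font encodings, all of which this ignores.

```python
import re

# A hand-written, simplified content stream (an assumption for illustration).
SAMPLE = """
BT
/F1 12 Tf
72 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Tokens: literal strings without nested parentheses, names, numbers, operators.
TOKEN = re.compile(r"\(([^()]*)\)|(/[^\s()/]+)|([-+]?\d*\.?\d+)|([A-Za-z*']+)")


def text_runs(stream):
    """Return one dict per Tj with the font, size and position in effect."""
    runs, operands = [], []
    font, size, x, y = None, None, 0.0, 0.0
    for match in TOKEN.finditer(stream):
        string, name, number, op = match.groups()
        if string is not None:
            operands.append(string)
        elif name is not None:
            operands.append(name)
        elif number is not None:
            operands.append(float(number))
        else:
            if op == "BT":        # begin text object: reset the position
                x, y = 0.0, 0.0
            elif op == "Tf":      # /Font size Tf selects font and size
                font, size = operands[-2], operands[-1]
            elif op == "Td":      # tx ty Td moves relative to the line start
                x, y = x + operands[-2], y + operands[-1]
            elif op == "Tj":      # (string) Tj shows text
                runs.append({"text": operands[-1], "font": font,
                             "size": size, "x": x, "y": y})
            operands = []         # every operator consumes its operands
    return runs


for run in text_runs(SAMPLE):
    print(run)
```

The same bookkeeping applies to the token loop above: remember the operands seen since the last operator, and update the current font and position whenever Tf or Td goes by.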

Addition:

Given the task you are required to do, I have a suggestion for you:

Extract the text together with its coordinates, font families and sizes, i.e. all the information about each paragraph. Then do a PDF-to-image conversion and, in your online viewer, apply invisible selectable text over the paragraphs on the image where needed.

This way your users can select a part of the text where needed, without you having to reconstruct the whole PDF in HTML :)
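A rough sketch of that overlay idea, in Python for brevity: it turns extracted text runs into absolutely positioned, transparent spans stacked over the page image. The function name overlay_html, the run dictionary keys, the image file name and the 792 pt page height are all assumptions for illustration. It also assumes each run's y is the text baseline in PDF points measured from the bottom of the page, and that one point maps to scale CSS pixels (96/72 would match a 96 dpi render).

```python
from html import escape


def overlay_html(runs, page_height_pt, image_url, scale=1.0):
    """Build HTML that stacks transparent, selectable text over a page image."""
    spans = []
    for run in runs:
        # PDF y grows upward from the bottom edge; CSS top grows downward.
        # With y taken as the baseline, the box top sits about one font size above it.
        left = run["x"] * scale
        top = (page_height_pt - run["y"] - run["size"]) * scale
        spans.append(
            '<span style="position:absolute;left:{:.1f}px;top:{:.1f}px;'
            'font-size:{:.1f}px;color:transparent;white-space:pre">{}</span>'.format(
                left, top, run["size"] * scale, escape(run["text"])))
    return ('<div style="position:relative">'
            '<img src="{}" alt="page">{}</div>').format(
                escape(image_url), "".join(spans))


runs = [{"text": "Hello <PDF>", "x": 72.0, "y": 720.0, "size": 12.0}]
page = overlay_html(runs, page_height_pt=792.0, image_url="page-5.png")
print(page)
```

Because the spans are real DOM text, the browser's normal selection works on them, and the selected range can be used as the anchor for a comment or annotation.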


A君 2024-12-10 15:25:35

I recently researched and discovered a native PHP solution to achieve this using FOSS. The FPDI PHP class can be used to import a PDF document for use with either the TCPDF or FPDF PHP classes, both of which provide functionality for creating, reading, updating and writing PDF documents. Personally, I prefer TCPDF over FPDF as it provides a larger feature set, a richer API, more usage examples and a more active community forum.

Choose one of the aforementioned classes, or another, to programmatically handle PDF documents. Focusing on both current and possible future deliverables, as well as the desired user experience, decide where (e.g. server - PHP, client - JavaScript, both) and to what extent (feature driven) your interactive logic should be implemented.

Personally, I would use a TCPDF instance obtained by importing a PDF document via FPDI to iteratively inspect the document, translate it to a common format (XML, JSON, etc.) and store the resulting representation in relational tables designed to persist data pertinent to the desired level of document hierarchy and detail. The necessary level of detail is often dictated by a specifications document and its mention of both current and possible future deliverables.

Note: In this case, I strongly advise translating documents and storing them in a common format to create a layer of abstraction and transparency. For example, a possible and unforeseen future deliverable might be to provide the same application functionality for users uploading Microsoft Word documents. If the uploaded Microsoft Word document were not translated and stored in a common format, then updates to the Web service API and dependent business logic would almost certainly be necessary. This ultimately results in storing bloated, sub-optimal data and in inefficient use of development resources for designing, developing and supporting multiple translators. It would also be an inefficient use of server resources to translate outbound data for every request, as opposed to translating inbound data to an optimal format only once.
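As a rough illustration of what such a common format could look like, here is a small Python sketch that models a document as pages of positioned text runs and round-trips it through JSON. The class names, field names and the source_format field are assumptions made for this example, not something prescribed by TCPDF or FPDI.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TextRun:
    text: str
    x: float
    y: float
    font: str
    size: float


@dataclass
class Page:
    number: int
    width: float
    height: float
    runs: List[TextRun] = field(default_factory=list)


@dataclass
class Document:
    title: str
    source_format: str   # e.g. "pdf" today, possibly "docx" later
    pages: List[Page] = field(default_factory=list)


def to_json(doc):
    """Serialise the whole document tree into the common format."""
    return json.dumps(asdict(doc), indent=2, sort_keys=True)


def from_json(payload):
    """Rebuild the document tree from the common format."""
    raw = json.loads(payload)
    pages = [Page(number=p["number"], width=p["width"], height=p["height"],
                  runs=[TextRun(**r) for r in p["runs"]])
             for p in raw["pages"]]
    return Document(title=raw["title"], source_format=raw["source_format"],
                    pages=pages)


doc = Document("spec", "pdf", [
    Page(1, 612.0, 792.0, [TextRun("Hello", 72.0, 720.0, "Helvetica", 12.0)]),
])
payload = to_json(doc)
print(payload)
```

Everything downstream (storage, API, viewer) then depends only on this shape, so adding another input type means writing one more translator into it.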

I would then extend the base document tables by designing and relating additional tables for persisting functionality-specific document asset data such as:

Versioned Additions / Edits / Deletions

What
- Header / Footer
- Text
  - Original Value
  - New Value
- Image
  - Page(s) (one, many or all)
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - File (relative or absolute directory or url)
- Brush (drawing)
  - Page(s) (one, many or all)
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - Shape (x/y coordinates to redraw line, square, circle, user defined, etc.)
  - Type (pen, pencil, marker, etc.)
  - Weight (1px, 3px, 5px, etc.)
  - Color
- Annotation
  - Page
  - Location (relative - textual anchor, absolute - x/y coordinates)
  - Shape (line, square, circle, user defined, etc.)
  - Value (annotation text)
- Comment
  - Target (page, another text/image/brush/annotation asset, parent comment - threading)
  - Value (comment text)
When
- Date
- Time
Who
- User
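The outline above could map onto tables roughly as follows. This is only a sketch using Python's built-in sqlite3 module so that it is easy to try out; the table names, column names and the single asset table are my assumptions, and a real design would likely split the asset kinds into separate tables and keep full version history.

```python
import sqlite3

SCHEMA = """
CREATE TABLE document (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE asset (
    id          INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    kind        TEXT NOT NULL
                CHECK (kind IN ('text', 'image', 'brush', 'annotation', 'comment')),
    page        INTEGER,                          -- NULL means all pages
    anchor      TEXT,                             -- relative location (textual anchor)
    x           REAL,                             -- absolute location
    y           REAL,
    value       TEXT,                             -- new text, annotation or comment body
    parent_id   INTEGER REFERENCES asset(id),     -- target asset, for threading
    version     INTEGER NOT NULL DEFAULT 1,
    created_by  TEXT NOT NULL,                    -- who
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP   -- when
);
"""


def open_store(path=":memory:"):
    """Open a database and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_asset(conn, document_id, kind, created_by, value, page=None,
              anchor=None, x=None, y=None, parent_id=None):
    """Insert one asset row and return its id."""
    cur = conn.execute(
        "INSERT INTO asset (document_id, kind, page, anchor, x, y, value,"
        " parent_id, created_by) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (document_id, kind, page, anchor, x, y, value, parent_id, created_by))
    return cur.lastrowid


conn = open_store()
doc_id = conn.execute(
    "INSERT INTO document (title) VALUES (?)", ("spec.pdf",)).lastrowid
note_id = add_asset(conn, doc_id, "annotation", "alice", "Check this figure",
                    page=5, x=72.0, y=720.0)
add_asset(conn, doc_id, "comment", "bob", "Agreed", parent_id=note_id)
thread = conn.execute(
    "SELECT kind, value, created_by FROM asset WHERE document_id = ? ORDER BY id",
    (doc_id,)).fetchall()
print(thread)
```

Pointing a comment's parent_id at an annotation or at another comment is what gives the threading mentioned under Target.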

Once some, all or more of the document and its asset data has a place to persist, I would design, document and develop a PHP Web service API that exposes CRUD and PDF document upload functionality to the UI consumer while enforcing core business rules. At this point the remaining work lies on the client side. I now have relational tables persisting both a document and its asset data, as well as an API exposing sufficient functionality to the consumer, in this case the client-side JavaScript.

I can now design and develop a client-side application using the latest Web technologies such as HTML5, JavaScript and CSS3. I can upload and request PDF documents using the Web service API and easily render the returned common format to the browser however I decide (probably as HTML in this case). I can then use 100% native JavaScript and/or third-party libraries for DOM helper functionality, for creating vector graphics to provide drawing and annotation features, and for accessing and controlling functional and stylistic attributes of the currently selected document text and/or images. I can provide a real-time collaborative experience by employing WebSockets (the aforementioned Web service API does not apply), or a semi-delayed but still fairly seamless experience using XMLHttpRequest.

From this point forward the sky is the limit and the ball is in your court!
