Copying and pasting text from a PDF produces garbage

Posted 2024-09-02 22:30:09

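I'm writing my master's thesis, an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. Some PDF files are not extracted correctly. For those, the extractor (PDFBox library) returns a string like this: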

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

or

“10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17”

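I have checked every file that causes this extraction problem, and in all of them the text cannot be copied and pasted from a PDF reader (Adobe Reader and FoxIt Reader) either. The files display fine in these readers, but after selecting their content and copying it to the clipboard I get the same wrong text (as shown above: a semantically meaningless string of characters, or a string of digits and letters).

Can anybody help me?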


手心的海 2024-09-09 22:30:09

Very often in such cases, where you can't select, copy and paste text from the Acrobat (Reader) window, there is another option that may work nevertheless:

1. Open the 'File' menu,
2. select 'Save as...',
3. select 'Text (normal) (*.txt)',
4. browse to the target directory,
5. type the name you want to use for the text file.

You'll get all the text from all pages of the file and will then need to locate the spot you originally wanted to copy and paste, so it is not as convenient as direct copy and paste. But it works more reliably.

It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

Update

You can use the pdffonts command line utility to get a quick analysis of the fonts used by a PDF.

Here is an example output, which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide PDF sample files which are well commented and can easily be opened in a text editor:

$ pdffonts  textextract-bad2.pdf
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn).
However, only for font /Helvetica is there a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.
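
If you have many files to check (as in an extraction pipeline), reading this table by eye gets tedious. Here is a minimal Python sketch that parses pdffonts output like the one above and lists the fonts without a /ToUnicode table. The names FontRow, parse_pdffonts and fonts_without_tounicode are made up for this illustration, and the parsing assumes the column layout shown above; other pdffonts versions may print something different.

```python
from typing import NamedTuple


class FontRow(NamedTuple):
    """One row of the pdffonts table (hypothetical helper type)."""
    name: str
    type: str
    encoding: str
    embedded: bool
    subset: bool
    has_tounicode: bool


def parse_pdffonts(output: str) -> list[FontRow]:
    """Parse the table printed by pdffonts into FontRow records.

    Assumes the layout shown above: a header line, a ruler line made of
    dashes, then one row per font.
    """
    lines = output.splitlines()
    # Data rows start right after the ruler line (only dashes and spaces).
    start = next(
        (i + 1 for i, line in enumerate(lines)
         if line.strip() and set(line.strip()) <= {"-", " "}),
        len(lines),  # no ruler found: treat as "no rows"
    )
    rows = []
    for line in lines[start:]:
        tokens = line.split()
        if len(tokens) < 8:
            continue  # blank or unexpected line
        # Read from the right, because the type column may contain spaces
        # (e.g. "Type 1"). The last six tokens are:
        # encoding, emb, sub, uni, object number, generation number.
        encoding, emb, sub, uni = tokens[-6:-2]
        name = tokens[0]
        font_type = " ".join(tokens[1:-6])
        rows.append(FontRow(name, font_type, encoding,
                            emb == "yes", sub == "yes", uni == "yes"))
    return rows


def fonts_without_tounicode(output: str) -> list[str]:
    """Return the names of fonts whose uni column says 'no'."""
    return [row.name for row in parse_pdffonts(output) if not row.has_tounicode]


SAMPLE = """\
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""

print(fonts_without_tounicode(SAMPLE))  # ['CAAAAA+Helvetica-Bold']
```

To use it on real files you could capture the tool's output yourself, for example with subprocess.run(["pdffonts", path], capture_output=True, text=True), and pass the resulting stdout to parse_pdffonts.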

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
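
To make the idea of this reverse mapping concrete, here is a toy Python model. It is not how a real PDF stores the table (that is a CMap stream) and not how PDFBox or any reader is implemented; the codes, the dict and the extract function are invented for illustration only. It just shows why text comes out as garbage when an extractor has nothing better to do than guess what the character codes mean.

```python
# Character codes as they might appear in a content stream. With a subset
# font the producer may assign codes arbitrarily, so they need not match
# the letters they draw. (Invented example values.)
codes = b"Tq&&c"

# Toy stand-in for a /ToUnicode table: character code -> Unicode character.
to_unicode = {0x54: "H", 0x71: "e", 0x26: "l", 0x63: "o"}


def extract(codes, to_unicode=None):
    """Turn character codes into text, with or without a reverse mapping."""
    if to_unicode is not None:
        # U+FFFD marks codes that the table does not cover.
        return "".join(to_unicode.get(code, "\ufffd") for code in codes)
    # No table: this toy extractor simply guesses that the codes are
    # Windows-1252 characters, which is only an assumption.
    return codes.decode("cp1252", errors="replace")


print(extract(codes, to_unicode))  # Hello
print(extract(codes))              # Tq&&c
```

The page would still display "Hello" in both cases, because drawing only needs the glyphs; it is the way back from codes to characters that is missing.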

A missing /ToUnicode table for a specific font is almost always a sign that text strings using this font cannot be extracted or copied and pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as also demonstrated by a few companion files in the GitHub repository mentioned above.)
