Copying and pasting text from a PDF produces garbage

Posted on 2024-09-02 22:30:09

I am writing my Master's thesis on an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. A few PDF files cannot be extracted correctly. For these, the extractor (which uses the PDFBox library) returns a string like this:

"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"

or

"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"

I checked every file that causes this extraction problem, and in all of them the text cannot be copied and pasted from a PDF reader (Adobe Reader and Foxit Reader) either. The files display correctly in these readers, but after selecting the content and copying it to the clipboard I get the same wrong text as described above: strings of meaningless characters, or strings of digits and letters.

Can anybody help me?

Comments (7)

手心的海 2024-09-09 22:30:09

Very often in such cases, where you can't select, copy and paste text from the Acrobat (Reader) window, there is another option that may work nevertheless:

  • Open 'File' menu,
  • select 'Save as...',
  • select 'Text (normal) (*.txt)',
  • browse to the target directory,
  • type the name you want to use for the text file.

You'll get all the text from all pages of the file, and you will need to locate the spot you originally wanted to copy and paste, so it is not as comfortable as direct copy and paste. But it works more reliably.

It also works with acroread on Linux (but there you have to choose 'Save as text...' from the File menu).

Update

You can use the pdffonts command line utility to get a quick analysis of the fonts used by a PDF.

Here is an example output which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide PDF sample files which are well commented and can easily be opened in a text editor:

$ pdffonts  textextract-bad2.pdf
  name                            type         encoding    emb sub uni object ID
  ------------------------------- ------------ ----------- --- --- --- ---------
  BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
  CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0

How to interpret this table?

  • The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
  • Both fonts are of type TrueType.
  • Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn).
    However, only font /Helvetica has a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.

The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.

A missing /ToUnicode table for a specific font is almost always an indicator that text strings using this font cannot be extracted or copied and pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete. This is seen in many real-world PDF files, and is also demonstrated by a few companion files in the GitHub repository mentioned above.)
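
The check described above can be automated. The following is a minimal sketch, not part of the original answer: it parses pdffonts output text and lists the fonts whose uni column reads no. The function name fonts_without_tounicode is hypothetical, and the parsing assumes the column layout of the sample output shown above (a dashed ruler line, then one font per row, with font names that contain no spaces).

```python
SAMPLE = """\
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
"""


def fonts_without_tounicode(pdffonts_output):
    """Return the names of fonts whose 'uni' column is 'no'.

    Assumes the layout shown in SAMPLE; this is a sketch, not a robust parser.
    """
    missing = []
    seen_ruler = False
    for line in pdffonts_output.splitlines():
        stripped = line.strip()
        if not seen_ruler:
            # Skip the header; data rows start after the dashed ruler line.
            seen_ruler = stripped.startswith("---")
            continue
        if not stripped:
            continue
        # Split from the right: emb, sub, uni, object number, generation.
        # Everything left of those (name, type, encoding) stays in `rest`.
        fields = stripped.rsplit(None, 5)
        if len(fields) != 6:
            continue
        rest, emb, sub, uni, obj_num, obj_gen = fields
        if uni == "no":
            missing.append(rest.split()[0])
    return missing


print(fonts_without_tounicode(SAMPLE))
```

For the sample table this reports CAAAAA+Helvetica-Bold as the font likely to cause extraction trouble. On a real file, the input text could be obtained with subprocess.run(["pdffonts", path], capture_output=True, text=True).stdout, assuming pdffonts is installed and on the PATH.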

牵你手 2024-09-09 22:30:09

If you are able to successfully select and copy the text in Adobe Reader (which indicates that the PDF does contain text objects), but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.

The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.

My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
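
To illustrate why a missing or wrong mapping yields garbage, here is a toy model. It is not PDF-parsing code and does not reflect how PDFBox works internally: the character codes, the mapping table and the function name extract are all invented for illustration. The idea is that the page content stores character codes, the viewer draws glyphs by code, and text extraction needs a code-to-Unicode table to get back to characters.

```python
# Toy model: the page content stores character codes, not Unicode text.
# These codes are made-up values that a subsetted font might assign.
content_codes = [0x21, 0x22, 0x23, 0x23, 0x24]

# A correct reverse mapping from each code to the character it stands for.
to_unicode = {0x21: "H", 0x22: "e", 0x23: "l", 0x24: "o"}


def extract(codes, mapping=None):
    """Decode codes with the mapping; without one, guess from the raw codes."""
    if mapping is None:
        # No reverse mapping: treat each code as if it were a character value.
        return "".join(chr(c) for c in codes)
    # Codes missing from the table become U+FFFD (the replacement character).
    return "".join(mapping.get(c, "\ufffd") for c in codes)


print(extract(content_codes, to_unicode))  # Hello
print(extract(content_codes))              # !"##$
```

The page would still render as "Hello" in both cases, because rendering only needs the glyphs. Only the copied or extracted text differs.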

第七度阳光i 2024-09-09 22:30:09

When the PDF is opened as a Gmail attachment in Chrome (using its internal PDF viewer), copying yields normal, readable characters!

It worked for me when I had this problem and for others as well. I think the Chrome PDF viewer uses the Google Drive OCR automatically... It's like magic!

寄居人 2024-09-09 22:30:09

What was the PDF created with? Some PDFs do not contain any encoding information, just the data needed to draw the page. In that case there is no way to extract the text.

给妤﹃绝世温柔 2024-09-09 22:30:09

Select the text you wish to copy.
Right click
Choose option "Export Selection as"
In the dialog box, choose a file name and save the new file as Rich Text Format (RTF)
Open RTF to see your text!

一抹淡然 2024-09-09 22:30:09

The best way to deal with this (assuming you have Adobe Acrobat or something similar; I am not sure whether Reader can do this) is to save the document as JPEG images. Then recombine all the images into a single PDF and use the OCR function to recognize the text in the pages. After that you can copy and paste the text.

孤单情人 2024-09-09 22:30:09

PDF is not a text document. It's more of a vector graphic format that sometimes can contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.
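
In an automated pipeline, one way to decide when to fall back to OCR is to test whether the extracted text looks like natural language at all. The following is a rough heuristic sketch, not something proposed in the thread: the function name looks_garbled and the 0.5 threshold are assumptions, and the rule (share of whitespace-separated tokens that consist only of letters) would need tuning for real documents, especially ones with many numbers or formulas.

```python
import string


def looks_garbled(text, min_word_ratio=0.5):
    """Heuristic: True if too few tokens look like ordinary words.

    A token counts as a word if, after stripping ASCII punctuation from
    its ends, it consists only of letters. The threshold is an assumption.
    """
    tokens = text.split()
    if not tokens:
        # Nothing was extracted at all, so treat the page as unusable.
        return True
    words = sum(1 for t in tokens if t.strip(string.punctuation).isalpha())
    return words / len(tokens) < min_word_ratio


# The two sample strings from the question, plus a normal sentence.
sample_1 = '┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h'
sample_2 = "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
normal = "It is extracting plain text from PDF files."

for candidate in (sample_1, sample_2, normal):
    print(looks_garbled(candidate))
```

Pages flagged this way could then be rendered to images and passed to an OCR engine, while the rest keep the faster direct extraction.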
