How to cut and paste from a PDF with non-ASCII encoding?
I have some PDFs, and I am trying to cut and paste the text they contain from Acrobat Reader into an HTML form. Some of these files seem to use (I suspect) Unicode for text encoding, so when I paste into the HTML form (in Firefox) I get the little boxes with hex digits in them rather than readable text. The problem is not that the PDF has not been OCRed -- when I try to run OCR in Acrobat Pro, it says it can't because the file already contains renderable text. Is there any way to deal with this? For example, could I add some JavaScript to the form that would do the conversion?
Comments (9)
Are you able to paste text copied from the file into other programs, such as Notepad or Word?
Some PDF files, even ones produced by Adobe tools, are generated without the information that is crucial for extracting text from them. Basically, such files do not contain glyph-to-character mapping information.
Such files display and print just fine, but the text in them can't be properly copied or extracted.
For example, Distiller produces such files when the "Smallest File Size" preset is used.
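To make the missing-mapping point concrete, here is a toy model in Python. It is not a PDF parser, and the glyph codes and table are made up for illustration. It only assumes what the answer above says: the page stores glyph codes, and copying works only if a table maps those codes back to characters (in real PDFs this role is typically played by a font's ToUnicode CMap). Real viewers use more elaborate fallbacks than the one shown here.

```python
# Toy model, not a PDF parser: a page stores glyph codes, and a per-font
# table maps each code back to a Unicode character.

def extract_text(glyph_codes, to_unicode=None):
    """Return the text a copy operation would yield for these glyph codes."""
    if to_unicode is None:
        # No mapping available: this sketch falls back to treating each
        # glyph code as if it were a character code.
        return "".join(chr(code) for code in glyph_codes)
    # With a mapping, unknown codes become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in glyph_codes)


# Hypothetical subset font: glyph codes assigned in order of first use.
glyph_codes = [1, 2, 3, 4, 5, 6]
to_unicode = {1: "П", 2: "р", 3: "и", 4: "в", 5: "е", 6: "т"}

with_map = extract_text(glyph_codes, to_unicode)
without_map = extract_text(glyph_codes)

print(with_map)           # the readable text
print(repr(without_map))  # control characters, not the original text
```

Both versions would render identically on screen, because rendering only needs the glyph codes. Only the copy step depends on the mapping.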
I have the same problem. It is explained here: http://forums.adobe.com/thread/915012
My solution was to convert the PDF to Word using Acrobat's export tool and then extract the information I needed from the Word file.
It's frustrating, but it works.
Another solution I found is to convert the PDF to images (JPEG, PNG, etc.) and then run OCR on them.
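The image-then-OCR route can also be scripted. The sketch below is not the Acrobat workflow described above; it swaps in two command-line tools, Poppler's `pdftoppm` and `tesseract`, and assumes both are installed and on the PATH. The function names, the `page` file prefix, and the working directory are placeholders chosen for this example.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm command that writes one PNG per page."""
    # pdftoppm names its output <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, lang="eng"):
    """Build the tesseract command; 'stdout' prints the text instead of saving it."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, dpi=300, lang="eng"):
    """Rasterize every page of the PDF, OCR each image, return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, work_dir / "page", dpi), check=True)
    pages = []
    # pdftoppm zero-pads page numbers, so sorting by name keeps page order.
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            ocr_command(image, lang),
            check=True, capture_output=True, text=True, encoding="utf-8",
        )
        pages.append(result.stdout)
    return "\n".join(pages)
```

A call such as `ocr_pdf("input.pdf", "ocr_work", dpi=600, lang="rus")` would OCR a Russian document, provided the matching Tesseract language data is installed. OCR output usually needs proofreading, so this is a fallback for files whose text cannot be copied directly.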
It is quite possible that the characters are copied correctly but your browser is unable to display them for lack of a suitable font. A PDF document may contain embedded fonts, so Adobe Reader displays the characters correctly, but a browser has no access to those fonts.
You can check whether this is the reason by copying and pasting the characters here (that would be useful information about the problem anyway). You could also download and install the Code200x fonts, which contain pretty much any character you can normally expect to encounter. (It is not guaranteed, but probable, that Firefox will use those fonts automatically when needed.)
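Another way to run that check is to look at the code points of the pasted text directly. The following is a minimal sketch using only Python's standard `unicodedata` module; the function names and sample strings are made up for illustration. The assumption is that ordinary assigned characters point to a missing font, while private-use, unassigned, control, or replacement characters suggest the text was not copied correctly in the first place.

```python
import unicodedata


def describe(text):
    """Return one (code point, name, suspicious) tuple per character."""
    rows = []
    for ch in text:
        category = unicodedata.category(ch)
        # Co = private use, Cn = unassigned, Cc = control characters.
        # Ordinary whitespace controls (tab, newline) are not suspicious.
        suspicious = (
            category in ("Co", "Cn", "Cc") and ch not in "\t\n\r"
        ) or ch == "\ufffd"
        name = unicodedata.name(ch, "<no name>")
        rows.append((f"U+{ord(ch):04X}", name, suspicious))
    return rows


def copied_correctly(text):
    """True if no character in the text looks like extraction garbage."""
    return not any(suspicious for _, _, suspicious in describe(text))


# Paste the text from the form into a string and inspect it.
for row in describe("Пр\ue001"):
    print(row)
```

If `copied_correctly` returns True but the form still shows boxes, installing a font with wider coverage is the more promising fix. If it returns False, no font will help and the text has to be recovered another way.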
We had a similar problem trying to copy and paste Cyrillic text from a PDF file into Excel.
The easiest solution we found was to open the .pdf in a browser (Chrome, Mozilla or Opera) and copy and paste the text into Word or Excel from there.
It didn't work with IE, as expected.
If none of the above works for you, as it didn't for me, you can take a screenshot of the PDF and open it with Google Lens (on an Android phone). In the text section, the AI detects the text automatically, and you can then copy it.
I had the same problem, but I solved it by opening the PDF file in a web browser (Chrome in my case).
Copying and pasting non-ASCII text works fine in Chrome.
You can export from Acrobat as JPEG, then open the JPEG in Acrobat (not Reader) and run the OCR tool. From there you should be able to copy and paste.
I am using Nitro Pdf. First I created images at 600 dpi from the PDF. Then I opened the images in a new PDF file and used the OCR option on the Review tab. That produced another PDF file with standard encoding, from which I can copy and paste text.