Wrong encoding with Camelot

Posted on 2025-01-11 01:16:28

I am using Camelot to parse a document. To keep it simple, I am now debugging with the most basic command:

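import camelot

# file_path points to the PDF being parsed
all_pages = camelot.read_pdf(str(file_path))
for table_info in all_pages:
    df = table_info.df  # each detected table as a pandas DataFrame
    print(df)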

I am applying this to two different PDFs, which look very much the same. Their metadata is identical:

Producer: Acrobat Distiller 17.0 (Windows)
Creator: PScript5.dll Version 5.2.2
Format: PDF-1.3
Size: A4, Portrait (210 × 297 mm)

Only the date and size of the documents differ. Both contain a table with the same layout; it varies only slightly in size. Even the data within the cells is the same! (I can't attach a PDF, but here is a jpg version):

With the older PDF file things go well, and I get words, numbers, etc. But with the newer one I only get weird encoding stuff like "(cid:12)(cid:13)(cid:14)".

I have looked through the documentation, but I can't find anything related to this problem or to encoding in general.
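
To make the symptom easier to pin down, here is a minimal sketch that counts how many extracted cells contain "(cid:N)" placeholders. Such tokens typically appear when the text extractor cannot map a font's glyph IDs to Unicode characters, which would fit two files that look identical but embed their fonts differently. The helper names (find_cids, cid_fraction) and the sample cell values are hypothetical, and the sketch assumes the cells are available as plain strings, for example from df.values.ravel().

```python
import re

# Matches placeholders of the form "(cid:12)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def find_cids(text):
    """Return the numeric IDs of all (cid:N) placeholders found in text."""
    return [int(n) for n in CID_TOKEN.findall(text)]


def cid_fraction(cells):
    """Fraction of non-empty cells containing at least one (cid:N) placeholder."""
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for c in non_empty if CID_TOKEN.search(c))
    return hits / len(non_empty)


# Hypothetical cell contents standing in for the tables from the two PDFs.
older = ["Name", "42", "Total"]
newer = ["(cid:12)(cid:13)(cid:14)", "", "(cid:7)"]

print(find_cids(newer[0]))
print(cid_fraction(older), cid_fraction(newer))
```

A fraction near 1 for the newer file would suggest that the table detection itself works and only the character mapping fails, which narrows the problem to the fonts in that PDF rather than to the table layout.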
