PDFBox: digits extracted as garbage
I ran into a problem when using PDFBox to extract text. My PDF contains embedded Type 3 fonts, and the digits set in those fonts are not extracted correctly. Can someone give me some guidance? Thank you.
My PDFBox version is 2.0.22.
The correct output is [USD-001]; the output I actually get is [USD- ].
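import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String readPDF(File file) throws IOException {
    // try-with-resources closes the input even if parsing fails
    try (RandomAccessBufferedFileInputStream rbi = new RandomAccessBufferedFileInputStream(file)) {
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false); // strict parsing, as in the original attempt
        parser.parse();
        // The document must be closed as well once the text has been read
        try (PDDocument pdDocument = parser.getPDDocument()) {
            PDFTextStripper textStripper = new PDFTextStripper();
            return textStripper.getText(pdDocument);
        }
    }
}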
I also tried rendering the PDF to an image with PDFBox, and that works fine. I just want to get the content as plain text.
The PDF file: http://tmp.link/f/6249a07f6e47f
Comments (1)
Several aspects of this file make text extraction difficult.
First of all, the font itself boycotts text extraction. In its ToUnicode stream, the two character codes of interest are both mapped to U+0000 instead of U+0030 ('0') and U+0031 ('1'), as they should have been.
The Encoding does not help either: the glyph names /g121 and /g122 have no standardized meaning. PDFBox relies on these two font properties (ToUnicode and Encoding) for text extraction and therefore fails here.
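To make the failure concrete, here is a minimal sketch of a two-stage lookup: try the ToUnicode mapping first, then fall back to the glyph name from the Encoding. It is a simplified model and not PDFBox's actual implementation. The class `GlyphTextLookup`, the character codes 1 and 2, the two-entry glyph table, and the decision to treat U+0000 as unusable are all assumptions made for illustration.

```java
import java.util.Map;

/** Simplified model of resolving a character code to text; not PDFBox's actual code. */
final class GlyphTextLookup {

    /** Tiny excerpt of a standard glyph-name table; real glyph lists are far larger. */
    static final Map<String, String> GLYPH_LIST = Map.of("zero", "0", "one", "1");

    /**
     * Tries the ToUnicode mapping first, then the glyph name from the Encoding.
     * Returns null when neither source yields usable text.
     */
    static String resolve(int code, Map<Integer, String> toUnicode, Map<Integer, String> encoding) {
        String mapped = toUnicode.get(code);
        // Assumption of this sketch: a mapping to U+0000 counts as unusable.
        if (mapped != null && !mapped.equals("\u0000")) {
            return mapped;
        }
        String glyphName = encoding.get(code);
        return glyphName == null ? null : GLYPH_LIST.get(glyphName);
    }

    public static void main(String[] args) {
        // Hypothetical character codes 1 and 2; the real codes in the file may differ.
        Map<Integer, String> brokenToUnicode = Map.of(1, "\u0000", 2, "\u0000");
        Map<Integer, String> brokenEncoding = Map.of(1, "g121", 2, "g122");
        System.out.println(resolve(1, brokenToUnicode, brokenEncoding)); // prints null

        // With a standard glyph name, the fallback alone would be enough.
        Map<Integer, String> saneEncoding = Map.of(1, "zero", 2, "one");
        System.out.println(resolve(1, Map.of(), saneEncoding)); // prints 0
    }
}
```

With both sources broken, as in this font, nothing in the font data lets an extractor recover the digit, which matches the [USD- ] output.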
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
The file does contain ActualText entries. Unfortunately they are malformed: for the digit '0', for example, the operands preceding BDC are a name, a dictionary, a name, and a dictionary. The BDC instruction expects only a single name and a single dictionary, so this sequence is invalid.
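The operand rule can be expressed as a small check. This is a toy sketch, not a PDF parser: the class `BdcOperandCheck` and its `Kind` enum are hypothetical, and it covers only the form described above (one name followed by one inline dictionary).

```java
import java.util.List;

/** Toy check of the operands preceding a BDC operator; not a PDF parser. */
final class BdcOperandCheck {

    /** Kinds of operand tokens this sketch distinguishes. */
    enum Kind { NAME, DICTIONARY, OTHER }

    /** True only for exactly one name followed by exactly one dictionary. */
    static boolean isValid(List<Kind> operands) {
        return operands.equals(List.of(Kind.NAME, Kind.DICTIONARY));
    }

    public static void main(String[] args) {
        // Well-formed: tag name plus property dictionary.
        System.out.println(isValid(List.of(Kind.NAME, Kind.DICTIONARY))); // prints true

        // The shape found in this file: name, dictionary, name, dictionary.
        System.out.println(isValid(
                List.of(Kind.NAME, Kind.DICTIONARY, Kind.NAME, Kind.DICTIONARY))); // prints false
    }
}
```

An extractor that validates operands this strictly would discard the malformed ActualText entry, while a more tolerant one might use only the last name and dictionary pair.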
Because of that, Adobe Acrobat also used to not extract the actual text here. Only recently, probably as of the early 2022 releases, did Acrobat start extracting a '0' here.
In fact, one known "trick" to keep regular text extractors from extracting the text of a PDF is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So the error in your file may actually be an application of this trick, perhaps even by design: the erroneous ActualText twist leads text extractors with some ActualText support astray while still allowing copy and paste from Adobe Acrobat.