Tesseract-generated searchable PDF: reducing color depth from 8-bit back to 1-bit (tess4j)

Posted 2025-01-14 18:09:17

I have PDFs with 1-bit color depth as input for OCR processing (tess4j 5.0.0), approximately 30 KB each. After processing, each PDF is 120-130 KB and is saved with 8-bit color depth, which is probably the main cause of the file size increase.

Is there a way to set the color depth within Tesseract or its associated libraries, or is there another way to handle this?

import net.sourceforge.tess4j.*;
import java.util.*;

// Configure tess4j and request a searchable PDF (page image plus invisible text layer).
ITesseract instance = new Tesseract();
instance.setDatapath("/path/to/tessdata");
instance.setPageSegMode(ITessAPI.TessPageSegMode.PSM_SINGLE_COLUMN);
List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF));
instance.createDocumentsWithResults(inputPdf.getPath(), "/path/to/result", formats, ITessAPI.TessPageIteratorLevel.RIL_WORD);

Any help is greatly appreciated.

失而复得 2025-01-21 18:09:17

Eventually, I came up with a workaround: the output can be controlled through the RenderedFormat. I changed it from PDF to PDF_TEXTONLY, which produced a PDF (~7 KB) with the text in the right position but without the original scan image.

List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF_TEXTONLY));

Then I used PDFBox to render each page of the original PDF to an image. The DPI can be specified, which also helps reduce the file size.

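import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.tools.imageio.ImageIOUtil;
import java.awt.image.BufferedImage;

// PDFBox 2.x API (PDDocument.load); try-with-resources closes the document even on failure.
try (PDDocument document = PDDocument.load(inputPdf)) {
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        // ImageType.BINARY renders the page as 1-bit black and white at 300 DPI.
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY);
        ImageIOUtil.writeImage(bim, "/path/to/pics/picture_" + page + ".png", 300);
    }
}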
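
To illustrate what rendering with ImageType.BINARY buys, here is a minimal sketch that uses only the Java standard library (no PDFBox) to threshold an 8-bit grayscale image into a 1-bit one and report the bits per pixel and PNG-encoded sizes. The class name BitDepthDemo, its helper methods, the fixed threshold of 128, and the synthetic striped test page are all assumptions made for illustration; they are not part of the original answer, and the sizes printed depend on the image content.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class BitDepthDemo {

    /** Thresholds any image into a 1-bit black-and-white copy. */
    public static BufferedImage toBinary(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int avg = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                // Fixed mid-point threshold (an assumption); real scans may need a smarter one.
                dst.setRGB(x, y, avg < 128 ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return dst;
    }

    /** Encodes the image as PNG in memory and returns the number of bytes written. */
    public static int pngSize(BufferedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.size();
    }

    /** Hypothetical test data: an 8-bit grayscale page with black bars on white. */
    public static BufferedImage samplePage(int w, int h) {
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            boolean bar = (y % 40) < 10;
            for (int x = 0; x < w; x++) {
                img.setRGB(x, y, bar ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage gray = samplePage(600, 800);
        BufferedImage binary = toBinary(gray);
        System.out.println("gray bits/pixel:   " + gray.getColorModel().getPixelSize());
        System.out.println("binary bits/pixel: " + binary.getColorModel().getPixelSize());
        System.out.println("gray PNG bytes:    " + pngSize(gray));
        System.out.println("binary PNG bytes:  " + pngSize(binary));
    }
}
```

The same check (getColorModel().getPixelSize()) can be applied to the BufferedImage returned by renderImageWithDPI to confirm that the rendered page really is 1-bit before it is written out.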

Then add the image to the text-only PDF as a background, in the manner of a watermark (see "How to insert a image under the text as a pdf background using iText?"). At 300 DPI this reduced the size from 120-130 KB to about 60 KB (even less at lower DPI), which is good for an OCR-processed PDF whose original was 30 KB. I know this is not the best solution, and I'd welcome any other contribution or answer.
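
The influence of DPI and bit depth on the image payload can be estimated with simple arithmetic. The sketch below computes the uncompressed raster size of a single page; the A4 page dimensions (8.27 x 11.69 inches) and the names RasterSizeEstimate and rawBytes are assumptions for illustration only. Actual files are much smaller because PNG and PDF compress the pixel data, but the ratios show why both 1-bit output and a lower DPI help.

```java
public class RasterSizeEstimate {

    /**
     * Uncompressed size in bytes of a page raster.
     * Each pixel row is padded up to a whole number of bytes.
     */
    public static long rawBytes(double widthIn, double heightIn, int dpi, int bitsPerPixel) {
        long widthPx = Math.round(widthIn * dpi);
        long heightPx = Math.round(heightIn * dpi);
        long bytesPerRow = (widthPx * bitsPerPixel + 7) / 8;
        return bytesPerRow * heightPx;
    }

    public static void main(String[] args) {
        double w = 8.27, h = 11.69; // A4 in inches (assumed page size)
        for (int dpi : new int[] {150, 300}) {
            System.out.printf("%d DPI: 1-bit = %d bytes, 8-bit = %d bytes%n",
                    dpi, rawBytes(w, h, dpi, 1), rawBytes(w, h, dpi, 8));
        }
    }
}
```

Halving the DPI cuts the pixel count to roughly a quarter, and going from 8-bit to 1-bit cuts the raw data to roughly an eighth, which is consistent with the size reduction reported above.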
