Different results for the same image depending on download source

Posted on 2025-02-11 08:34:29

I'm building a scraper that gets images from a Telegram channel and then uses Tesseract to OCR the text. During testing I manually downloaded the images from the channel using Telegram's web interface (Windows 8.1, Chrome, right click, save as, etc.) and ran Tesseract on them.

The results were perfect using a simple:

ocr_test = pytesseract.image_to_string(image).strip()

I have since added a Telegram listener using Telethon, which downloads the same images through the Telegram API.

The results for these images are much, much worse. I'm using the same PC, spec, environment, software versions, etc. There are 30 images in total and the issue occurs on all of them.

What causes this? Is there a way around it?

I could start pre-processing the images, but that would be annoying given how good the original results were.
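Before pre-processing anything, it may be worth checking whether the two downloads really are the same file. The sketch below is only a diagnostic under stated assumptions: it uses the Python standard library alone, and the helper names (`sniff_format`, `describe`, `compare_downloads`) are made up for illustration and are not part of Telethon or pytesseract. It reports the byte size, a container format guessed from the leading bytes, and a SHA-256 hash for each file.

```python
import hashlib
from pathlib import Path

# Leading-byte signatures for two common image containers.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
)


def sniff_format(data: bytes) -> str:
    """Guess the container format from the first bytes of the file."""
    for prefix, name in _SIGNATURES:
        if data.startswith(prefix):
            return name
    # WebP files start with "RIFF", a 4-byte length, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return "unknown"


def describe(path):
    """Return size, guessed format, and SHA-256 hash for one file."""
    data = Path(path).read_bytes()
    return {
        "bytes": len(data),
        "format": sniff_format(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def compare_downloads(manual_path, api_path):
    """Describe both files and flag whether they are byte-identical."""
    manual = describe(manual_path)
    api = describe(api_path)
    return {
        "manual": manual,
        "api": api,
        "identical": manual["sha256"] == api["sha256"],
    }
```

Calling `compare_downloads("manual.jpg", "telethon.jpg")` (both file names are hypothetical) on one image from each source would show whether the files differ at all. If the sizes or formats differ, the two download paths are not delivering the same data, which would narrow down where to look next.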
