Different results for the same image depending on download source
I'm building a scraper that fetches images from a Telegram channel and then uses Tesseract to OCR the text. During testing, I manually downloaded the images from the channel using Telegram's web interface (Windows 8.1, Chrome, right-click, "Save as", etc.) and ran Tesseract on them.
The results were perfect using a simple:
ocr_test = pytesseract.image_to_string(image).strip()
I have since incorporated a Telegram listener using Telethon, which downloads the same images through the Telegram API.
The results for these images are much, much worse. I'm using the same PC, spec, environment, software versions, etc. There are 30 images in total and the issue occurs on all of them.
What causes this? Is there a way around it?
I can set about pre-processing the images but that would be annoying given the original results.
Comments (1)
They are NOT the same images.
The Chrome image is 449 x 800 and the API image is 719 x 1280, which leads to totally different letter sizes.
Additionally, JPEG is a lossy format that is poorly suited to OCR, and it produces different compression artifacts at different image sizes.
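One possible workaround, offered only as a sketch: scale the API image down to the height of the Chrome download before running OCR, so the letter sizes are comparable. The helper names `scaled_size` and `ocr_at_height` and the default target height of 800 are assumptions made for illustration; the approach also assumes Pillow and pytesseract are installed, and it is not guaranteed to reproduce the Chrome results exactly, since the two files carry different JPEG artifacts.

```python
def scaled_size(size, target_height):
    """Return (width, height) scaled to target_height, keeping the aspect ratio."""
    width, height = size
    scale = target_height / height
    return (round(width * scale), target_height)


def ocr_at_height(path, target_height=800):
    """Resize the image at `path` to target_height, then run Tesseract on it.

    Hypothetical helper: the target height of 800 is taken from the size of
    the Chrome download mentioned above, not from any Telegram or Tesseract
    requirement.
    """
    # Imported here so scaled_size() can be used without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        new_size = scaled_size(image.size, target_height)
        # LANCZOS is a high-quality resampling filter for downscaling.
        resized = image.resize(new_size, Image.LANCZOS)
        return pytesseract.image_to_string(resized).strip()


if __name__ == "__main__":
    # The 719 x 1280 API image maps onto the 449 x 800 Chrome dimensions.
    print(scaled_size((719, 1280), 800))
```

If the resized results are still worse than the Chrome downloads, comparing both files' dimensions and trying a few other target heights may show which letter size Tesseract handles best for this channel's images.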