Fighting OCR reverse engineering

Posted on 2025-01-01 23:20:05

I am referring to software-based OCR, that is, image-to-text conversion tools. Stack Overflow has plenty of posts on building OCR, but I am looking for the opposite: guidance on how to protect my images from being reverse engineered.

For example, I have images that contain only text. How can I make it difficult for anyone to decode the data? Is there an image format that can do this, or can the images be obfuscated?

Can special fonts or distortion guarantee protection against OCR? Note that my requirements do not allow heavily distorted text to be served.

Any direction would be very helpful.

Comments (4)

相对绾红妆 2025-01-08 23:20:05

As I understand it, you have a collection of copyrighted text that should be clearly readable by humans, but you don't want it to leak from your server in electronic form. I don't think obfuscating the text to make OCR harder is a good idea, because it will also make the text hard for humans to read, especially if the texts are long. Basically, whatever is easy for humans to read can be OCR-ed almost perfectly, and whatever is difficult to OCR is difficult for people too. In the worst case, an attacker could hire an Indian company to retype the text manually, which is actually not that expensive.

I would suggest looking at other aspects of protection. What does your use case look like? How do users end up with your texts as images on their PCs? Do they simply download them as PDF or image files? If so, it would be much simpler to fight the ability to DOWNLOAD your files than to make them unreadable.

For example, you could avoid giving access to the whole file at once and instead show it page by page, with human interaction required to reach the next page. You could even scramble your web interface so that typical site-download utilities cannot fetch everything. Each page should be displayed at the same URL, while the actual navigation talks to the server over AJAX or even some proprietary interface.
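One way to picture the page-by-page idea is a server that hands out a single-use token with every page, so the next page can only be requested by a client that actually received the previous one. The following is a minimal sketch under stated assumptions, not a definitive implementation: the `PagedDocument` class, its method names, and the token scheme are all hypothetical, and in a real site `fetch` would sit behind the AJAX endpoint that the single page URL calls.

```python
import secrets


class PagedDocument:
    """Serves a document one page at a time using single-use tokens (hypothetical design)."""

    def __init__(self, pages):
        self.pages = list(pages)
        # Maps each outstanding token to the index of the page it unlocks.
        self._tokens = {}

    def start(self):
        """Begin a reading session: return a token for the first page."""
        return self._issue(0)

    def _issue(self, index):
        token = secrets.token_urlsafe(16)
        self._tokens[token] = index
        return token

    def fetch(self, token):
        """Return (page_text, next_token); the token is consumed on use."""
        index = self._tokens.pop(token, None)
        if index is None:
            raise PermissionError("unknown or already used token")
        has_next = index + 1 < len(self.pages)
        next_token = self._issue(index + 1) if has_next else None
        return self.pages[index], next_token


doc = PagedDocument(["page one", "page two", "page three"])
token = doc.start()
text, token = doc.fetch(token)  # first page, plus a token for the second
```

Because there are no stable per-page URLs, a generic site downloader has nothing to enumerate. A scraper written specifically for this site can still follow the token chain, so this raises the cost of bulk download without preventing it.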

Another way is to put a lot of false links on every page. They are invisible to humans, but they mislead download utilities into fetching tons of wrong content, or into fetching it in the wrong order, which makes the result unusable.
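The false-link idea can be sketched as follows. This is an illustration under assumptions rather than a recommended implementation: the `DecoyLinks` class, the `/p/...` path layout, and the choice to flag a client after a single trap hit are all hypothetical.

```python
import html
import secrets


class DecoyLinks:
    """Generates hidden trap links and flags clients that follow them (hypothetical design)."""

    def __init__(self):
        self._trap_paths = set()
        self.flagged_clients = set()

    def render_nav(self, real_href, decoys=3):
        """Return HTML with several hidden decoy links and one visible link."""
        parts = []
        for _ in range(decoys):
            path = "/p/" + secrets.token_hex(8)
            self._trap_paths.add(path)
            # Hidden from sighted users and from assistive technology.
            parts.append(
                f'<a href="{path}" style="display:none" tabindex="-1" '
                f'aria-hidden="true" rel="nofollow">next</a>'
            )
        parts.append(f'<a href="{html.escape(real_href, quote=True)}">next</a>')
        return "\n".join(parts)

    def handle_request(self, client_id, path):
        """Return True if the request is allowed, False once the client is flagged."""
        if path in self._trap_paths:
            self.flagged_clients.add(client_id)
        return client_id not in self.flagged_clients
```

A downloader that follows every `href` it finds will hit a trap path and can then be served junk or blocked. A crawler that evaluates CSS can skip `display:none` links, so this only stops the simpler tools.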

And if you succeed in fighting automated download, you won't even have to provide your content as an image. It can be plain text, served only in small pieces, and any single piece will be of little use on its own.

I hope that gives you some idea of which way to go.

输什么也不输骨气 2025-01-08 23:20:05

As I and others have said, making a large amount of text obscure enough that OCR can't read it will also make it impractical for humans to read.

Is there a specific threat you're trying to beat? Simple web crawlers often don't execute JavaScript, so a dumb way to make your text harder to scrape would be to load it with an AJAX request and insert it into the DOM.
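As a rough sketch of that AJAX approach, the server can return an HTML shell that contains no article text, plus a separate JSON endpoint that the page's script calls. The `/fragment` path, the `respond` routing function, and the sample text are assumptions made for illustration; `respond` could be called from the request handler of whatever web framework is in use.

```python
import json

ARTICLE_TEXT = "The protected paragraph goes here."

# The shell page carries no article text; the embedded script fetches it later.
SHELL_HTML = """<!doctype html>
<title>Article</title>
<div id="content">Loading...</div>
<script>
  // Fetch the text after page load and insert it into the DOM.
  fetch("/fragment")
    .then(function (response) { return response.json(); })
    .then(function (data) {
      document.getElementById("content").textContent = data.text;
    });
</script>
"""


def respond(path):
    """Return (content_type, body) for a request path (hypothetical routing)."""
    if path == "/":
        return "text/html; charset=utf-8", SHELL_HTML
    if path == "/fragment":
        return "application/json", json.dumps({"text": ARTICLE_TEXT})
    raise KeyError(path)
```

A crawler that only downloads the HTML for `/` never sees the text. Anyone who inspects the page's network requests can call `/fragment` directly, so this deters only crawlers that do not run scripts.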

Or, if you want to go further, you could display the text in a Flash or Silverlight control. That is still not OCR-proof, but it would make it non-trivial to grab large amounts of text automatically, particularly if you have a Flash scrollbar and/or pagination. (I should point out that Flash controls for something as simple as text sound annoying to use, won't be searchable or bookmarkable, and obviously won't work on the majority of mobile devices.)

最好是你 2025-01-08 23:20:05

I do not think you can do that. For CAPTCHAs, yes, and there is plenty of research on them, but you will also know from personal experience how annoying they are to read. For longer text it is impossible. I would seriously question the use case or business model here, though. You have some content that for some reason needs protection from OCR. That means somebody is willing to spend resources to OCR your content. Why would you fight those people? Make them customers and offer the content in plain text for a fee. If that fee is less than their OCR cost, you have a win-win. What you are trying to implement sounds like a lose-lose.

撩动你心 2025-01-08 23:20:05

I have seen some pages obfuscate text by inserting invisible letters and other "noise" into it. This way you can still display it as text while making it a lot harder to copy.
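A minimal sketch of the invisible-noise idea, assuming the noise is hidden with inline CSS. The function name `add_hidden_noise` and its parameters are hypothetical, and the pages mentioned above may well have used a different hiding technique.

```python
import html
import random
import string


def add_hidden_noise(text, seed=None, noise_len=3):
    """Follow each word with a hidden span of random letters."""
    rng = random.Random(seed)
    pieces = []
    for word in text.split():
        noise = "".join(rng.choice(string.ascii_lowercase) for _ in range(noise_len))
        pieces.append(html.escape(word))
        # Hidden by CSS, but still present in the markup.
        pieces.append(f'<span style="display:none" aria-hidden="true">{noise}</span>')
    return " ".join(pieces)
```

A scraper that simply strips tags ends up with the noise mixed into the words. Whether a browser's copy command picks up the noise depends on how it is hidden, and OCR on a screenshot is not affected at all.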

Another idea might be to watermark the text in some way so that you can identify where a "stolen" copy came from. Whether this is useful depends on exactly what you want to be protected from. As has already been mentioned, if it is readable, someone could copy it manually.
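For text served as text, one possible watermark is a user ID encoded in zero-width Unicode characters. The sketch below is only an illustration of that one approach: the function names, the 16-bit width, and the placement after the first word are all assumptions.

```python
ZERO = "\u200b"  # zero-width space encodes bit 0
ONE = "\u200c"   # zero-width non-joiner encodes bit 1


def embed_watermark(text, user_id, bits=16):
    """Insert user_id as invisible characters after the first word."""
    if not 0 <= user_id < 2 ** bits:
        raise ValueError("user_id does not fit in the watermark")
    mark = "".join(ONE if (user_id >> i) & 1 else ZERO for i in reversed(range(bits)))
    head, sep, tail = text.partition(" ")
    return head + mark + sep + tail


def extract_watermark(text):
    """Recover the embedded integer, or None if no mark is present."""
    digits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return int(digits, 2) if digits else None


marked = embed_watermark("confidential report text", 1234)
```

The mark survives a plain copy and paste of the text, but it is lost if the text is retyped, run through OCR, or stripped of non-printing characters. Content served as images would need a visual watermark instead.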
