Breaking CAPTCHAs for a noble cause
CAPTCHAs that ask users to read distorted text are fine for sighted people, but a terrible barrier for those who are blind or have other disabilities. Audio alternatives are occasionally available, but they still don't help people who are both deaf and blind, and they can be hard to use with a screen reader (which is already reading words to you).
There exist a couple of solutions that use humans to solve the CAPTCHA on behalf of the user, such as WebVisium and Solona (http://www.solona.net/learn/how.php), but these rely on the availability of volunteer operators (for example, Solona apparently has just one volunteer, so you have to hope he is awake when you want help).
It occurs to me that the volume of CAPTCHA solutions needed by blind people is very low - I'd guess less than a few hundred per day in a populous country like the UK. This means that unlike the bad folks who want to perform an action many times in a short period, a CAPTCHA assistance service for blind people could afford to devote considerable computational resource - for example, a cloud of computers in Amazon EC2 - to identifying the presented text.
My question is this: assuming you don't care about speed very much, and you have lots of computers available, are there algorithms that let you solve the text-distortion CAPTCHAs that are common today, such as those used by reCaptcha? Or are these problems really intractable even with lots of resource and time?
A few notes:
At this point, my question is just theoretical, but clearly any such service would have to carefully control access to keep spammers out. Perhaps only registered blind people would be allowed to use it.
I am aware that an old Yahoo CAPTCHA was broken a few years ago using an algorithm that runs in seconds on a single computer. I am asking whether modern CAPTCHAs can be broken, perhaps more slowly and with more resource.
I am aware that some new CAPTCHA types are appearing, which ask users to identify kittens or orient a picture. These aren't widespread yet, so I'm just asking about text-distortion for now.
Basically solving a text distortion CAPTCHA consists of three individual steps:
The only one of these that is still pretty hard for computers is the second. The first usually isn't very hard, unless you happen to stumble upon the CAPTCHA from hell. And the third is solved by computers with a much better success rate than by humans.
An interesting site for learning how CAPTCHAs are broken is the one by the OCR Research Team.
CAPTCHAs were created to keep machines from reading the words. They are meant to be read by humans only. Making them more readable for blind/deaf people adds a risk of machines being able to understand them again, thus nullifying their effect.
Spammers did find a very effective way to break the more popular CAPTCHAs, though. They just hire cheap labourers to read them, in return for a few cents per working account. As a result, there's a small industry around breaking CAPTCHAs to create millions of accounts that can then be used to send more spam. Compared to the amount gained by the spammers, the cost is almost nothing. A similar solution could be used by blind/deaf people, who would send the CAPTCHA image to some cheap labourer in China or wherever, who would reply with the correct words so that the blind/deaf person can proceed. Unfortunately, blind people need this service only a few times while spammers need a continuous flow, so those labourers will prefer to work for spammers instead. (The pay is better.) Still, the best solution would be to send the CAPTCHA to some friend, let them read and/or decipher it and return the answer.
The ReCAPTCHA style also reads out the words. A simple speech recognition application might be able to recognise whatever is said, although speech recognition still needs more optimization. Still, you might want to work from that angle, getting the application to listen to the audio clip instead.
When it is possible to break CAPTCHAs, they will just think of better CAPTCHA-like methods. OCR techniques are still improving, so more work will be done to make CAPTCHAs harder. That is, until OCR has become as good as the human eye at recognizing words...
An algorithm could be created, although it would be slow. With 26 lowercase and 26 uppercase letters and 10 digits, it should not be too difficult to come up with one. With Serif and Sans-serif fonts, the number of combinations would need to be doubled, though. Still, if you try to curve all letters in a similar way as the letter in the CAPTCHA, you should be able to detect the letter that overlaps the CAPTCHA letter the most. That would be the most likely candidate. You would still need to clear lines, dirt and other artefacts from the image, which the human eye has less trouble with than a computer does. You'd need the following steps:
3a. Determine the curve of the letter by checking the left side.
3b. Do an overlay of every possible letter/digit to find the one that covers it the best. (That's the most likely letter.)
Even though they can twist the letters in the CAPTCHAs, it should be possible to detect the twist or rotation that they used simply by looking at the left side of every letter and then trying to apply the same curve to every candidate letter. (52 combinations, plus 10 digits, if digits are also used.) Basically, you'd try to put a box around every letter, then check which candidate letter leaves the least amount of white space. That's the most likely letter.
The main reason why this isn't often used for OCR is basically the need for speed. Step 3a/b tends to be slow, especially if you have to take font style into consideration.
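As a rough illustration of step 3a, here is a minimal sketch that estimates a letter's slant from its left edge. It assumes the letter has already been cut out and binarized into rows of 0/1 values (1 = ink), and that the distortion is a plain horizontal shear; real CAPTCHAs bend letters in more complicated ways. The function names `left_edge_slant` and `deshear` are made up for this example.

```python
def left_edge_slant(glyph):
    """Estimate the shear (columns per row) from the left edge of a glyph.

    glyph: list of rows, each a list of 0/1 ints, where 1 means ink.
    Returns the least-squares slope of "leftmost ink column" against row index.
    """
    points = [(y, row.index(1)) for y, row in enumerate(glyph) if 1 in row]
    n = len(points)
    if n < 2:
        return 0.0
    mean_y = sum(y for y, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in points)
    cov = sum((y - mean_y) * (x - mean_x) for y, x in points)
    return cov / var_y


def deshear(glyph, slope):
    """Shift every row left to undo the estimated shear.

    Naive on purpose: ink pushed past the left border is dropped, so letters
    whose left edge is not straight (an "A", say) will be damaged.
    """
    shifts = [round(slope * y) for y in range(len(glyph))]
    base = min(shifts)
    out = []
    for row, shift in zip(glyph, shifts):
        s = shift - base
        out.append(row[s:] + [0] * s)
    return out


# A vertical bar that has been sheared one column per row.
bar = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
slope = left_edge_slant(bar)
straight = deshear(bar, slope)
```

Straightening the CAPTCHA letter once is the mirror image of bending all 52 candidates to match it, and probably cheaper, but either direction should work for the overlay in step 3b.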
Making this answer bigger but in reply to one of the comments:
There are several ways to clean up an image. You'd need some color filtering, noise reduction and an algorithm that's able to recognise the noisy lines through an image. The DEFCON slideshow that you've pointed to shows a few simple techniques to filter away some of the noise. It shows that a basic image processing tool can already make an image a lot clearer for a machine to read. A simple blur will clean up random dots and thin lines, while color filters would filter away the noisy colours. A next step would be to try to put a box around every letter in the CAPTCHA, hoping the system is able to recognise their locations. I don't know any practical algorithms for this, but there should be ways to recognise them. There's software that can create vector images from bitmaps, so there should be software that's able to calculate a box around a letter.
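To make the cleanup idea a bit more concrete, here is a small sketch in plain Python. It is not what the DEFCON slides do; it just assumes the image is a list of rows of RGB tuples and that the text colour is known. A 3x3 majority vote stands in for the "simple blur", since on a black-and-white mask it has a similar effect on stray dots and one-pixel lines. Both function names are invented for the example.

```python
def color_filter(image, target, tolerance):
    """Return a 0/1 mask: 1 where a pixel is within `tolerance` of `target`
    on every RGB channel, 0 elsewhere."""
    return [
        [1 if all(abs(c - t) <= tolerance for c, t in zip(px, target)) else 0
         for px in row]
        for row in image
    ]


def majority_filter(mask):
    """Keep a pixel as ink only if at least 5 of the 9 pixels in its 3x3
    neighbourhood are ink. Pixels outside the image count as background."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ink = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        ink += mask[yy][xx]
            out[y][x] = 1 if ink >= 5 else 0
    return out


# One dark pixel (text colour) next to one red pixel (noise colour).
mask = color_filter([[(10, 10, 10), (200, 0, 0)]], target=(0, 0, 0), tolerance=30)

# A single stray dot in an otherwise empty 5x5 mask disappears.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
cleaned = majority_filter(speck)
```

The same vote also rounds off the corners of real strokes, so it only makes sense when the letters are clearly thicker than the noise.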
It is likely that this box won't have right-angled corners, so you would have to distort all 52 letters to match the same box. Italic or bold shouldn't make much of a difference since these styles are just additional distortions. Serif or Sans-serif does make a difference, though. Serif fonts tend to have a few more spikes and ornaments. Fortunately, there are algorithms that can transform a box to any other figure with four corners.
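One such transform is plain bilinear interpolation between the four corners, sketched below. This is just one simple option (a projective transform would be another), and it runs in the opposite direction from the text: instead of bending 52 templates into the box, it pulls the CAPTCHA letter out of its four-cornered box into an upright rectangle, which should be equivalent for matching purposes. The corner order and the function names are assumptions of this example.

```python
def quad_point(corners, u, v):
    """Map (u, v) in the unit square to a point inside a quadrilateral.

    corners: (top_left, top_right, bottom_right, bottom_left), each (x, y).
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    top_x, top_y = x0 + (x1 - x0) * u, y0 + (y1 - y0) * u
    bot_x, bot_y = x3 + (x2 - x3) * u, y3 + (y2 - y3) * u
    return top_x + (bot_x - top_x) * v, top_y + (bot_y - top_y) * v


def rectify(image, corners, width, height):
    """Sample the quadrilateral into an upright width x height bitmap,
    using nearest-neighbour lookup; points outside the image read as 0."""
    out = []
    for j in range(height):
        v = j / (height - 1) if height > 1 else 0.0
        row = []
        for i in range(width):
            u = i / (width - 1) if width > 1 else 0.0
            x, y = quad_point(corners, u, v)
            xi, yi = int(round(x)), int(round(y))
            inside = 0 <= yi < len(image) and 0 <= xi < len(image[0])
            row.append(image[yi][xi] if inside else 0)
        out.append(row)
    return out


# A two-pixel-wide stroke leaning to the right, and the box around it.
leaning = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
box = ((0, 0), (1, 0), (4, 3), (3, 3))
upright = rectify(leaning, box, width=2, height=4)
```

Finding the four corners in the first place is the part this sketch does not attempt.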
Regular OCR applications will assume that letters are mostly straight and will just check a few hotspots to find a match. Thus, they sometimes get it wrong because of noise. To crack a CAPTCHA, you would need a more sensitive match, preferably "XOR-ing" the CAPTCHA letter image with an image of one of the 52 letters, then counting the number of black and white spots to calculate the ratio. Assuming white=1 and black=0, the result of the XOR should be almost entirely black (nearly all zeros) for the best match.
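The XOR comparison itself is only a few lines. The sketch below assumes the letter and every template have already been brought to the same size as 0/1 bitmaps; the tiny 3x3 "font" is invented purely to have something to match against, and a real set would cover all 52 letters (plus digits) at a sensible resolution.

```python
def xor_mismatch(a, b):
    """Fraction of pixels that differ between two equally sized 0/1 bitmaps.
    0.0 means identical; the XOR image would be completely black."""
    total = sum(len(row) for row in a)
    differing = sum(pa ^ pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return differing / total


def best_match(glyph, templates):
    """Return the name of the template with the smallest XOR mismatch."""
    return min(templates, key=lambda name: xor_mismatch(glyph, templates[name]))


# Hypothetical 3x3 templates, for illustration only.
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# An "L" with one extra noise pixel.
noisy = [[1, 0, 0], [1, 0, 1], [1, 1, 1]]
guess = best_match(noisy, templates)
```

Because every pixel is compared, this is more sensitive than checking a few hotspots, and it is also where the slowness comes from once fonts and distortions multiply the number of templates.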
I think several spammers have already found some useful algorithms to crack CAPTCHAs, but for them, keeping these algorithms a secret is what keeps them in business.
Another comment, more text. :-)
Segmentation would be a problem, but it's not impossible to solve. It's just extremely complex. But when you've cleaned the image, it should be possible to calculate two lines: one line that touches the bottom of every letter and a second line that touches the top. Good CAPTCHAs won't put letters on the same lines any more, but the not-so-good ones could be cracked by just following the lines. (Guess what? ReCAPTCHA puts letters between two lines!) With two lines, you know the first letter will start at the left, so you can try overlaying all 52 possibilities there until you've found a match. When you've found one, move to the right for the second one, and so on until you've read all the letters. With two lines to guide you, you don't need a complete box.
Letters tend to use a constant ratio between width and height. With two lines, you can calculate the height of the complete letter and thus get a good estimation of the matching width.
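A minimal sketch of the two-line idea, assuming the cleaned image is a 0/1 mask and that both guide lines are horizontal (which a rotated or wavy CAPTCHA would violate). The width-to-height ratio of 0.6 is a made-up placeholder, not a measured value, and `letter_windows` is a name invented here.

```python
def guide_lines(mask):
    """Return (top, bottom): the first and last row index that contains ink."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("no ink in image")
    return rows[0], rows[-1]


def letter_windows(mask, aspect=0.6):
    """Walk left to right and cut one window per letter.

    The window width is estimated as letter height * aspect, where the
    height is the distance between the two guide lines.
    Returns a list of (start_column, end_column) pairs, end exclusive.
    """
    top, bottom = guide_lines(mask)
    height = bottom - top + 1
    width = max(1, round(height * aspect))
    columns = len(mask[0])
    has_ink = [any(mask[y][x] for y in range(top, bottom + 1))
               for x in range(columns)]
    windows = []
    x = 0
    while x < columns:
        if has_ink[x]:
            windows.append((x, min(x + width, columns)))
            x += width  # jump past this letter
        else:
            x += 1
    return windows


# Two 3-pixel-tall blobs on a 5x9 canvas, at columns 1-2 and 5-6.
canvas = [[0] * 9 for _ in range(5)]
for y in (1, 2, 3):
    for x in (1, 2, 5, 6):
        canvas[y][x] = 1
windows = letter_windows(canvas)
```

Each window would then be handed to the overlay step. Letters that touch or overlap, which is exactly what better CAPTCHAs do, will defeat a fixed-width walk like this.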
Still, working out the correct algorithm to calculate all this is a bit too much for my poor math skills. You'd need an expert mathematician to work it out.
My answer to your question "are these problems really intractable even with lots of resource and time?" is to point out that this is the very reason that CAPTCHAs work.
My understanding is that the purpose of a CAPTCHA is to prove that you are human rather than a spam bot. reCAPTCHAs are a novel take on this theme because they use images of text that OCR (optical character recognition) engines could not resolve. The difference between a person and a machine in this instance is that specialized algorithms have tried to interpret this image and failed, while a "normal" person has the intrinsic ability to interpret the text in a consistently human way. That being said, in the future we hope that someone will come up with better OCR engines so that less human intervention is needed in digitizing the world's information. We hope that someone will come up with a tractable solution to this particular problem.
From your point of view of trying to make CAPTCHAs more accessible to blind people -- who still need to prove that they're people rather than spam bots -- the community needs to become aware of this issue and find a way to identify people in a less vision-centric way.
The introduction of CAPTCHA has certainly made the web less accessible to the visually impaired, and I agree with you in citing this as a significant problem that deserves more attention and concern. However, while CAPTCHA can be and has been inconsistently bypassed on popular web sites, I don't think this is a viable long-term solution for those in need. Indeed, the day that the CAPTCHA variants currently present on sites like Facebook, Google, MySpace etc. can be reliably and consistently broken is the day they will become obsolete and abandoned for either stronger variants of the same or an entirely new solution (as you implied, distinguishing cats from dogs in pictures has been a popular alternative trend).
When it comes to online accessibility, what I think those with disabilities need most right now is advocacy. The more people contact software companies, open source groups, and standards bodies and speak out about this need, the more awareness will be raised, and that will (hopefully) lead to more action on the part of the development community. Ultimately, it would be great to see sites like Google or Facebook offering alternative access methods just for their visually impaired users.
Idealism aside, I think it is productive to pursue other avenues like you mentioned with the CAPTCHA volunteer network, possibly even the development of something like OpenID for those with relevant disabilities as a universal form validation pass.
As for the technical aspect of your question, I don't think the availability of additional processing power alone will allow you to reliably and consistently break CAPTCHAs. There is A LOT of money in spam, and you can be sure that shady SEO companies and spammers alike have a great number of servers at their disposal. As Johannes Rössel mentioned, if you want to learn more about how this is done and where the technical difficulty lies, research Optical Character Recognition (OCR) and look at the wide variety of number/letter skewing that occurs on high-traffic sites.
This related SO question has a number of good ideas in it, including a DEFCON talk that claims using multiple OCRs and voting breaks many simple CAPTCHAs. This suggests a candidate solution method: distribute the problem over several servers, each of which runs one or more OCR tools in parallel, collect the results, and take the most popular answer. Comments welcome.
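A minimal sketch of the voting part, to show the shape of it. The OCR engines are represented by plain Python callables that take an image and return a string; they are placeholders, not real OCR libraries, and on separate servers the calls would go over the network rather than through a thread pool. Lower-casing the answers assumes the CAPTCHA is not case-sensitive.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def vote(candidates):
    """Return (answer, share) for the most common non-empty candidate.

    share is the fraction of non-empty candidates that agreed with the
    winner, which can serve as a crude confidence value.
    """
    cleaned = [c.strip().lower() for c in candidates if c and c.strip()]
    if not cleaned:
        return None, 0.0
    answer, count = Counter(cleaned).most_common(1)[0]
    return answer, count / len(cleaned)


def solve(image, engines):
    """Run every engine on the image in parallel and vote on the results.

    engines: hypothetical callables, each mapping an image to a string.
    """
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda engine: engine(image), engines))
    return vote(results)


# Stand-in engines that ignore the image and return canned answers.
fake_engines = [
    lambda img: "abc",
    lambda img: "ABC ",
    lambda img: "abd",
    lambda img: "",
]
answer, share = solve(None, fake_engines)
```

A low share would be a reasonable signal to give up on the automatic answer and fall back to a human volunteer.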