Script to download CAPTCHA images
For completely non-nefarious purposes - machine learning specifically - I'd like to download a huge dataset of CAPTCHA images. However, CAPTCHAs are usually implemented with obfuscated JavaScript, which makes getting at the actual images without a browser a non-trivial task, at least for me as a JavaScript novice.
So, can anyone give me some helpful pointers on how to download the image of the obscured word using a script, completely outside of a browser? And please don't point me to a dataset of already collected obscured words - I need to collect the images from a specific website for this particular experiment.
Thanks!
Edit: Another way to ask this question is very simple. When you click "view source" on a website with complicated JavaScript, you see the script references, but that's all you see. However, if you click "save webpage as..." (in Firefox) and then view the source of the saved webpage, the JavaScript has been resolved and the new HTML and the images (at least in the case of ASIRRA and reCAPTCHA) are in the source. How can I mimic this "save webpage as..." behavior using a script? This is an important web coding question in general, so please stop questioning my motives! This is knowledge I can use from now on in all web development involving scripting, and I'm sure other Stack Overflow visitors can as well!
Comments (3)
While waiting for an answer here I kept digging and eventually figured out a somewhat hacky way of getting what I wanted.
First off, the reason this is a somewhat complicated problem (at least to a JavaScript novice like me) is that the images from ASIRRA are loaded onto the webpage via JavaScript, which runs on the client side. This is a problem when you download the webpage using something like wget or curl, because those tools don't actually run the JavaScript; they just download the source HTML. Therefore, you don't get the images.
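To make the difference concrete, here is a minimal Python sketch that lists the `<script>` and `<img>` sources found in a piece of HTML. The two HTML strings are hypothetical stand-ins I made up for illustration (they are not ASIRRA's real markup): one for what a tool like wget would receive, one for the same page after a browser has run its script.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the src attribute of every <script> and <img> tag."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if src is None:
            return
        if tag == "script":
            self.scripts.append(src)
        elif tag == "img":
            self.images.append(src)


def summarize(html):
    """Return (script sources, image sources) found in the given HTML."""
    collector = TagCollector()
    collector.feed(html)
    return collector.scripts, collector.images


# Hypothetical markup: roughly what wget or curl would receive.
STATIC_HTML = (
    '<html><body><div id="challenge"></div>'
    '<script src="/js/challenge.js"></script></body></html>'
)

# Hypothetical markup: the same page after a browser ran the script.
RENDERED_HTML = (
    '<html><body><div id="challenge">'
    '<img src="/img/a1.jpg"><img src="/img/b2.jpg"></div>'
    '<script src="/js/challenge.js"></script></body></html>'
)

if __name__ == "__main__":
    print(summarize(STATIC_HTML))
    print(summarize(RENDERED_HTML))
```

In the static version the image list comes back empty, because the `<img>` tags only exist once something has executed the script.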
However, I realized that Firefox's "Save Page As..." did exactly what I needed. It ran the JavaScript that loaded the images, and then it saved everything into the well-known directory structure on my hard drive. That's exactly what I wanted to automate. So... I found a Firefox add-on called "iMacros" and wrote this macro:
Set to loop 10,000 times, it worked perfectly. In fact, since it was always saving to the same folder, duplicate images were overwritten (which is what I wanted).
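For anyone who wants the same overwrite-on-duplicate effect from a script, here is a rough Python sketch. It is not the iMacros macro, and it assumes you already have the image URLs from a rendered page (something still has to run the JavaScript, for example a browser automation tool). The names `save_images`, `fetch_bytes` and `demo`, and the `example.com` URL, are all made up for illustration.

```python
import os
import tempfile
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return its body (real network access)."""
    with urlopen(url) as response:
        return response.read()


def save_images(image_srcs, base_url, out_dir, fetch=fetch_bytes):
    """Save each image under its URL basename inside out_dir.

    Writing to a fixed name means a repeated image replaces the
    earlier copy instead of piling up as a duplicate.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for src in image_srcs:
        url = urljoin(base_url, src)
        name = os.path.basename(urlsplit(url).path)
        if not name:
            continue  # skip URLs with no usable file name
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved


def demo():
    """Run two passes with a fake fetcher so no network is needed."""
    def fake_fetch(url):
        return url.encode("utf-8")

    out_dir = tempfile.mkdtemp()
    base = "http://example.com/page"  # hypothetical page URL
    save_images(["/img/a1.jpg", "/img/b2.jpg"], base, out_dir, fake_fetch)
    save_images(["/img/a1.jpg", "/img/c3.jpg"], base, out_dir, fake_fetch)
    return sorted(os.listdir(out_dir))
```

Note that this only deduplicates images that share a file name. If the site hands out a fresh name for the same picture each time, you would need to compare file contents instead.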
Why not just set up a CAPTCHA yourself and generate the images? reCAPTCHA is free too.
http://www.captcha.net/
Update: I see you want the images from a specific site, but if you run your own CAPTCHA you can tweak it to produce the same kind of images as the site you're targeting.
Get in contact with the people who run the site and ask for the dataset. If you try to download many images in any suspicious way, you'll end up on their kill list rather quickly, which means you won't get anything from them anymore.
CAPTCHAs are meant to protect people against abuse, and what you're doing will look like abuse from their point of view.