Captchas in Scrapy
I'm working on a Scrapy app where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.

My question is: how can I restart the spider to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline and displayed to the user. I'm unclear how I can resume the spider's progress and pass the solved captcha and the same session back to the spider, since I believe the spider has to return the item (i.e. quit) before the ImagesPipeline goes to work.

I've looked through the docs and examples, but I haven't found any that make it clear how to accomplish this.
This is how you might get it to work inside the spider.

Once you get the request, pause the engine, display the image, read the input from the user, and resume the crawl by submitting a POST request for the login.

I'd be interested to know whether the approach works for your case.
I would not create an Item and use the ImagePipeline.

...

What I do here is import urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the captcha). The Scrapy infrastructure is not used at all in this "hack", because solving a captcha like this is always a hack. The whole calling-an-external-subprocess thing could be done more nicely, but it works.

On some sites it's not possible to save the captcha image, and you have to call up the page in a browser, invoke a screen-capture utility, and crop at an exact location to "cut out" the captcha. Now that is screen scraping.
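The download/solve/cleanup flow above can be sketched as a small helper. The solver command is hypothetical: substitute whatever external utility you use (e.g. an OCR tool); the helper just feeds it the downloaded image path:

```python
# Sketch of the "no pipeline" approach: fetch the captcha image directly,
# hand it to an external solver, then delete the file.
import os
import subprocess
import urllib.request


def solve_captcha(image_url, solver_cmd):
    """Download the captcha image, run an external solver on it, remove the file.

    solver_cmd is a list like ["my-captcha-tool"] (hypothetical); the image
    path is appended as its last argument.
    """
    # urlretrieve stores the image locally and returns the local path.
    path, _headers = urllib.request.urlretrieve(image_url)
    try:
        # Call the external command-line utility and capture its answer.
        answer = subprocess.check_output(solver_cmd + [path]).decode().strip()
    finally:
        os.remove(path)  # delete the image so old captchas don't pile up
    return answer
```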