Captchas in Scrapy
I'm working on a Scrapy app where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.

My question is: how can I restart the spider to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline and displayed to the user. I'm unclear how I can resume the spider's progress and pass the solved captcha and the same session back to the spider, since I believe the spider has to return the item (i.e. quit) before the ImagesPipeline goes to work.

I've looked through the docs and examples, but I haven't found any that make it clear how to accomplish this.
This is how you might get it to work inside the spider.

Once you get the request, pause the engine, display the image, read the input from the user, and resume the crawl by submitting a POST request for the login.

I'd be interested to know whether the approach works for your case.
I would not create an Item and use the ImagePipeline.

...

What I do here is import urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility that solves the captcha). The Scrapy infrastructure is not used at all in this "hack", because solving a captcha like this is always a hack. The whole calling-an-external-subprocess thing could be done more nicely, but it works.

On some sites it's not possible to save the captcha image, and you have to call up the page in a browser, invoke a screen-capture utility, and crop at an exact location to "cut out" the captcha. Now that is screen scraping.
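The download/solve/cleanup flow above can be sketched as a small helper. The solver command is hypothetical: substitute whatever external utility you use (e.g. an OCR tool); the helper just feeds it the downloaded image path:

```python
# Sketch of the "no pipeline" approach: fetch the captcha image directly,
# hand it to an external solver, then delete the file.
import os
import subprocess
import urllib.request


def solve_captcha(image_url, solver_cmd):
    """Download the captcha image, run an external solver on it, remove the file.

    solver_cmd is a list like ["my-captcha-tool"] (hypothetical); the image
    path is appended as its last argument.
    """
    # urlretrieve stores the image locally and returns the local path.
    path, _headers = urllib.request.urlretrieve(image_url)
    try:
        # Call the external command-line utility and capture its answer.
        answer = subprocess.check_output(solver_cmd + [path]).decode().strip()
    finally:
        os.remove(path)  # delete the image so old captchas don't pile up
    return answer
```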