Crawling with an authenticated session in Scrapy


In my previous question, I wasn't very specific about my problem (scraping with an authenticated session in Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably rather have used the word "crawling".

So, here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if "Hi Herman" not in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i = MyItem()  # placeholder: assumes a scrapy Item subclass defined elsewhere
        i['url'] = response.url

        # ... do more things

        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function, which I tried to override in order to log in, now no longer makes the calls needed to crawl any further pages (I'm assuming). And I'm not sure how to go about saving the items that I create.

Has anyone done something like this before (authenticate, then crawl, using a CrawlSpider)? Any help would be appreciated.


Comments (5)

情徒 2024-11-11 22:16:52

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning about this in the CrawlSpider documentation: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.


Logging in before crawling:

In order to do some kind of initialisation before the spider starts crawling, you can use an InitSpider (which inherits from CrawlSpider) and override the init_request function. This function is called when the spider is initialising, before it starts crawling.

For the spider to begin crawling, you need to call self.initialized.

You can read the code that's responsible for this in Scrapy's source (it has helpful docstrings).


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page
        pass

Saving items:

Items your Spider returns are passed along to the Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html

If you have any problems or questions regarding Items, don't hesitate to open a new question and I'll do my best to help.
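
To make this concrete, here is a minimal sketch of such a pipeline (the class name, file name, and output path are illustrative placeholders, not something from this thread):

# pipelines.py -- minimal sketch: dump every scraped item to a JSON-lines file
import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # Called once for every item the spider returns/yields.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

The pipeline then has to be enabled via the ITEM_PIPELINES setting in settings.py (a plain list of class paths in the Scrapy 0.x versions this thread targets, a dict with ordering values in later releases).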


昇り龍 2024-11-11 22:16:52


In order for the above solution to work, I had to make CrawlSpider inherit from InitSpider (and no longer from BaseSpider) by making the following change to the Scrapy source code, in the file scrapy/contrib/spiders/crawl.py:

  1. add: from scrapy.contrib.spiders.init import InitSpider
  2. change class CrawlSpider(BaseSpider) to class CrawlSpider(InitSpider)

Otherwise the spider wouldn't call the init_request method.
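
For illustration, the edit described above amounts to the following two changes in scrapy/contrib/spiders/crawl.py (a sketch against the old scrapy.contrib code base, not a complete file):

# scrapy/contrib/spiders/crawl.py (Scrapy 0.x) -- sketch of the edit only
from scrapy.contrib.spiders.init import InitSpider  # 1. added import

class CrawlSpider(InitSpider):  # 2. was: class CrawlSpider(BaseSpider)
    # ... the rest of the original class body stays unchanged ...
    pass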

Is there any other easier way?

筱武穆 2024-11-11 22:16:52


If what you need is HTTP authentication, use the provided middleware hooks.

in settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
}

and in your spider class add the attributes

http_user = "user"
http_pass = "pass"
随风而去 2024-11-11 22:16:52


Just adding to Acorn's answer above.
Using his method, my script was not parsing the start_urls after the login.
It was exiting after a successful login in check_login_response.
I could see that I had the generator, though.
I needed to use

return self.initialized()

and then the parse function was called.
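
In other words (a sketch based on the example in the first answer), the end of check_login_response has to return the call rather than just make it:

def check_login_response(self, response):
    if "Hi Herman" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        # Returning (not merely calling) self.initialized() hands control
        # back to InitSpider so the start_urls actually get scheduled.
        return self.initialized()
    else:
        self.log("Login failed; the crawl will not start.")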

弃爱 2024-11-11 22:16:52


For me it was quite different. I saw that I was receiving an auth token and a user token as cookies from the website after my login.

That's why I looked for a quick and easy solution, which was simply to pass the cookies to the Request like this. Make sure that the cookies do not expire quickly; in my case they expire after a month. Once this request is made, I am automatically logged in to the page:

yield scrapy.Request(
  url=url, 
  callback=self.parse, 
  encoding='utf-8', 
  cookies=[
        {
          'name': 'token',
          'value': '7f2e791v27v7027ob28dgvw70dva0d76ad',
          'domain': '.test.website.com',
          'path': '/',
          'expires': '2024-07-05T10:30:50.200Z',
          'size': '105'},
        {
          'name': 'user',
          'value': 'zu32geubwq8wbe8e73vebqw0vewq7evwqe70',
          'domain': '.test.website.com',
          'path': '/',
          'expires': '2024-07-05T10:30:50.200Z',
          'size': '100'}
  ]
)