Scrapy - How to manage cookies/sessions



I'm a bit confused as to how cookies work with Scrapy, and how you manage those cookies.

This is basically a simplified version of what I'm trying to do:


The way the website works:

When you visit the website you get a session cookie.

When you make a search, the website remembers what you searched for, so when you do something like going to the next page of results, it knows the search it is dealing with.


My script:

My spider has a start url of searchpage_url

The searchpage is requested by parse() and the search form response gets passed to search_generator()

search_generator() then yields lots of search requests using FormRequest and the search form response.

Each of those FormRequests, and the subsequent child requests, needs to have its own session, so each needs its own individual cookiejar and its own session cookie.
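
For concreteness, a rough sketch of the flow described above (the URL, form field and search terms are placeholders, not the real site):

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"
    start_urls = ["http://example.com/search"]  # placeholder for searchpage_url

    def parse(self, response):
        # hand the search page response on to search_generator()
        return self.search_generator(response)

    def search_generator(self, response):
        # one FormRequest per search; each of these (and its child requests)
        # is meant to end up with its own session / cookiejar
        for term in ["foo", "bar", "baz"]:  # placeholder search terms
            yield scrapy.FormRequest.from_response(
                response,
                formdata={"q": term},  # placeholder form field
                callback=self.parse_results,
            )

    def parse_results(self, response):
        # extract results, follow "next page" links, etc.
        pass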


I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?

If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned? Is it possible to make only the first request generator spawn new spiders and make sure that from then on only that spider deals with future requests?

I assume I have to disable multiple concurrent requests... otherwise one spider would be making multiple searches under the same session cookie, and future requests would only relate to the most recent search made?

I'm confused, any clarification would be gratefully received!


EDIT:

Another option I've just thought of is managing the session cookie completely manually, and passing it from one request to the next.

I suppose that would mean disabling cookies... and then grabbing the session cookie from the search response, and passing it along to each subsequent request.

Is this what you should do in this situation?
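
Roughly, that manual approach might look something like the following sketch (assuming cookies are disabled in the settings; the callback name and next-page URL are hypothetical):

def after_search(self, response):
    # with COOKIES_ENABLED = False, Set-Cookie is not handled for you,
    # so pull the session cookie out of the response headers by hand
    set_cookie = response.headers.getlist('Set-Cookie')
    session_cookie = set_cookie[0].split(b';')[0] if set_cookie else b''
    yield scrapy.Request(
        response.urljoin('?page=2'),              # hypothetical next results page
        headers={'Cookie': session_cookie},       # re-attach the session cookie manually
        meta={'session_cookie': session_cookie},  # pass it along for later requests
        callback=self.parse_results,
    )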

Comments (6)

孤芳又自赏 2024-10-24 04:12:46


Three years later, I think this is exactly what you were looking for:
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

Just use something like this in your spider's start_requests method:

for i, url in enumerate(urls):
    yield scrapy.Request(url, meta={'cookiejar': i},
        callback=self.parse_page)

And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)
红尘作伴 2024-10-24 04:12:46
from scrapy.http.cookies import CookieJar
...

class Spider(BaseSpider):
    def parse(self, response):
        '''Parse category page, extract subcategories links.'''

        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select(".../@href")
        for subcategorySearchLink in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
            self.log('Found subcategory link: ' + subcategorySearchLink, log.DEBUG)
            yield Request(subcategorySearchLink, callback = self.extractItemLinks,
                          meta = {'dont_merge_cookies': True})
            '''Use dont_merge_cookies to force site generate new PHPSESSID cookie.
            This is needed because the site uses sessions to remember the search parameters.'''

    def extractItemLinks(self, response):
        '''Extract item links from subcategory page and go to next page.'''
        hxs = HtmlXPathSelector(response)
        for itemLink in hxs.select(".../a/@href"):
            itemLink = urlparse.urljoin(response.url, itemLink)
            print 'Requesting item page %s' % itemLink
            yield Request(...)

        nextPageLink = self.getFirst(".../@href", hxs)
        if nextPageLink:
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
            cookieJar.extract_cookies(response, response.request)
            request = Request(nextPageLink, callback = self.extractItemLinks,
                          meta = {'dont_merge_cookies': True, 'cookie_jar': cookieJar})
            cookieJar.add_cookie_header(request) # apply Set-Cookie ourselves
            yield request
        else:
            self.log('Whole subcategory scraped.', log.DEBUG)
谁的新欢旧爱 2024-10-24 04:12:46
def parse(self, response):
    # do something
    yield scrapy.Request(
        url= "http://new-page-to-parse.com/page/4/",
        cookies= {
            'h0':'blah',
            'taeyeon':'pretty'
        },
        callback= self.parse
    )
巡山小妖精 2024-10-24 04:12:46


Scrapy has a downloader middleware, CookiesMiddleware, that implements cookie support. You just need to enable it. It mimics how the cookie jar in a browser works.

  • When a request goes through CookiesMiddleware, it reads the cookies for that domain and sets them on the Cookie header.
  • When a response returns, CookiesMiddleware reads the cookies the server sent in the Set-Cookie response header and saves/merges them into the cookiejar kept on the middleware.
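
For reference, cookie support is on by default; the relevant settings look like this:

# settings.py
COOKIES_ENABLED = True   # default; set to False to switch CookiesMiddleware off
COOKIES_DEBUG = True     # optional: log the Cookie / Set-Cookie headers of each request/response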

I've seen the section of the docs that talks about a meta option that stops cookies from being merged. What does that actually mean? Does it mean the spider that makes the request will have its own cookiejar for the rest of its life?

If the cookies are then on a per Spider level, then how does it work when multiple spiders are spawned?

Every spider has its own downloader middlewares, so spiders have separate cookiejars.

Normally, all requests from one spider share one cookiejar, but CookiesMiddleware has options to customize this behavior:

  • Request.meta["dont_merge_cookies"] = True tells the middleware that this particular request should not read cookies from the cookiejar, and that the Set-Cookie headers in its response should not be merged into the cookiejar. It is a request-level switch.
  • CookiesMiddleware supports multiple cookiejars. You control which cookiejar to use at the request level: Request.meta["cookiejar"] = custom_cookiejar_name.

Please see the docs and the related source code of CookiesMiddleware.

迷乱花海 2024-10-24 04:12:46


I think the simplest approach would be to run multiple instances of the same spider using the search query as a spider argument (that would be received in the constructor), in order to reuse the cookies management feature of Scrapy. So you'll have multiple spider instances, each one crawling one specific search query and its results. But you need to run the spiders yourself with:

scrapy crawl myspider -a search_query=something

Or you can use Scrapyd for running all the spiders through the JSON API.
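
A minimal sketch of such a spider (the search URL and form field are placeholders):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, search_query=None, *args, **kwargs):
        # received from the command line: scrapy crawl myspider -a search_query=something
        super().__init__(*args, **kwargs)
        self.search_query = search_query

    def start_requests(self):
        yield scrapy.FormRequest(
            "http://example.com/search",        # placeholder search URL
            formdata={"q": self.search_query},  # placeholder form field
            callback=self.parse,
        )

    def parse(self, response):
        # each spider run keeps its own cookiejar, so every
        # search query gets its own session cookie
        pass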

(り薆情海 2024-10-24 04:12:46


There are a couple of Scrapy extensions that provide a bit more functionality to work with sessions:

  1. scrapy-sessions allows you to attach statically defined profiles (proxy and User-Agent) to your sessions, process cookies and rotate profiles on demand
  2. scrapy-dynamic-sessions is almost the same, but allows you to randomly pick a proxy and User-Agent and handles retrying requests after any error