Scrapy Crawl URLs in Order

Published 2024-11-18 10:50:23

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

Comments (13)

喵星人汪星人 2024-11-25 10:50:24

The Google group discussion suggests using the priority attribute on the Request object.
Scrapy guarantees the URLs are crawled in DFO by default, but it does not ensure that the URLs are visited in the order they were yielded within your parse callback.

Instead of yielding Request objects, you want to return a list of Requests from which objects will be popped until it is empty.

Can you try something like this?

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]

   def start_requests(self):
       start_urls = reversed( [
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
           "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
       ] )

       return [ Request(url = start_url) for start_url in start_urls ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items
晨曦慕雪 2024-11-25 10:50:24

There is a much easier way to make Scrapy follow the order of start_urls: just uncomment the concurrent-requests setting in settings.py and change it to 1.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
扶醉桌前 2024-11-25 10:50:24

I doubt it's possible to achieve what you want unless you play with Scrapy internals. There are some similar discussions in the Scrapy Google group, e.g.

http://groups.google.com/group/scrapy-users/browse_thread/thread/25da0a888ac19a9/1f72594b6db059f4?lnk=gst

One thing that can also help is setting CONCURRENT_REQUESTS_PER_SPIDER to 1, but it won't completely ensure the order either, because the downloader has its own local queue for performance reasons, so the best you can do is prioritize the requests but not guarantee their exact order.

献世佛 2024-11-25 10:50:24

The solution is sequential.
This solution is similar to @wuliang's.

I started with @Alexis de Tréglodé's method but ran into a problem:
the fact that your start_requests() method returns a list of URLs,
return [ Request(url = start_url) for start_url in start_urls ],
causes the output to be non-sequential (asynchronous).

If the return is a single Request instead, the requirement can be fulfilled by keeping the remaining URLs in an alternative other_urls list. other_urls can also be used to add in URLs scraped from other webpages.

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from practice.items import MlboddsItem

log.start()

class PracticeSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]

    other_urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
           ]

    def start_requests(self):
        log.msg('Starting Crawl!', level=log.INFO)
        start_urls = "http://www.sbrforum.com/mlb-baseball/odds-scores/20110327/"
        return [Request(start_urls, meta={'items': []})]

    def parse(self, response):
        log.msg("Begin Parsing", level=log.INFO)
        log.msg("Response from: %s" % response.url, level=log.INFO)
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//*[@id='moduleData8460']")
        items = response.meta['items']
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text()').extract()
            items.append(item)

        # here we .pop(0) the next URL in line
        if self.other_urls:
            return Request(self.other_urls.pop(0), meta={'items': items})

        return items
私藏温柔 2024-11-25 10:50:24

Add this to your settings:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue' 
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
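
For reference, a minimal settings.py sketch combining these FIFO queues with the CONCURRENT_REQUESTS = 1 suggestion from the earlier answer (treat this as a sketch and verify the setting names against your Scrapy version):

# settings.py -- process one request at a time and dispatch requests in the
# order they were scheduled (FIFO) instead of the default LIFO queues
CONCURRENT_REQUESTS = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'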
日暮斜阳 2024-11-25 10:50:24

Disclaimer: I haven't worked with Scrapy specifically.

The scraper may be queueing and requeueing requests based on timeouts and HTTP errors; it would be a lot easier if you could get at the date from the response page.

I.e. add another hxs.select statement that grabs the date (I just had a look, and it is definitely in the response data), add that to the item dict, and sort the items based on that.

This is probably a more robust approach than relying on the order of scrapes...
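
A rough sketch of that idea, assuming MlboddsItem declares a date field (the field name and the URL-based extraction below are illustrative, not from the original answer; the date could equally be grabbed from the page with another XPath):

import re

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem  # assumes MlboddsItem defines a 'date' Field

class DateStampedSpider(BaseSpider):
    name = "sbrforum.com.dated"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]'):
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()
            # the date is also part of the URL (.../odds-scores/20110328/), so stamp it
            # on each item; a pipeline or post-processing step can then sort by it
            item['date'] = re.search(r'/(\d{8})/', response.url).group(1)
            yield item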

北城孤痞 2024-11-25 10:50:24

Of course, you can control it.
The top secret is how to feed the greedy Engine/Scheduler. Your requirement is just a little one. Please see that I add a list named "task_urls".

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from dirbot.items import Website

class DmozSpider(BaseSpider):
   name = "dmoz"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
   ]
   task_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]
   def parse(self, response): 

       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = Website()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       # Here we feed a new request
       self.task_urls.remove(response.url)
       if self.task_urls:
           r = Request(url=self.task_urls[0], callback=self.parse)
           items.append(r)

       return items

If you want a more complicated example, please see my project:
https://github.com/wuliang/TiebaPostGrabber

祁梦 2024-11-25 10:50:24

I know this is an old question but I struggled with this problem today and was not completely satisfied with the solutions I found in this thread. Here's how I handled it.

the spider:

import scrapy

class MySpider(scrapy.Spider):

    name = "mySpider"

    start_urls = None

    def parse(self, response):
        #your parsing code goes here

    def __init__(self, urls):
        self.start_urls = urls

and the spider runner:

from twisted.internet import reactor, defer
import spiders.mySpider as ms
from scrapy.crawler import CrawlerRunner

urls = [
    'http://address1.com',
    'http://address2.com',
    'http://address3.com'
   ]

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    for url in urls:
        yield runner.crawl(ms.MySpider, urls = [url])
    reactor.stop()

crawl()
reactor.run()

This code calls the spider with a URL from the list passed as a parameter, and then waits until it is finished before calling the spider again with the next URL.

东风软 2024-11-25 10:50:24

I believe the

hxs.select('...')

calls you make will scrape the data from the site in the order it appears. Either that, or Scrapy is going through your start_urls in an arbitrary order. To force it to go through them in a predefined order (and mind you, this won't work if you need to crawl more sites), you can try this:

start_urls = ["url1.html"]

# note: with start_urls, Scrapy's default callback is parse(), so either name this
# method parse or point a Request/start_requests at it explicitly
def parse1(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('blah')
    items = []
    for site in sites:
        item = MlboddsItem()
        item['header'] = site.select('blah')
        item['game1'] = site.select('blah')
        items.append(item)
    # list.append() returns None, so append the next Request first, then return items
    items.append(Request('url2.html', callback=self.parse2))
    return items

then write a parse2 that does the same thing but appends a Request for url3.html with callback=self.parse3. This is horrible coding style, but I'm just throwing it out in case you need a quick hack.
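
A sketch of what that parse2 might look like, keeping the answer's placeholder selectors ('blah') and its url3.html/parse3 naming:

def parse2(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    for site in hxs.select('blah'):
        item = MlboddsItem()
        item['header'] = site.select('blah')
        item['game1'] = site.select('blah')
        items.append(item)
    # chain to the next page in the hard-coded order
    items.append(Request('url3.html', callback=self.parse3))
    return items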

山色无中 2024-11-25 10:50:24

Personally, I like @user1460015's implementation after I managed to come up with my own workaround.

My solution is to use Python's subprocess module to call scrapy url by url until all URLs have been taken care of.

In my code, if the user does not specify that he/she wants to parse the URLs sequentially, we can start the spider in the normal way.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; \
    MSIE 7.0; Windows NT 5.1)'})
process.crawl(Spider, url = args.url)
process.start()

If a user specifies it needs to be done sequentially, we can do this:

import subprocess

for url in urls:
    # list-form arguments avoid needing shell=True on non-Windows platforms
    process = subprocess.Popen(
        ['scrapy', 'runspider', 'scrapper.py', '-a', 'url=' + url, '-o', outputfile])
    process.wait()

Note: this implementation does not handle errors.

无人接听 2024-11-25 10:50:24

Most of the answers suggest passing the URLs one by one or limiting the concurrency to 1, which will slow you down significantly if you're scraping multiple URLs.

When I faced this same problem, my solution was to use the callback arguments (cb_kwargs) to store the scraped data from all the URLs and sort it using the order of the initial URLs, then return all the scraped data in order at once, something like this:

import scrapy

class MLBoddsSpider(scrapy.Spider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    to_scrape_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def start_requests(self):
        data = {}
        for url in self.to_scrape_urls:
            # pass the same dict to every request so all callbacks share it
            yield scrapy.Request(url, self.parse, cb_kwargs={'data': data})

    def parse(self, response, data):
        # scrape the data into the shared dict ('myData' is a placeholder selector)
        data[response.url] = {'url': response.url, 'myData': response.css('myData').get()}

        # once every url has been scraped, emit the items in the original order
        if len(data) == len(self.to_scrape_urls):
            return [data[url] for url in self.to_scrape_urls]
随心而道 2024-11-25 10:50:23

Scrapy's Request has a priority attribute now.

If you have many Requests in a function and want to process a particular request first, you can set:

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will process the one with priority=1 first.

贩梦商人 2024-11-25 10:50:23

start_urls defines the URLs used by the start_requests method. Your parse method is called with a response for each start URL when its page is downloaded. But you cannot control loading times - the first start URL might be the last to reach parse.

A solution -- override the start_requests method and add a meta with a priority key to the generated requests. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you need these URLs to be processed in this order.)

Or make it kind of synchronous -- store these start URLs somewhere. Put the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
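
A minimal sketch of the priority-plus-meta idea described above (the spider name, example URLs, and the 'index' meta key are illustrative, not from the original answer; a pipeline would then sort or group items by that index):

import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered"
    urls = [
        "http://example.com/page1",
        "http://example.com/page2",
        "http://example.com/page3",
    ]

    def start_requests(self):
        for index, url in enumerate(self.urls):
            # higher priority values are scheduled earlier, so the first URL
            # gets the highest priority; meta keeps the original position
            yield scrapy.Request(url,
                                 meta={'index': index},
                                 priority=len(self.urls) - index)

    def parse(self, response):
        yield {
            'index': response.meta['index'],  # a pipeline can sort on this
            'url': response.url,
            # ... extract the real fields here ...
        }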
