Scrapy start_urls

Published 2024-12-28 01:38:53

The script (below) from this tutorial contains two start_urls.

from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html
        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

But why does it scrape only these two web pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages within the dmoz.org domain! Why doesn't it scrape them too?


Comments (6)

寻梦旅人 2025-01-04 01:38:53

The start_urls class attribute contains the start URLs, nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback with [another] callback:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Spider(BaseSpider):

    name = 'my_spider'
    start_urls = ['http://www.domain.com/']
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse the main page and extract the category links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse a category page and extract the item links.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...

If you still want to customize how the start requests are created, override the BaseSpider.start_requests() method.
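
A minimal sketch of what overriding start_requests() might look like, using the same old-style Scrapy imports as above; the URLs and the pagination loop are placeholders, not part of the original answer:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'my_spider'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests yourself instead of relying on start_urls.
        # The pagination URLs below are placeholders.
        for page in range(1, 4):
            url = 'http://www.domain.com/category?page=%d' % page
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Handle each start page here.
        pass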

分开我的手 2025-01-04 01:38:53

start_urls contains the links from which the spider starts crawling.
If you want to crawl recursively, you should use CrawlSpider and define rules for it.
Look at http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
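
A rough sketch of a CrawlSpider for the dmoz pages from the question, assuming the old scrapy.contrib APIs; the allow pattern and the parse_item body are illustrative guesses, not taken from the answer or the tutorial:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DmozCrawlSpider(CrawlSpider):
    name = 'dmoz_crawl'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Follow every link under the Python category and pass each response
    # to parse_item. The allow pattern is only an illustration.
    rules = [
        Rule(SgmlLinkExtractor(allow=[r'/Computers/Programming/Languages/Python/']),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Extract items from each followed page here.
        self.log('Visited %s' % response.url)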

不顾 2025-01-04 01:38:53

The class does not have a rules attribute. Have a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.

萌梦深 2025-01-04 01:38:53

If you use BaseSpider, then inside the callback you have to extract the URLs you want yourself and return (or yield) Request objects for them.

If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them.
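
For comparison, the link extractor can also be driven by hand from a BaseSpider callback. This is only a sketch under the same old-API assumption; the allow pattern and the callback names are placeholders:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.spider import BaseSpider


class ManualLinksSpider(BaseSpider):
    name = 'manual_links'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/']

    # Placeholder extractor: keep only links whose URL matches the allow pattern.
    link_extractor = SgmlLinkExtractor(allow=[r'dmoz\.org'])

    def parse(self, response):
        # Extract the links ourselves and yield a Request for each of them.
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse_page)

    def parse_page(self, response):
        # Handle each followed page here.
        self.log('Scraping %s' % response.url)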

゛时过境迁 2025-01-04 01:38:53

If you use a rule to follow links (this is already implemented in Scrapy), the spider will scrape them too. I hope this helps...

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class Spider(CrawlSpider):
        name = 'my_spider'
        start_urls = ['http://www.domain.com/']
        allowed_domains = ['domain.com']
        # rules only take effect on CrawlSpider; follow=True tells the
        # spider to keep following the links the extractor finds.
        rules = [Rule(SgmlLinkExtractor(allow=[], deny=[]), follow=True)]

        ...

一念一轮回 2025-01-04 01:38:53

You didn't write a function to handle the URLs you want to get, so there are two ways to resolve this: 1. use rules (CrawlSpider); 2. write a function to handle the new URLs and set it as the callback of the requests you yield.
