Using Scrapy to crawl the URLs within a webpage

I'm using Scrapy to extract data from certain websites. The problem is that my spider can only crawl the pages in its initial start_urls; it can't crawl the URLs found within those pages.
Here is the spider, copied exactly:

    from scrapy.spider import BaseSpider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.utils.response import get_base_url
    from scrapy.utils.url import urljoin_rfc
    from nextlink.items import NextlinkItem

    class Nextlink_Spider(BaseSpider):
        name = "Nextlink"
        allowed_domains = ["Nextlink"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//body/div[2]/div[3]/div/ul/li[2]/a/@href')
            for site in sites:
                relative_url = site.extract()
                url = self._urljoin(response, relative_url)
                yield Request(url, callback=self.parsetext)

        def parsetext(self, response):
            log = open("log.txt", "a")
            log.write("test if the parsetext is called")
            hxs = HtmlXPathSelector(response)
            items = []
            texts = hxs.select('//div').extract()
            for text in texts:
                item = NextlinkItem()
                item['text'] = text
                items.append(item)
                log = open("log.txt", "a")
                log.write(text)
            return items

        def _urljoin(self, response, url):
            """Helper to convert relative urls to absolute"""
            return urljoin_rfc(response.url, url, response.encoding)

I use log.txt to test whether parsetext is called. However, after I ran my spider, there was nothing in log.txt.

Comments (3)

以可爱出名 2024-12-17 12:06:48

See here:

http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html?highlight=allowed_domains#scrapy.spider.BaseSpider.allowed_domains

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.

So, as long as you didn't activate the OffsiteMiddleware in your settings, it doesn't matter and you can leave allowed_domains completely out.

Check settings.py to see whether the OffsiteMiddleware is activated. It shouldn't be activated if you want your spider to be able to crawl any domain.
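
As a rough sketch of that check, assuming the project layout from the question: disabling the middleware in settings.py (or simply deleting the allowed_domains line from the spider) stops requests from being filtered by domain. The dotted path below is the one Scrapy used in that era; newer releases ship it as scrapy.spidermiddlewares.offsite.OffsiteMiddleware.

    # settings.py -- turn the offsite filter off; assigning None to a
    # middleware's dotted path disables it.
    SPIDER_MIDDLEWARES = {
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': None,
    }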

你丑哭了我 2024-12-17 12:06:48

I think the problem is that you didn't tell Scrapy to follow each crawled URL. For my own blog I've implemented a CrawlSpider that uses LinkExtractor-based Rules to extract all relevant links from my blog pages:

# -*- coding: utf-8 -*-

'''
*   This program is free software: you can redistribute it and/or modify
*   it under the terms of the GNU General Public License as published by
*   the Free Software Foundation, either version 3 of the License, or
*   (at your option) any later version.
*
*   This program is distributed in the hope that it will be useful,
*   but WITHOUT ANY WARRANTY; without even the implied warranty of
*   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
*   GNU General Public License for more details.
*
*   You should have received a copy of the GNU General Public License
*   along with this program.  If not, see <http://www.gnu.org/licenses/>.
*
*   @author Marcel Lange <[email protected]>
*   @package ScrapyCrawler 
 '''


from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

import Crawler.settings
from Crawler.items import PageCrawlerItem


class SheldonSpider(CrawlSpider):
    name = Crawler.settings.CRAWLER_NAME
    allowed_domains = Crawler.settings.CRAWLER_DOMAINS
    start_urls = Crawler.settings.CRAWLER_START_URLS
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=Crawler.settings.CRAWLER_DOMAINS,
                allow=Crawler.settings.CRAWLER_ALLOW_REGEX,
                deny=Crawler.settings.CRAWLER_DENY_REGEX,
                restrict_css=Crawler.settings.CSS_SELECTORS,
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback='parse_item',
            process_links='filter_links'
        ),
    )

    # Filter links with the nofollow attribute
    def filter_links(self, links):
        return_links = list()
        if links:
            for link in links:
                if not link.nofollow:
                    return_links.append(link)
                else:
                    self.logger.debug('Dropped link %s because nofollow attribute was set.' % link.url)
        return return_links

    def parse_item(self, response):
        # self.logger.info('Parsed URL: %s with STATUS %s', response.url, response.status)
        item = PageCrawlerItem()
        item['status'] = response.status
        item['title'] = response.xpath('//title/text()')[0].extract()
        item['url'] = response.url
        item['headers'] = response.headers
        return item

On https://www.ask-sheldon.com/build-a-website-crawler-using-scrapy-framework/ I've described in detail how I implemented a website crawler to warm up my WordPress full-page cache.
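
Applied to the spider in the question, a minimal CrawlSpider sketch could look roughly like this (written against current Scrapy imports; the dmoz.org start URL comes from the question, while the parse_text callback and the yielded dict are hypothetical stand-ins for the original item logic):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class NextlinkSpider(CrawlSpider):
        name = "Nextlink"
        # A real registrable domain, not an arbitrary string like "Nextlink".
        allowed_domains = ["dmoz.org"]
        start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

        rules = (
            # Follow every in-domain link and hand each fetched page to parse_text.
            Rule(LinkExtractor(allow_domains=["dmoz.org"]),
                 callback="parse_text", follow=True),
        )

        def parse_text(self, response):
            # Use Scrapy's logger instead of hand-written writes to log.txt.
            self.logger.info("parse_text called for %s", response.url)
            yield {"url": response.url, "text": response.xpath("//div").getall()}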

生来就爱笑 2024-12-17 12:06:48

My guess would be this line:

allowed_domains = ["Nextlink"]

This isn't a domain like domain.tld, so the offsite filter would reject every link.
If you take the example from the documentation, it should be: allowed_domains = ["dmoz.org"]
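
One rough way to verify this guess: Scrapy's offsite filter logs the requests it drops, so with allowed_domains = ["Nextlink"] the crawl output should contain debug lines along the lines of "Filtered offsite request to ...", and they should disappear once the attribute points at a real domain. A small sketch of the corrected line plus optional logging settings (LOG_FILE and LOG_LEVEL are standard Scrapy settings):

    # In the spider from the question -- the corrected attribute:
    allowed_domains = ["dmoz.org"]

    # In settings.py -- optionally write Scrapy's own log to a file so the
    # offsite filter's debug messages (or their absence) are easy to check,
    # instead of relying on the hand-written log.txt.
    LOG_FILE = "scrapy.log"
    LOG_LEVEL = "DEBUG"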
