Scrapy CrawlSpider post-processing: computing averages

Posted on 2024-10-27 22:25:14


Let's say I have a crawl spider similar to this example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MyItem(Item):
    # Fields must be declared on the Item subclass before
    # item['...'] assignments will work; a bare Item() has none.
    id = Field()
    name = Field()
    description = Field()

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php',), deny=(r'subsection\.php',))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = MyItem()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item

Let's say I wanted to get some information like the sum of the IDs from each of the pages, or the average number of characters in the description across all of the parsed pages. How would I do it?

Also, how could I get averages for a particular category?


Answer by 懵少女, posted 2024-11-03 22:25:14:


You could use Scrapy's stats collector to build this kind of information or gather the necessary data to do so as you go. For per-category stats, you could use a per-category stats key.
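The accumulate-then-average pattern can be sketched like this. In a real spider the calls would go against `self.crawler.stats` (Scrapy's stats collector, whose `inc_value` creates counters on first use); here a plain dict stands in so the logic runs on its own, and key names such as `'ids/sum'` or `'category/<name>/chars'` are invented for illustration, not Scrapy built-ins:

```python
def inc_value(stats, key, count=1):
    # Mirrors StatsCollector.inc_value: create the key on first use.
    stats[key] = stats.get(key, 0) + count

def record_item(stats, item_id, description, category=None):
    # Call once per scraped item, e.g. at the end of parse_item().
    inc_value(stats, 'items/count')
    inc_value(stats, 'ids/sum', int(item_id))
    inc_value(stats, 'description/chars', len(description))
    if category is not None:
        # Per-category keys make per-category averages possible later.
        inc_value(stats, 'category/%s/items' % category)
        inc_value(stats, 'category/%s/chars' % category, len(description))

def average_description_length(stats, category=None):
    # Typically computed in the spider's closed() callback or an extension,
    # once all items have been recorded.
    if category is None:
        chars = stats.get('description/chars', 0)
        items = stats.get('items/count', 0)
    else:
        chars = stats.get('category/%s/chars' % category, 0)
        items = stats.get('category/%s/items' % category, 0)
    return chars / items if items else 0.0
```

Since the stats collector only stores what you put into it, storing running sums and counts (rather than the average itself) lets you derive any average, overall or per category, once the crawl finishes.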

For a quick dump of all stats gathered during a crawl, you can add STATS_DUMP = True to your settings.py.
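As a minimal settings fragment (assuming an otherwise standard `settings.py`):

```python
# settings.py
# Log all collected stats when the spider finishes.
STATS_DUMP = True
```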

Redis (via redis-py) is also a great option for stats collection.
