Scrapy CrawlSpider post-processing: computing averages
Let's say I have a crawl spider similar to this example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
Let's say I wanted to get some information like the sum of the IDs from each of the pages, or the average number of characters in the description across all of the parsed pages. How would I do it?
Also, how could I get averages for a particular category?
You could use Scrapy's stats collector to build this kind of information or gather the necessary data to do so as you go. For per-category stats, you could use a per-category stats key.
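A minimal sketch of that idea, assuming a recent Scrapy where the stats collector is exposed as self.crawler.stats; the key names and the way the category is read off the page are illustrative, not part of the original example:

    def parse_item(self, response):
        item_id = int(response.xpath('//td[@id="item_id"]/text()').re_first(r'ID: (\d+)') or 0)
        description = response.xpath('//td[@id="item_description"]/text()').get() or ''
        # Hypothetical category field; adjust to however the page exposes it.
        category = response.xpath('//td[@id="item_category"]/text()').get() or 'unknown'

        stats = self.crawler.stats
        # Running totals across all parsed item pages.
        stats.inc_value('items/id_sum', item_id)
        stats.inc_value('items/description_chars', len(description))
        stats.inc_value('items/count')
        # Per-category totals, keyed by the category name.
        stats.inc_value('items/%s/description_chars' % category, len(description))
        stats.inc_value('items/%s/count' % category)

    def closed(self, reason):
        # Called when the crawl finishes; turn the running totals into an average.
        stats = self.crawler.stats
        count = stats.get_value('items/count', 0)
        if count:
            stats.set_value('items/avg_description_chars',
                            stats.get_value('items/description_chars', 0) / float(count))

The computed values then show up in the normal stats dump at the end of the crawl, alongside Scrapy's built-in counters.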
For a quick dump of all stats gathered during a crawl, you can add STATS_DUMP = True to your settings.py. Redis (via redis-py) is also a great option for stats collection.
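If you go the Redis route, the same counters can live in Redis instead, which lets them survive restarts and be shared by several crawl processes. A rough sketch with redis-py; the connection details and key names are assumptions:

    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    def record_item(item_id, description, category):
        # Atomic increments, so concurrent spiders can update the same counters.
        r.incrby('items:id_sum', item_id)
        r.incrby('items:description_chars', len(description))
        r.incr('items:count')
        r.incrby('items:%s:description_chars' % category, len(description))
        r.incr('items:%s:count' % category)

    def average_description_length(category=None):
        # Works for the overall average or for a single category.
        prefix = 'items:%s' % category if category else 'items'
        count = int(r.get(prefix + ':count') or 0)
        chars = int(r.get(prefix + ':description_chars') or 0)
        return chars / float(count) if count else 0.0

You would call record_item() from parse_item (or from an item pipeline) and query the averages whenever you like, even while the crawl is still running.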