
17.3 Writing the Spider Module

Published 2024-01-26 22:39:51

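In Section 17.1 we used the genspider command to create a spider template based on the CrawlSpider class, named YunqiQqComSpider. The next step is page parsing, which is handled by two methods. The parse_book_list method parses the book list shown in Figure 17-1 and extracts the basic information of each novel. The parse_book_detail method parses the page shown in Figure 17-3 and extracts data such as the novel's click counts and popularity. Pagination links are extracted by a rule defined in rules; they generally follow the pattern "/bk/so2/n30p\d+". The complete code of YunqiQqComSpider is as follows: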

  import scrapy
  from scrapy.linkextractors import LinkExtractor
  from scrapy.spiders import CrawlSpider, Rule

  # Assumption: the item classes live in the project's items.py;
  # adjust the import path to match your project layout.
  from ..items import YunqiBookDetailItem, YunqiBookListItem


  class YunqiQqComSpider(CrawlSpider):
      name = 'yunqi.qq.com'
      allowed_domains = ['yunqi.qq.com']
      start_urls = ['http://yunqi.qq.com/bk/so2/n30p1']

      # Follow pagination links and pass every list page to parse_book_list.
      rules = (
          Rule(LinkExtractor(allow=r'/bk/so2/n30p\d+'),
               callback='parse_book_list', follow=True),
      )

      def parse_book_list(self, response):
          # Each novel on the list page sits in its own <div class="book">.
          books = response.xpath(".//div[@class='book']")
          for book in books:
              novelImageUrl = book.xpath("./a/img/@src").extract_first()
              novelId = book.xpath(
                  "./div[@class='book_info']/h3/a/@id").extract_first()
              novelName = book.xpath(
                  "./div[@class='book_info']/h3/a/text()").extract_first()
              novelLink = book.xpath(
                  "./div[@class='book_info']/h3/a/@href").extract_first()
              novelInfos = book.xpath(
                  "./div[@class='book_info']/dl/dd[@class='w_auth']")
              # Five <dd> entries are expected; otherwise fall back to defaults.
              if len(novelInfos) > 4:
                  novelAuthor = novelInfos[0].xpath('./a/text()').extract_first()
                  novelType = novelInfos[1].xpath('./a/text()').extract_first()
                  novelStatus = novelInfos[2].xpath('./text()').extract_first()
                  novelUpdateTime = novelInfos[3].xpath('./text()').extract_first()
                  novelWords = novelInfos[4].xpath('./text()').extract_first()
              else:
                  novelAuthor = ''
                  novelType = ''
                  novelStatus = ''
                  novelUpdateTime = ''
                  novelWords = 0
              bookListItem = YunqiBookListItem(
                  novelId=novelId, novelName=novelName,
                  novelLink=novelLink, novelAuthor=novelAuthor,
                  novelType=novelType, novelStatus=novelStatus,
                  novelUpdateTime=novelUpdateTime, novelWords=novelWords,
                  novelImageUrl=novelImageUrl)
              yield bookListItem

              # Request the detail page and carry novelId along in meta.
              request = scrapy.Request(url=novelLink,
                                       callback=self.parse_book_detail)
              request.meta['novelId'] = novelId
              yield request

      def parse_book_detail(self, response):
          # from scrapy.shell import inspect_response
          # inspect_response(response, self)
          novelId = response.meta['novelId']
          novelLabel = response.xpath(
              "//div[@class='tags']/text()").extract_first()

          # Rows 2-4 of the table hold the total, monthly and weekly figures;
          # columns 1-3 hold clicks, popularity and recommendations.
          novelAllClick = response.xpath(
              ".//*[@id='novelInfo']/table/tr[2]/td[1]/text()").extract_first()
          novelAllPopular = response.xpath(
              ".//*[@id='novelInfo']/table/tr[2]/td[2]/text()").extract_first()
          novelAllComm = response.xpath(
              ".//*[@id='novelInfo']/table/tr[2]/td[3]/text()").extract_first()

          novelMonthClick = response.xpath(
              ".//*[@id='novelInfo']/table/tr[3]/td[1]/text()").extract_first()
          novelMonthPopular = response.xpath(
              ".//*[@id='novelInfo']/table/tr[3]/td[2]/text()").extract_first()
          novelMonthComm = response.xpath(
              ".//*[@id='novelInfo']/table/tr[3]/td[3]/text()").extract_first()

          novelWeekClick = response.xpath(
              ".//*[@id='novelInfo']/table/tr[4]/td[1]/text()").extract_first()
          novelWeekPopular = response.xpath(
              ".//*[@id='novelInfo']/table/tr[4]/td[2]/text()").extract_first()
          novelWeekComm = response.xpath(
              ".//*[@id='novelInfo']/table/tr[4]/td[3]/text()").extract_first()

          novelCommentNum = response.xpath(
              ".//*[@id='novelInfo_commentCount']/text()").extract_first()
          bookDetailItem = YunqiBookDetailItem(
              novelId=novelId, novelLabel=novelLabel,
              novelAllClick=novelAllClick, novelAllPopular=novelAllPopular,
              novelAllComm=novelAllComm, novelMonthClick=novelMonthClick,
              novelMonthPopular=novelMonthPopular,
              novelMonthComm=novelMonthComm, novelWeekClick=novelWeekClick,
              novelWeekPopular=novelWeekPopular,
              novelWeekComm=novelWeekComm, novelCommentNum=novelCommentNum)
          yield bookDetailItem
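
The pagination pattern used in rules can be checked on its own before running the spider. The sketch below uses only the standard re module and assumes that LinkExtractor applies the allow pattern as a search over the absolute URL. The helper name is_page_link and every sample URL except the first (the spider's start URL) are made up for illustration.

```python
import re

# Pattern copied from the LinkExtractor rule in the spider.
PAGE_LINK = re.compile(r'/bk/so2/n30p\d+')


def is_page_link(url):
    """Return True if the URL contains a pagination path such as /bk/so2/n30p2."""
    # Assumption: a search-style match, i.e. the pattern may occur anywhere in the URL.
    return PAGE_LINK.search(url) is not None


# Hypothetical URLs, for illustration only.
samples = [
    'http://yunqi.qq.com/bk/so2/n30p1',
    'http://yunqi.qq.com/bk/so2/n30p25',
    'http://yunqi.qq.com/bk/so2/n30p',        # page number missing
    'http://yunqi.qq.com/bk/other/123.html',  # made-up non-list URL
]
for url in samples:
    print(url, is_page_link(url))
```

Because \d+ requires at least one digit, a URL that ends in "n30p" with no page number is not treated as a pagination link.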

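By now you should be familiar with extracting data from pages. The spider code is straightforward, so it is not explained further here.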
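
For readers who want to try the list-page paths without requesting the site, here is a minimal offline sketch. It is not the spider's own code: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and ElementTree supports only a subset of XPath, so the text() and @attr steps are replaced by .text, .findtext() and .get(). The markup in SAMPLE, the example.com URLs and the function name parse_book are all hypothetical; the fragment was written only to mirror the element structure that the spider's expressions imply.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup containing only the elements the paths touch.
SAMPLE = """<div id="list">
  <div class="book">
    <a href="http://example.com/book/1.html"><img src="http://example.com/cover/1.jpg"/></a>
    <div class="book_info">
      <h3><a id="1" href="http://example.com/book/1.html">Sample Novel</a></h3>
      <dl>
        <dd class="w_auth"><a>Alice</a></dd>
        <dd class="w_auth"><a>Fantasy</a></dd>
        <dd class="w_auth">ongoing</dd>
        <dd class="w_auth">2017-01-01</dd>
        <dd class="w_auth">123456</dd>
      </dl>
    </div>
  </div>
</div>"""


def parse_book(book):
    """Extract the same fields as parse_book_list from one <div class="book">."""
    title = book.find("./div[@class='book_info']/h3/a")
    record = {
        'novelImageUrl': book.find("./a/img").get('src'),
        'novelId': title.get('id'),
        'novelName': title.text,
        'novelLink': title.get('href'),
    }
    infos = book.findall("./div[@class='book_info']/dl/dd[@class='w_auth']")
    if len(infos) > 4:
        record['novelAuthor'] = infos[0].findtext('a')
        record['novelType'] = infos[1].findtext('a')
        record['novelStatus'] = infos[2].text
        record['novelUpdateTime'] = infos[3].text
        record['novelWords'] = infos[4].text
    else:
        # Same fallback values as in the spider.
        record.update(novelAuthor='', novelType='', novelStatus='',
                      novelUpdateTime='', novelWords=0)
    return record


root = ET.fromstring(SAMPLE)
books = [parse_book(b) for b in root.findall(".//div[@class='book']")]
print(books[0])
```

Real pages are HTML rather than well-formed XML, so inside a Scrapy project the response.xpath calls from the spider remain the right tool; this sketch only shows how the relative paths walk down from each book node.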
