17.3 Writing the Spider Module
In Section 17.1 the genspider command already created a spider template based on the CrawlSpider class, named YunqiQqComSpider. Now we implement the page parsing, which consists mainly of two methods. The parse_book_list method parses the book list page shown in Figure 17-1 and extracts the basic information of each novel. The parse_book_detail method parses the page shown in Figure 17-3 and extracts data such as the novel's click counts and popularity. Pagination links are extracted by the rule defined in rules; these links generally follow the pattern "/bk/so2/n30p\d+". The complete code of YunqiQqComSpider is as follows:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# The two item classes live in the project's items.py; adjust the package name
# below to match your own project.
from yunqiCrawl.items import YunqiBookListItem, YunqiBookDetailItem


class YunqiQqComSpider(CrawlSpider):
    name = 'yunqi.qq.com'
    allowed_domains = ['yunqi.qq.com']
    start_urls = ['http://yunqi.qq.com/bk/so2/n30p1']

    rules = (
        Rule(LinkExtractor(allow=r'/bk/so2/n30p\d+'),
             callback='parse_book_list', follow=True),
    )

    def parse_book_list(self, response):
        # Parse the book list page (Figure 17-1) and extract each novel's basic information.
        books = response.xpath(".//div[@class='book']")
        for book in books:
            novelImageUrl = book.xpath("./a/img/@src").extract_first()
            novelId = book.xpath("./div[@class='book_info']/h3/a/@id").extract_first()
            novelName = book.xpath("./div[@class='book_info']/h3/a/text()").extract_first()
            novelLink = book.xpath("./div[@class='book_info']/h3/a/@href").extract_first()
            novelInfos = book.xpath("./div[@class='book_info']/dl/dd[@class='w_auth']")
            if len(novelInfos) > 4:
                novelAuthor = novelInfos[0].xpath('./a/text()').extract_first()
                novelType = novelInfos[1].xpath('./a/text()').extract_first()
                novelStatus = novelInfos[2].xpath('./text()').extract_first()
                novelUpdateTime = novelInfos[3].xpath('./text()').extract_first()
                novelWords = novelInfos[4].xpath('./text()').extract_first()
            else:
                novelAuthor = ''
                novelType = ''
                novelStatus = ''
                novelUpdateTime = ''
                novelWords = 0
            bookListItem = YunqiBookListItem(
                novelId=novelId, novelName=novelName,
                novelLink=novelLink, novelAuthor=novelAuthor,
                novelType=novelType, novelStatus=novelStatus,
                novelUpdateTime=novelUpdateTime, novelWords=novelWords,
                novelImageUrl=novelImageUrl)
            yield bookListItem
            # Follow the link to the novel's detail page, carrying novelId along in meta.
            request = scrapy.Request(url=novelLink, callback=self.parse_book_detail)
            request.meta['novelId'] = novelId
            yield request

    def parse_book_detail(self, response):
        # Parse the detail page (Figure 17-3): click, popularity and comment statistics.
        # from scrapy.shell import inspect_response
        # inspect_response(response, self)
        novelId = response.meta['novelId']
        novelLabel = response.xpath("//div[@class='tags']/text()").extract_first()
        novelAllClick = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[1]/text()").extract_first()
        novelAllPopular = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[2]/text()").extract_first()
        novelAllComm = response.xpath(".//*[@id='novelInfo']/table/tr[2]/td[3]/text()").extract_first()
        novelMonthClick = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[1]/text()").extract_first()
        novelMonthPopular = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[2]/text()").extract_first()
        novelMonthComm = response.xpath(".//*[@id='novelInfo']/table/tr[3]/td[3]/text()").extract_first()
        novelWeekClick = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[1]/text()").extract_first()
        novelWeekPopular = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[2]/text()").extract_first()
        novelWeekComm = response.xpath(".//*[@id='novelInfo']/table/tr[4]/td[3]/text()").extract_first()
        novelCommentNum = response.xpath(".//*[@id='novelInfo_commentCount']/text()").extract_first()
        bookDetailItem = YunqiBookDetailItem(
            novelId=novelId, novelLabel=novelLabel,
            novelAllClick=novelAllClick, novelAllPopular=novelAllPopular,
            novelAllComm=novelAllComm, novelMonthClick=novelMonthClick,
            novelMonthPopular=novelMonthPopular,
            novelMonthComm=novelMonthComm, novelWeekClick=novelWeekClick,
            novelWeekPopular=novelWeekPopular,
            novelWeekComm=novelWeekComm, novelCommentNum=novelCommentNum)
        yield bookDetailItem
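The spider assumes that the two item classes YunqiBookListItem and YunqiBookDetailItem are already defined in the project's items.py. For reference, a minimal sketch of what they could look like is given below; the field names are inferred from the keyword arguments used in the spider, and the actual definitions in the project may differ.

import scrapy

class YunqiBookListItem(scrapy.Item):
    # Basic information extracted from the book list page (Figure 17-1).
    novelId = scrapy.Field()
    novelName = scrapy.Field()
    novelLink = scrapy.Field()
    novelAuthor = scrapy.Field()
    novelType = scrapy.Field()
    novelStatus = scrapy.Field()
    novelUpdateTime = scrapy.Field()
    novelWords = scrapy.Field()
    novelImageUrl = scrapy.Field()

class YunqiBookDetailItem(scrapy.Item):
    # Click, popularity and comment statistics extracted from the detail page (Figure 17-3).
    novelId = scrapy.Field()
    novelLabel = scrapy.Field()
    novelAllClick = scrapy.Field()
    novelAllPopular = scrapy.Field()
    novelAllComm = scrapy.Field()
    novelMonthClick = scrapy.Field()
    novelMonthPopular = scrapy.Field()
    novelMonthComm = scrapy.Field()
    novelWeekClick = scrapy.Field()
    novelWeekPopular = scrapy.Field()
    novelWeekComm = scrapy.Field()
    novelCommentNum = scrapy.Field()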
Page extraction should be familiar by now, and the code above is straightforward. The only point worth noting is that parse_book_list passes novelId to parse_book_detail through request.meta, so each detail item can be associated with the corresponding list item when the data is stored later.
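To verify the spider, it can be run from the project directory with "scrapy crawl yunqi.qq.com". Alternatively, below is a minimal sketch of launching it programmatically with Scrapy's CrawlerProcess; it assumes the script is executed from the Scrapy project root so that get_project_settings() can locate the project's settings.py.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # Load the project's settings.py (run this from the Scrapy project root).
    process = CrawlerProcess(get_project_settings())
    # The spider is referenced by its name attribute, so no import is needed here.
    process.crawl('yunqi.qq.com')
    process.start()  # blocks until the crawl is finished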