17.4 Pipeline
The previous section finished the spider module; this section implements the Pipeline, which stores Items into MongoDB, split across two collections, using the MongoDB replica-set cluster built in the previous chapter. It is largely the same as the Pipelines written earlier, with some data-cleaning logic added. The code is as follows:
import re

import pymongo

# Adjust the import path to your project's items module.
from yunqiCrawl.items import YunqiBookListItem


class YunqicrawlPipeline(object):
    def __init__(self, mongo_uri, mongo_db, replicaset):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.replicaset = replicaset

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'yunqi'),
            replicaset=crawler.settings.get('REPLICASET')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri, replicaset=self.replicaset)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, YunqiBookListItem):
            self._process_booklist_item(item)
        else:
            self._process_bookeDetail_item(item)
        return item

    def _process_booklist_item(self, item):
        '''
        Store novel information.
        :param item:
        :return:
        '''
        # Note: insert() is deprecated in newer pymongo; insert_one() is the modern equivalent.
        self.db.bookInfo.insert(dict(item))

    def _process_bookeDetail_item(self, item):
        '''
        Store novel popularity data.
        :param item:
        :return:
        '''
        # The raw values look like "总字数:10120"; extract only the digits.
        pattern = re.compile(r'\d+')
        # Strip whitespace and newlines from the label.
        item['novelLabel'] = item['novelLabel'].strip().replace('\n', '')

        match = pattern.search(item['novelAllClick'])
        item['novelAllClick'] = match.group() if match else item['novelAllClick']

        match = pattern.search(item['novelMonthClick'])
        item['novelMonthClick'] = match.group() if match else item['novelMonthClick']

        match = pattern.search(item['novelWeekClick'])
        item['novelWeekClick'] = match.group() if match else item['novelWeekClick']

        match = pattern.search(item['novelAllPopular'])
        item['novelAllPopular'] = match.group() if match else item['novelAllPopular']

        match = pattern.search(item['novelMonthPopular'])
        item['novelMonthPopular'] = match.group() if match else item['novelMonthPopular']

        match = pattern.search(item['novelWeekPopular'])
        item['novelWeekPopular'] = match.group() if match else item['novelWeekPopular']

        match = pattern.search(item['novelAllComm'])
        item['novelAllComm'] = match.group() if match else item['novelAllComm']

        match = pattern.search(item['novelMonthComm'])
        item['novelMonthComm'] = match.group() if match else item['novelMonthComm']

        match = pattern.search(item['novelWeekComm'])
        item['novelWeekComm'] = match.group() if match else item['novelWeekComm']

        self.db.bookhot.insert(dict(item))
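The pipeline reads its connection parameters from the project settings via from_crawler. A minimal sketch of the corresponding settings entries is shown below; the host list and replica-set name are assumptions here and should match the cluster built in the previous chapter:

# Sketch of the MongoDB-related settings (hostnames and replica-set name are placeholders).
MONGO_URI = 'mongodb://node1:27017,node2:27017,node3:27017'  # assumed replica-set members
MONGO_DATABASE = 'yunqi'                                      # matches the default in from_crawler
REPLICASET = 'yunqi'                                          # assumed replica-set name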
Finally, add the following code to settings.py to activate the Pipeline:
ITEM_PIPELINES = {
    'yunqiCrawl.pipelines.YunqicrawlPipeline': 300,
}
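Once the crawl has been running for a while, you can verify that the pipeline is writing data by querying the two collections directly with pymongo. A minimal check, assuming the default database name 'yunqi' and that localhost:27017 is a valid entry point into your cluster:

import pymongo

# Adjust the URI to your own MongoDB entry point (e.g. a mongos or replica-set member).
client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['yunqi']

# Count stored novel records and peek at one popularity document.
print('bookInfo count:', db.bookInfo.count_documents({}))
print('bookhot sample:', db.bookhot.find_one())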