18.7 Data Storage
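For data storage, PySpider uses its built-in ResultDB by default. This design makes it easy to preview results in the WebUI and then download them as JSON or other file formats, but it becomes impractical once the data volume grows even moderately, and it is not suited to production use. To implement custom storage, we need to override the on_result method. The doubanMovie project is modified as follows: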
# coding:utf-8
from pymongo import MongoClient
from pyspider.libs.base_handler import *


class MongoStore(object):
    def __init__(self):
        # Connect to the local MongoDB instance and use the douban.movies collection
        client = MongoClient()
        db = client.douban
        self.movies = db.movies

    def insert(self, result):
        if result:
            # Collection.insert() is deprecated and was removed in PyMongo 4;
            # insert_one() replaces it. A copy is inserted so that the _id field
            # PyMongo adds does not end up in the dict later passed to ResultDB.
            self.movies.insert_one(dict(result))


class Handler(BaseHandler):
    crawl_config = {
    }
    mongo = MongoStore()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'http://www.douban.com/'
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', headers=self.headers,
                   callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every tag link on the tag index page
        for each in response.doc('.tagCol>tbody>tr>td>a').items():
            self.crawl(each.attr.href, headers=self.headers,
                       callback=self.list_page, validate_cert=False)

    def list_page(self, response):
        # Movie detail links on the current list page
        for each in response.doc('.pl2>a').items():
            self.crawl(each.attr.href, headers=self.headers,
                       callback=self.detail_page, validate_cert=False)
        # Pagination: the "next" link leads to another list page
        for each in response.doc('.next>a').items():
            self.crawl(each.attr.href, headers=self.headers,
                       callback=self.list_page, validate_cert=False)

    def detail_page(self, response):
        title = response.doc('#content>h1>span[property="v:itemreviewed"]').text()
        time = response.doc('#content>h1>span[class="year"]').text()
        director = response.doc('.attrs>a[rel="v:directedBy"]').text()
        actor = []
        genre = []
        for each in response.doc('a[rel="v:starring"]').items():
            actor.append(each.text())
        for each in response.doc('#info>span[property="v:genre"]').items():
            genre.append(each.text())
        rating = response.doc('.ll.rating_num').text()
        return {
            "url": response.url,
            "title": title,
            "time": time,
            "director": director,
            "actor": actor,
            "genre": genre,
            "rating": rating
        }

    def on_result(self, result):
        # Write to MongoDB first, then let BaseHandler store the result in ResultDB as usual
        self.mongo.insert(result)
        super(Handler, self).on_result(result)
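The code adds a MongoStore class that initializes the database connection and implements the insert operation. The Handler class overrides the on_result method of its base class BaseHandler to write each result to MongoDB, and it also calls BaseHandler's on_result so that the result is still added to the default ResultDB. The final stored result is shown in Figure 18-18.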
Figure 18-18 MongoDB storage
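One thing the store above does not address is repeated crawls: every run inserts the same movie again. As a rough sketch of one possible refinement (an assumption, not part of the original doubanMovie project), the store could upsert on the url field so that each page keeps a single document. The names UpsertMongoStore and FakeCollection are hypothetical; update_one with upsert=True is PyMongo's own Collection method, and the fake collection only imitates the small part of it needed to try the idea without a running MongoDB.

```python
# coding:utf-8
# Sketch only: UpsertMongoStore and FakeCollection are hypothetical names,
# not part of PySpider or of the original doubanMovie project.


class UpsertMongoStore(object):
    """Keeps one document per URL instead of one document per crawl."""

    def __init__(self, collection):
        # `collection` is expected to behave like a pymongo Collection,
        # for example MongoClient().douban.movies
        self.collection = collection

    def save(self, result):
        # Skip empty results and results without a URL to key on
        if not result or not result.get("url"):
            return False
        doc = dict(result)  # copy, so the caller's dict is left untouched
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return True


class FakeCollection(object):
    """In-memory stand-in for a collection; imitates only update_one by url."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        key = filter["url"]
        if key in self.docs or upsert:
            self.docs.setdefault(key, {}).update(update["$set"])


if __name__ == "__main__":
    store = UpsertMongoStore(FakeCollection())
    store.save({"url": "http://example.com/1", "title": "A", "rating": "8.0"})
    store.save({"url": "http://example.com/1", "title": "A", "rating": "8.1"})
    # Number of stored documents after saving the same URL twice
    print(len(store.collection.docs))
```

In the Handler, this would presumably be wired up as mongo = UpsertMongoStore(MongoClient().douban.movies), with on_result calling self.mongo.save(result) before delegating to the base class. Whether url is the right key depends on the site; redirects or query-string variations could still produce duplicates.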