5.2 More Examples
We have already worked through one example of processing data with an Item Pipeline; now let's look at two more practical examples.
5.2.1 Filtering Duplicate Data
To make sure the scraped book records contain no duplicates, we can implement a deduplicating Item Pipeline. Here we use the book name as the key (in practice the ISBN should be the key, but we only scraped the name and price). DuplicatesPipeline is implemented as follows:
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        # Set of book names seen so far, used for deduplication.
        self.book_set = set()

    def process_item(self, item, spider):
        name = item['name']
        if name in self.book_set:
            # A book with the same name was seen before: discard this item.
            raise DropItem("Duplicate book found: %s" % item)
        self.book_set.add(name)
        return item
The code above is explained as follows:
A constructor is added that initializes the set used to deduplicate book names.
In the process_item method, the item's name field is taken first and checked against the set book_set. If the name is already in the set, the item is a duplicate and a DropItem exception is raised to discard it; otherwise the name is added to the set and the item is returned.
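If you want to see the deduplication behaviour without running a full crawl, the pipeline can also be exercised by hand. The following is a minimal sketch (it assumes DuplicatesPipeline is importable from example.pipelines as configured later in this section; plain dicts stand in for the book items, and no real spider object is needed because process_item ignores it):

from scrapy.exceptions import DropItem
from example.pipelines import DuplicatesPipeline  # the class defined above

pipeline = DuplicatesPipeline()
book = {'name': 'A Light in the Attic', 'price': '¥441.64'}

print(pipeline.process_item(dict(book), spider=None))  # first time: item is returned unchanged
try:
    pipeline.process_item(dict(book), spider=None)      # same name again
except DropItem as exc:
    print('dropped:', exc)                              # the duplicate is discarded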
Next, let's test DuplicatesPipeline. First run the spider without enabling DuplicatesPipeline and look at the result:
$ scrapy crawl books -o book1.csv
...
$ cat -n book1.csv
   1  price,name
   2  ¥441.64,A Light in the Attic
   3  ¥458.45,Tipping the Velvet
   4  ¥427.40,Soumission
   5  ¥407.95,Sharp Objects
   6  ¥462.63,Sapiens: A Brief History of Humankind
   7  ¥193.22,The Requiem Red
   8  ¥284.42,The Dirty Little Secrets of Getting Your Dream Job
   ...
 993  ¥317.86,Bounty (Colorado Mountain #7)
 994  ¥173.18,Blood Defense (Samantha Brinkman #1)
 995  ¥295.60,"Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)"
 996  ¥370.07,Beyond Good and Evil
 997  ¥473.72,Alice in Wonderland (Alice's Adventures in Wonderland #1)
 998  ¥486.77,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)"
 999  ¥144.77,A Spy's Devotion (The Regency Spies of London #1)
1000  ¥460.50,1st to Die (Women's Murder Club #1)
1001  ¥222.49,"1,000 Places to See Before You Die"
At this point there are 1000 books.
Then enable DuplicatesPipeline in the configuration file settings.py:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.DuplicatesPipeline': 350,
}
Run the spider again and compare the results:
$ scrapy crawl books -o book2.csv
...
$ cat -n book2.csv
   1  name,price
   2  A Light in the Attic,¥441.64
   3  Tipping the Velvet,¥458.45
   4  Soumission,¥427.40
   5  Sharp Objects,¥407.95
   6  Sapiens: A Brief History of Humankind,¥462.63
   7  The Requiem Red,¥193.22
   8  The Dirty Little Secrets of Getting Your Dream Job,¥284.42
   ...
 993  Blood Defense (Samantha Brinkman #1),¥173.18
 994  "Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)",¥295.60
 995  Beyond Good and Evil,¥370.07
 996  Alice in Wonderland (Alice's Adventures in Wonderland #1),¥473.72
 997  "Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",¥486.77
 998  A Spy's Devotion (The Regency Spies of London #1),¥144.77
 999  1st to Die (Women's Murder Club #1),¥460.50
1000  "1,000 Places to See Before You Die",¥222.49
Now there are only 999 books, one fewer than before, which means two books share the same name. Searching the spider's log reveals the duplicate:
[scrapy.core.scraper] WARNING: Dropped: Duplicate book found: {'name': 'The Star-Touched Queen', 'price': '¥275.55'}
5.2.2 Storing Data in MongoDB
Sometimes we want to store the scraped data in a database; an Item Pipeline can be implemented to handle this kind of task. Below we implement an Item Pipeline that writes data to a MongoDB database:
from scrapy.item import Item
import pymongo


class MongoDBPipeline(object):

    DB_URI = 'mongodb://localhost:27017/'   # URI of the MongoDB server
    DB_NAME = 'scrapy_data'                 # name of the database to write to

    def open_spider(self, spider):
        # Connect to the database once, when the spider is opened.
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        # Close the connection when the spider is closed.
        self.client.close()

    def process_item(self, item, spider):
        # Use the spider's name as the collection name.
        collection = self.db[spider.name]
        # insert_one needs a dict, so convert the item if necessary.
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
The code above is explained as follows.
Two constants are defined as class attributes:
DB_URI: the URI of the database.
DB_NAME: the name of the database.
During the spider's whole crawl, the database connection only needs to be opened and closed once: connect before any data is processed, and close after all data has been processed. We therefore implement the following two methods (called when the spider is opened and closed):
open_spider(spider)
close_spider(spider)
The database connection and close are implemented in open_spider and close_spider respectively.
The actual write to MongoDB happens in process_item: self.db and spider.name are used to get a collection, and the data is then inserted into that collection. The collection's insert_one method must be given a dict (not an Item object), so the item's type is checked first and, if it is an Item, it is converted to a dict.
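To make the conversion step concrete, here is a minimal sketch of what dict(item) does (BookItem is an assumed stand-in for the item class used earlier in the chapter):

import scrapy

class BookItem(scrapy.Item):   # assumed item class with the two fields we scrape
    name = scrapy.Field()
    price = scrapy.Field()

item = BookItem(name='A Light in the Attic', price='¥441.64')
print(isinstance(item, scrapy.Item))  # True, so the pipeline converts it
print(dict(item))                     # {'name': 'A Light in the Attic', 'price': '¥441.64'}
# insert_one() then receives a plain dict, which pymongo accepts.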
Next, let's test MongoDBPipeline. Enable MongoDBPipeline in the configuration file settings.py:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.MongoDBPipeline': 400,
}
Run the spider and check the results in the database:
$ scrapy crawl books
...
$ mongo
MongoDB shell version: 2.4.9
connecting to: test
> use scrapy_data
switched to db scrapy_data
> db.books.count()
1000
> db.books.find()
{ "_id" : ObjectId("58ae39a89dcd191973cc588f"), "price" : "¥441.64", "name" : "A Light in the Attic" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5890"), "price" : "¥458.45", "name" : "Tipping the Velvet" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5891"), "price" : "¥427.40", "name" : "Soumission" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5892"), "price" : "¥407.95", "name" : "Sharp Objects" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5893"), "price" : "¥462.63", "name" : "Sapiens: A Brief History of Humankind" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5894"), "price" : "¥193.22", "name" : "The Requiem Red" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5895"), "price" : "¥284.42", "name" : "The Dirty Little Secrets of Getting Your Dream Job" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5896"), "price" : "¥152.96", "name" : "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5897"), "price" : "¥192.80", "name" : "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5898"), "price" : "¥444.89", "name" : "The Black Maria" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5899"), "price" : "¥119.35", "name" : "Starving Hearts (Triangular Trade Trilogy, #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589a"), "price" : "¥176.25", "name" : "Shakespeare's Sonnets" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589b"), "price" : "¥148.95", "name" : "Set Me Free" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589c"), "price" : "¥446.08", "name" : "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589d"), "price" : "¥298.75", "name" : "Rip it Up and Start Again" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589e"), "price" : "¥488.39", "name" : "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589f"), "price" : "¥203.72", "name" : "Olio" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a0"), "price" : "¥320.68", "name" : "Mesaerion: The Best Science Fiction Stories 1800-1849" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a1"), "price" : "¥437.89", "name" : "Libertarianism for Beginners" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a2"), "price" : "¥385.34", "name" : "It's Only the Himalayas" }
Type "it" for more
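As an alternative to the mongo shell, the same check can be made from Python with pymongo. A quick sketch (count_documents requires pymongo 3.7 or later; the database and collection names match the defaults used above):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['scrapy_data']

print(db['books'].count_documents({}))       # expected: 1000
print(db['books'].find_one({}, {'_id': 0}))  # e.g. {'price': '¥441.64', 'name': 'A Light in the Attic'}
client.close()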
In the implementation above, the database URI and database name are hard-coded. If we want to set them in the configuration file instead, only a small change is needed:
from scrapy.item import Item
import pymongo


class MongoDBPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        cls.DB_URI = crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/')
        cls.DB_NAME = crawler.settings.get('MONGO_DB_NAME', 'scrapy_data')
        return cls()

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
The changes are explained as follows:
A class method from_crawler(cls, crawler) is added, replacing the DB_URI and DB_NAME class attribute definitions.
If an Item Pipeline defines a from_crawler method, Scrapy calls it to create the Item Pipeline object. The method takes two arguments:
cls: the Item Pipeline class object (here, the MongoDBPipeline class).
crawler: Crawler is one of Scrapy's core objects; the configuration can be accessed through the crawler's settings attribute.
In from_crawler, MONGO_DB_URI and MONGO_DB_NAME are read from the settings (falling back to defaults if they are not present) and assigned to attributes of cls, that is, to class attributes of MongoDBPipeline.
None of the other code changes, because only the way the MongoDBPipeline class attributes are set has changed.
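As a note on the design: an equally common variant (roughly the pattern used in the Scrapy documentation's MongoDB example) has from_crawler pass the settings to the constructor, so they are stored on the instance rather than on the class. A minimal sketch of that variant:

import pymongo


class MongoDBPipeline(object):

    def __init__(self, db_uri, db_name):
        # Settings live on the instance instead of the class.
        self.db_uri = db_uri
        self.db_name = db_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            db_uri=crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/'),
            db_name=crawler.settings.get('MONGO_DB_NAME', 'scrapy_data'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.db_uri)
        self.db = self.client[self.db_name]

    # close_spider and process_item are the same as in the version above.

Either approach works; the version in the text is shorter, while the instance-attribute variant avoids mutating class state.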
Now the database to use can be configured in settings.py:
MONGO_DB_URI = 'mongodb://192.168.1.105:27017/'
MONGO_DB_NAME = 'liushuo_scrapy_data'