5.2 More Examples
We have already worked through one example of processing data with an Item Pipeline; now let's look at two more practical examples.
5.2.1 Filtering Duplicate Data
To make sure the scraped book records contain no duplicates, we can implement a deduplicating Item Pipeline. Here we use the book name as the key (in practice the ISBN should be the key, but we only scraped the name and price). DuplicatesPipeline is implemented as follows:
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        # Set of book names seen so far, used for deduplication.
        self.book_set = set()

    def process_item(self, item, spider):
        name = item['name']
        if name in self.book_set:
            # A book with the same name was seen before: discard this item.
            raise DropItem("Duplicate book found: %s" % item)
        self.book_set.add(name)
        return item
The code above is explained as follows:
A constructor is added that initializes the set used to deduplicate book names.
In the process_item method, the item's name field is taken first and checked against the set book_set. If the name is already in the set, the item is a duplicate and a DropItem exception is raised to discard it; otherwise the name is added to the set and the item is returned.
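If you want to see the deduplication behaviour without running a full crawl, the pipeline can also be exercised by hand. The following is a minimal sketch (it assumes DuplicatesPipeline is importable from example.pipelines as configured later in this section; plain dicts stand in for the book items, and no real spider object is needed because process_item ignores it):

from scrapy.exceptions import DropItem
from example.pipelines import DuplicatesPipeline  # the class defined above

pipeline = DuplicatesPipeline()
book = {'name': 'A Light in the Attic', 'price': '¥441.64'}

print(pipeline.process_item(dict(book), spider=None))  # first time: item is returned unchanged
try:
    pipeline.process_item(dict(book), spider=None)      # same name again
except DropItem as exc:
    print('dropped:', exc)                              # the duplicate is discarded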
Next, let's test DuplicatesPipeline. First run the spider without enabling DuplicatesPipeline and look at the result:
$ scrapy crawl books -o book1.csv
...
$ cat -n book1.csv
   1  price,name
   2  ¥441.64,A Light in the Attic
   3  ¥458.45,Tipping the Velvet
   4  ¥427.40,Soumission
   5  ¥407.95,Sharp Objects
   6  ¥462.63,Sapiens: A Brief History of Humankind
   7  ¥193.22,The Requiem Red
   8  ¥284.42,The Dirty Little Secrets of Getting Your Dream Job
   ...
 993  ¥317.86,Bounty (Colorado Mountain #7)
 994  ¥173.18,Blood Defense (Samantha Brinkman #1)
 995  ¥295.60,"Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)"
 996  ¥370.07,Beyond Good and Evil
 997  ¥473.72,Alice in Wonderland (Alice's Adventures in Wonderland #1)
 998  ¥486.77,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)"
 999  ¥144.77,A Spy's Devotion (The Regency Spies of London #1)
1000  ¥460.50,1st to Die (Women's Murder Club #1)
1001  ¥222.49,"1,000 Places to See Before You Die"
At this point there are 1000 books.
Then enable DuplicatesPipeline in the configuration file settings.py:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.DuplicatesPipeline': 350,
}
Run the spider again and compare the results:
$ scrapy crawl books -o book2.csv
...
$ cat -n book2.csv
   1  name,price
   2  A Light in the Attic,¥441.64
   3  Tipping the Velvet,¥458.45
   4  Soumission,¥427.40
   5  Sharp Objects,¥407.95
   6  Sapiens: A Brief History of Humankind,¥462.63
   7  The Requiem Red,¥193.22
   8  The Dirty Little Secrets of Getting Your Dream Job,¥284.42
   ...
 993  Blood Defense (Samantha Brinkman #1),¥173.18
 994  "Bleach, Vol. 1: Strawberry and the Soul Reapers (Bleach #1)",¥295.60
 995  Beyond Good and Evil,¥370.07
 996  Alice in Wonderland (Alice's Adventures in Wonderland #1),¥473.72
 997  "Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",¥486.77
 998  A Spy's Devotion (The Regency Spies of London #1),¥144.77
 999  1st to Die (Women's Murder Club #1),¥460.50
1000  "1,000 Places to See Before You Die",¥222.49
Now there are only 999 books, one fewer than before, which means two books share the same name. Searching the spider's log reveals the duplicate:
[scrapy.core.scraper] WARNING: Dropped: Duplicate book found: {'name': 'The Star-Touched Queen', 'price': '¥275.55'}
5.2.2 Storing Data in MongoDB
Sometimes we want to store the scraped data in a database; an Item Pipeline can be implemented to handle this kind of task. Below we implement an Item Pipeline that writes data to a MongoDB database:
from scrapy.item import Item
import pymongo


class MongoDBPipeline(object):

    DB_URI = 'mongodb://localhost:27017/'   # URI of the MongoDB server
    DB_NAME = 'scrapy_data'                 # name of the database to write to

    def open_spider(self, spider):
        # Connect to the database once, when the spider is opened.
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        # Close the connection when the spider is closed.
        self.client.close()

    def process_item(self, item, spider):
        # Use the spider's name as the collection name.
        collection = self.db[spider.name]
        # insert_one needs a dict, so convert the item if necessary.
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
The code above is explained as follows.
Two constants are defined as class attributes:
DB_URI: the URI of the database.
DB_NAME: the name of the database.
During the spider's whole crawl, the database connection only needs to be opened and closed once: connect before any data is processed, and close after all data has been processed. We therefore implement the following two methods (called when the spider is opened and closed):
open_spider(spider)
close_spider(spider)
The database connection and close are implemented in open_spider and close_spider respectively.
The actual write to MongoDB happens in process_item: self.db and spider.name are used to get a collection, and the data is then inserted into that collection. The collection's insert_one method must be given a dict (not an Item object), so the item's type is checked first and, if it is an Item, it is converted to a dict.
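To make the conversion step concrete, here is a minimal sketch of what dict(item) does (BookItem is an assumed stand-in for the item class used earlier in the chapter):

import scrapy

class BookItem(scrapy.Item):   # assumed item class with the two fields we scrape
    name = scrapy.Field()
    price = scrapy.Field()

item = BookItem(name='A Light in the Attic', price='¥441.64')
print(isinstance(item, scrapy.Item))  # True, so the pipeline converts it
print(dict(item))                     # {'name': 'A Light in the Attic', 'price': '¥441.64'}
# insert_one() then receives a plain dict, which pymongo accepts.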
Next, let's test MongoDBPipeline. Enable MongoDBPipeline in the configuration file settings.py:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.MongoDBPipeline': 400,
}
Run the spider and check the results in the database:
$ scrapy crawl books
...
$ mongo
MongoDB shell version: 2.4.9
connecting to: test
> use scrapy_data
switched to db scrapy_data
> db.books.count()
1000
> db.books.find()
{ "_id" : ObjectId("58ae39a89dcd191973cc588f"), "price" : "¥441.64", "name" : "A Light in the Attic" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5890"), "price" : "¥458.45", "name" : "Tipping the Velvet" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5891"), "price" : "¥427.40", "name" : "Soumission" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5892"), "price" : "¥407.95", "name" : "Sharp Objects" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5893"), "price" : "¥462.63", "name" : "Sapiens: A Brief History of Humankind" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5894"), "price" : "¥193.22", "name" : "The Requiem Red" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5895"), "price" : "¥284.42", "name" : "The Dirty Little Secrets of Getting Your Dream Job" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5896"), "price" : "¥152.96", "name" : "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5897"), "price" : "¥192.80", "name" : "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5898"), "price" : "¥444.89", "name" : "The Black Maria" }
{ "_id" : ObjectId("58ae39a89dcd191973cc5899"), "price" : "¥119.35", "name" : "Starving Hearts (Triangular Trade Trilogy, #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589a"), "price" : "¥176.25", "name" : "Shakespeare's Sonnets" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589b"), "price" : "¥148.95", "name" : "Set Me Free" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589c"), "price" : "¥446.08", "name" : "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589d"), "price" : "¥298.75", "name" : "Rip it Up and Start Again" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589e"), "price" : "¥488.39", "name" : "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991" }
{ "_id" : ObjectId("58ae39a89dcd191973cc589f"), "price" : "¥203.72", "name" : "Olio" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a0"), "price" : "¥320.68", "name" : "Mesaerion: The Best Science Fiction Stories 1800-1849" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a1"), "price" : "¥437.89", "name" : "Libertarianism for Beginners" }
{ "_id" : ObjectId("58ae39a89dcd191973cc58a2"), "price" : "¥385.34", "name" : "It's Only the Himalayas" }
Type "it" for more
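As an alternative to the mongo shell, the same check can be made from Python with pymongo. A quick sketch (count_documents requires pymongo 3.7 or later; the database and collection names match the defaults used above):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['scrapy_data']

print(db['books'].count_documents({}))       # expected: 1000
print(db['books'].find_one({}, {'_id': 0}))  # e.g. {'price': '¥441.64', 'name': 'A Light in the Attic'}
client.close()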
In the implementation above, the database URI and database name are hard-coded. If we want to set them in the configuration file instead, only a small change is needed:
from scrapy.item import Item
import pymongo


class MongoDBPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        cls.DB_URI = crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/')
        cls.DB_NAME = crawler.settings.get('MONGO_DB_NAME', 'scrapy_data')
        return cls()

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
The changes are explained as follows:
A class method from_crawler(cls, crawler) is added, replacing the DB_URI and DB_NAME class attribute definitions.
If an Item Pipeline defines a from_crawler method, Scrapy calls it to create the Item Pipeline object. The method takes two arguments:
cls: the Item Pipeline class object (here, the MongoDBPipeline class).
crawler: Crawler is one of Scrapy's core objects; the configuration can be accessed through the crawler's settings attribute.
In from_crawler, MONGO_DB_URI and MONGO_DB_NAME are read from the settings (falling back to defaults if they are not present) and assigned to attributes of cls, that is, to class attributes of MongoDBPipeline.
None of the other code changes, because only the way the MongoDBPipeline class attributes are set has changed.
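As a note on the design: an equally common variant (roughly the pattern used in the Scrapy documentation's MongoDB example) has from_crawler pass the settings to the constructor, so they are stored on the instance rather than on the class. A minimal sketch of that variant:

import pymongo


class MongoDBPipeline(object):

    def __init__(self, db_uri, db_name):
        # Settings live on the instance instead of the class.
        self.db_uri = db_uri
        self.db_name = db_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            db_uri=crawler.settings.get('MONGO_DB_URI', 'mongodb://localhost:27017/'),
            db_name=crawler.settings.get('MONGO_DB_NAME', 'scrapy_data'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.db_uri)
        self.db = self.client[self.db_name]

    # close_spider and process_item are the same as in the version above.

Either approach works; the version in the text is shorter, while the instance-attribute variant avoids mutating class state.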
Now the database to use can be configured in settings.py:
MONGO_DB_URI = 'mongodb://192.168.1.105:27017/'
MONGO_DB_NAME = 'liushuo_scrapy_data'