How to avoid duplicates in a crawler
I wrote a crawler using the Scrapy framework in Python to select some links and meta tags. It then crawls the start URLs and writes the data in JSON-encoded format to a file. The problem is that when the crawler is run two or three times with the same start URLs, the data in the file gets duplicated. To avoid this I used a spider middleware in Scrapy, namely this one: http://snippets.scrapy.org/snippets/1/
What I did was copy and paste the above code into a file inside my Scrapy project, and I enabled it in the settings.py file by adding the following line:
SPIDER_MIDDLEWARES = {'a11ypi.removeDuplicates.IgnoreVisitedItems':560}
where "a11ypi.removeDuplicates.IgnoreVisitedItems" is the class path. Finally, I modified my items.py file to include the following fields:
visit_id = Field()
visit_status = Field()
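Roughly, the full item definition is something like this (a sketch based on the fields the spider code below populates, plus the two new fields required by the middleware):

from scrapy.item import Item, Field

class AYpiItem(Item):
    # fields populated by the spider
    foruri = Field()
    foruri_id = Field()
    thisurl = Field()
    thisid = Field()
    rec = Field()
    # fields added for the IgnoreVisitedItems middleware
    visit_id = Field()
    visit_status = Field()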
But this doesn't work: when run twice, the crawler still produces the same results and appends them to the file.
I write to the file in my pipelines.py file, as follows:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped values and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        json.dump(d, self.file)
        return item
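To make the duplication concrete: since the file is opened in append mode and json.dump is called for every scraped item, running the spider twice with the same start URL leaves two identical JSON objects written back to back, roughly of this shape (the values are placeholders, not real scraped data):

{"<foruri url>": {"<rec>": {"<foruri id>": "<thisurl>:<thisid>"}}}{"<foruri url>": {"<rec>": {"<foruri id>": "<thisurl>:<thisid>"}}}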
And my spider code is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "a11y.in"
    allowed_domains = ["a11y.in"]

    # This is the list of seed URLs to begin crawling with.
    start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]

    # This is the callback method, used for scraping specific data.
    def parse(self, response):
        temp = []
        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        # The foruri attribute contains both a URL and an id, separated by ":".
        wholeforuri = hxs.select("//@foruri").extract()
        for i in wholeforuri:
            temp.append(i.rpartition(":"))
        item["foruri"] = [i[0] for i in temp]                        # the URL part of foruri
        item["foruri_id"] = [i.split(":")[-1] for i in wholeforuri]  # the id part of foruri
        item['thisurl'] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
Kindly suggest what to do.
Comments (1)
Try to understand why the snippet is written the way it is:
Notice in line 2 of the snippet that you actually need a key in Request.meta called FILTER_VISITED for the middleware to drop the request. The reasoning is deliberate: if every single URL you had ever visited were skipped, you would have no URLs left to traverse at all. So FILTER_VISITED actually allows you to choose which URL patterns you want to skip. If you want links extracted with a particular rule to be skipped, just set that flag on the requests that rule generates (see the sketch below).

P.S. I do not know if it works for 0.14 and above, as some of the code has changed for storing the spider context in the sqlite db.
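A minimal sketch of what that could look like, assuming the meta key is the lowercase 'filter_visited' used by the snippet; the filter_visited helper and the allow pattern below are only illustrative, not part of the original code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Illustrative helper: mark each request produced by the rule so the
# IgnoreVisitedItems middleware may drop it if the page was visited before.
def filter_visited(request):
    request.meta['filter_visited'] = True
    return request

class AYpiSpider(CrawlSpider):
    name = "a11y.in"
    allowed_domains = ["a11y.in"]
    start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]

    rules = (
        # The allow pattern is only an example; use whatever matches your links.
        Rule(SgmlLinkExtractor(allow=('/a11ypi/idea/',)),
             callback='parse_item',
             process_request=filter_visited),
    )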