Does running the crawler twice produce duplicate items?


I use the crawler framework "scrapy" in Python, and I use the pipelines.py file to store my items in JSON format in a file. The code for doing this is given below:
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Writing it to a file
        json.dump(d, self.file)
        return item

The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by reading from the file first and then matching the existing data against the new data to be written. Since the data read from the file is in JSON format, I decoded it with the json.loads() function, but it doesn't work:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we iterate over the scraped items and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Writing it to a file
        if d != self.temp:  # check that the newly generated data is not already in the file
            json.dump(d, self.file)
        return item

Please suggest a method to do this.

Note: Please note that I have to open the file in "append" mode, since I may crawl a different set of links, but running the crawler twice with the same start_url shouldn't write the same data to the file twice.
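(A side note on why the json.loads() call above fails: json.dump() is called once per item, so after a few items the file holds several JSON objects back to back, and json.loads() cannot parse concatenated objects as a single document. A minimal sketch of reading such a file back in, under the assumption that each item is written on its own line, the JSON Lines convention, which is not what the code above currently does, might look like this:)

import json

def load_previous_items(path):
    """Read previously written items back in, assuming one JSON object per line."""
    seen = []
    try:
        with open(path, "r") as f:
            for line in f:
                line = line.strip()
                if line:
                    seen.append(json.loads(line))
    except IOError:
        # First run: the file does not exist yet, so there is nothing to compare against.
        pass
    return seen

(With one object per line, each new dictionary can be compared against the already-written ones before appending it.)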



Answer from 夏至、离别:


You can filter out duplicates by using some custom middleware, e.g. this. To actually use this in your spider, though, you'll need two more things: some way of assigning ids to items so that the filter can identify duplicates, and some way of persisting the set of visited ids between spider runs. The second is easy: you could use something Pythonic like shelve, or you could use one of the many key-value stores that are popular these days. The first part is going to be harder, though, and will depend on the problem you're trying to solve.
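As a rough illustration of this answer (not the middleware the link pointed to), here is a minimal sketch of a duplicates-filtering item pipeline that keeps the set of seen ids in a standard-library shelve store so it survives between runs. The make_item_id helper and the a11ypi_seen.db filename are hypothetical, and the id scheme would need to be adapted to whatever uniquely identifies an item for this spider; the open_spider/close_spider hooks and the DropItem exception follow the current Scrapy pipeline interface.

import shelve

from scrapy.exceptions import DropItem


def make_item_id(item):
    # Hypothetical id scheme: combine the fields that make an item unique.
    return "%s|%s" % (item["thisurl"], "|".join(item["foruri_id"]))


class DuplicatesFilterPipeline(object):
    """Drop items whose id was already seen in this or any previous run."""

    def open_spider(self, spider):
        # shelve gives a persistent dict-like store keyed by strings,
        # so the set of seen ids survives between crawler runs.
        self.seen = shelve.open("a11ypi_seen.db")

    def close_spider(self, spider):
        self.seen.close()

    def process_item(self, item, spider):
        item_id = make_item_id(item)
        if item_id in self.seen:
            raise DropItem("Duplicate item: %s" % item_id)
        self.seen[item_id] = True
        return item

Such a pipeline would then be enabled through the ITEM_PIPELINES setting, with a lower order number than the pipeline that writes the JSON file, so duplicates are dropped before anything is written.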
