Scrapy: stop following requests for a specific target
My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
    search_targets = get_search_targets()
    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
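
get_request isn't shown in the question; here is a minimal sketch of what it might look like, assuming the site exposes paginated search URLs (the URL below is a placeholder) and that the target's parameters ride along on the request via cb_kwargs so that parse receives them as keyword arguments:

import scrapy

def get_request(contract_type, postal_code, page):
    # Placeholder URL; the real one depends on the site being crawled.
    url = (f'https://example.com/search'
           f'?contract_type={contract_type}&postal_code={postal_code}&page={page}')
    # cb_kwargs are handed to the callback (parse, by default) as keyword arguments.
    return scrapy.Request(url, cb_kwargs={
        'contract_type': contract_type,
        'postal_code': postal_code,
        'cur_page': page,
    })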
Each link has multiple pages that will be followed, i.e.:
def parse(self, response, **kwargs):
    # Some logic depending on the response: cur_page, num_pages, estates,
    # contract_type and postal_code are derived here.
    # ...
    if cur_page < num_pages:  # Follow the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request
    for estate_dict in estates:  # Parse the items of the response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
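
If get_request attaches the target's parameters via cb_kwargs as sketched above, Scrapy passes them into parse as keyword arguments, so the callback can bind them by name instead of digging through **kwargs:

def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
    ...  # same logic as above, with the target's parameters in scope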
Now each link (target), after a few pages, will start encountering duplicates: items already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.
def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()
    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    # Save the item in the DB
    # ...
Now, when I find a duplicate estate, I want Scrapy to stop following pages for that specific link target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post')
in the pipeline, with the info about the finished search target, and catch that exception in my spider. But how can I tell Scrapy to stop following links for that specific search target?
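
For what it's worth, a DropItem raised in a pipeline only propagates through the item pipeline chain; the spider callback has long since returned and cannot catch it. The coordination therefore has to happen through shared state. Below is a minimal sketch of one way to do that, assuming a search target is identified by (contract_type, postal_code) and that those values are stored on the item (both assumptions, since EstateItem's fields aren't shown): process_item receives the spider instance, so the pipeline can mark a target as exhausted, and parse checks that mark before scheduling the next page.

from scrapy.exceptions import DropItem

class SaveEstatePipeline:
    def process_item(self, item, spider):
        # is_duplicate is hypothetical shorthand for the DB query shown above.
        if self.is_duplicate(item):
            # Mark this search target as exhausted so the spider stops paginating it.
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate estate')
        # ... save the item in the DB ...
        return item

And in the spider:

import scrapy

class EstateSpider(scrapy.Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished_targets = set()  # (contract_type, postal_code) pairs to stop following

    def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
        # ... extract num_pages and estates, yield items, as above ...
        if cur_page < num_pages and (contract_type, postal_code) not in self.finished_targets:
            yield get_request(contract_type, postal_code, cur_page + 1)

One caveat: Scrapy is concurrent, so a next-page request may already be scheduled by the time the pipeline flags the duplicate; the set only prevents further pagination, so an extra page or two of requests can still go out.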