Scrapy: stop following requests for a specific target


My Scrapy spider has a bunch of independent target links to crawl.

def start_requests(self):
    # Each search target is an independent combination of contract type and postal code
    search_targets = get_search_targets()

    for search in search_targets:
        # Request the first results page for each target
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
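
The helpers get_search_targets and get_request aren't shown above. For context, a minimal sketch of what a helper like get_request could look like (the URL and parameter names are placeholders, and forwarding the search parameters to the callback via cb_kwargs is an assumption, not part of the original code):

import scrapy

def get_request(contract_type, postal_code, page):
    # Placeholder URL; the real site and query parameters will differ.
    url = (f'https://example.com/estates'
           f'?type={contract_type}&zip={postal_code}&page={page}')
    # cb_kwargs forwards the search parameters to parse(), so the callback
    # knows which search target (and page) a response belongs to.
    return scrapy.Request(url, cb_kwargs={
        'contract_type': contract_type,
        'postal_code': postal_code,
        'cur_page': page,
    })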

Each link has multiple pages that will be followed, i.e.:

def parse(self, response, **kwargs):
    # Some Logic depending on the response
    # ...

    if cur_page < num_pages:  # Following the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request

    for estate_dict in estates:  # Parsing the items of response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
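
Where contract_type, postal_code and cur_page come from is not shown in the snippet above. Assuming get_request forwards them via cb_kwargs (as in the earlier sketch), parse() could declare them explicitly instead of reading them out of **kwargs:

def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
    ...  # same body as above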

Now, after a few pages, each link (target) starts encountering duplicate items that were already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.

def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()

    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    
    # Save the item in the DB
    # ...

Now, when I find a duplicate estate, I want Scrapy to stop following pages for that specific link target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post') in the pipeline with the info about the finished search target, and catch that exception in my spider. But how could I tell Scrapy to stop following links for that specific search target?
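
For illustration only, here is a rough sketch of one possible direction (not a verified solution): a pipeline's process_item receives the spider instance, so instead of only dropping the item it could mark the exhausted search target on the spider, and parse() could then skip the next-page request for marked targets. The finished_targets set, the (contract_type, postal_code) key and the item fields used below are assumptions made for this sketch.

import scrapy
from scrapy.exceptions import DropItem


class EstatePipeline:
    def process_item(self, item, spider):
        if self.is_duplicate(item):  # e.g. the DB lookup from save_estate_item above
            # Mark the whole search target as exhausted so the spider stops
            # scheduling further result pages for it.
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate estate')
        # ... save the item as before ...
        return item

    def is_duplicate(self, item) -> bool:
        ...  # the duplicate check from save_estate_item above


class EstateSpider(scrapy.Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Search targets that have already produced a duplicate item
        self.finished_targets = set()

    def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
        # ... num_pages, estates and item parsing as in the original parse() ...
        if (contract_type, postal_code) not in self.finished_targets and cur_page < num_pages:
            yield get_request(contract_type, postal_code, cur_page + 1)

One caveat with this kind of approach: requests are handled asynchronously, so pages that were already scheduled before the target was marked may still be fetched; the check only prevents scheduling new ones.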
