Scrapy: stop following requests for a specific target
My Scrapy spider has a bunch of independent target links to crawl.
def start_requests(self):
    search_targets = get_search_targets()
    for search in search_targets:
        request = get_request(search.contract_type, search.postal_code, 1)
        yield request
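
get_request isn't shown in the question; here is a minimal sketch of what it might look like, assuming the site exposes paginated search URLs (the URL below is a placeholder) and that the target's parameters ride along on the request via cb_kwargs so that parse receives them as keyword arguments:

import scrapy

def get_request(contract_type, postal_code, page):
    # Placeholder URL; the real one depends on the site being crawled.
    url = (f'https://example.com/search'
           f'?contract_type={contract_type}&postal_code={postal_code}&page={page}')
    # cb_kwargs are handed to the callback (parse, by default) as keyword arguments.
    return scrapy.Request(url, cb_kwargs={
        'contract_type': contract_type,
        'postal_code': postal_code,
        'cur_page': page,
    })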
Each link has multiple pages that will be followed, i.e.:
def parse(self, response, **kwargs):
    # Some logic depending on the response: cur_page, num_pages, estates,
    # contract_type and postal_code are derived here.
    # ...
    if cur_page < num_pages:  # Follow the link to the next page
        next_page = cur_page + 1
        request = get_request(contract_type, postal_code, next_page)
        yield request
    for estate_dict in estates:  # Parse the items of the response
        item = EstateItem()
        fill_item(item, estate_dict)
        yield item
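
If get_request attaches the target's parameters via cb_kwargs as sketched above, Scrapy passes them into parse as keyword arguments, so the callback can bind them by name instead of digging through **kwargs:

def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
    ...  # same logic as above, with the target's parameters in scope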
Now each link (target), after a few pages, will start encountering duplicates: items already seen in previous crawls. Whether an item is a duplicate is decided in the pipeline, with a query to the database.
def save_estate_item(self, item: EstateItem, session: Session):
    query = session.query(EstateModel)
    previous_item = query.filter_by(code=item['code']).first()
    if previous_item is not None:
        logging.info("Duplicate Estate")
        return
    # Save the item in the DB
    # ...
Now, when I find a duplicate estate, I want Scrapy to stop following pages for that specific link target. How could I do that?
I figured I would raise exceptions.DropItem('Duplicate post')
in the pipeline, with the info about the finished search target, and catch that exception in my spider. But how can I tell Scrapy to stop following links for that specific search target?
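
For what it's worth, a DropItem raised in a pipeline only propagates through the item pipeline chain; the spider callback has long since returned and cannot catch it. The coordination therefore has to happen through shared state. Below is a minimal sketch of one way to do that, assuming a search target is identified by (contract_type, postal_code) and that those values are stored on the item (both assumptions, since EstateItem's fields aren't shown): process_item receives the spider instance, so the pipeline can mark a target as exhausted, and parse checks that mark before scheduling the next page.

from scrapy.exceptions import DropItem

class SaveEstatePipeline:
    def process_item(self, item, spider):
        # is_duplicate is hypothetical shorthand for the DB query shown above.
        if self.is_duplicate(item):
            # Mark this search target as exhausted so the spider stops paginating it.
            spider.finished_targets.add((item['contract_type'], item['postal_code']))
            raise DropItem('Duplicate estate')
        # ... save the item in the DB ...
        return item

And in the spider:

import scrapy

class EstateSpider(scrapy.Spider):
    name = 'estates'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished_targets = set()  # (contract_type, postal_code) pairs to stop following

    def parse(self, response, contract_type=None, postal_code=None, cur_page=1, **kwargs):
        # ... extract num_pages and estates, yield items, as above ...
        if cur_page < num_pages and (contract_type, postal_code) not in self.finished_targets:
            yield get_request(contract_type, postal_code, cur_page + 1)

One caveat: Scrapy is concurrent, so a next-page request may already be scheduled by the time the pipeline flags the duplicate; the set only prevents further pagination, so an extra page or two of requests can still go out.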