How to backtrack while crawling with Scrapy?
I want to write a general-purpose crawler that crawls one site at a time. However, I would like it to backtrack when it sees a page that I deem relevant.
For example, I want to extract job ads from a company website, and I have an ML model that can classify a job ad page. When the crawler hits a page and the model predicts that it is a job ad, I would like it to go one step back (hopefully to the careers page) and navigate from there to the other job ad pages. Is this possible with Scrapy?
Comments (1)
Since Scrapy uses a callback-driven asynchronous engine, you can't really backtrack implicitly. Instead, use the meta keyword of Request to pass the URL of the referring page along with each request, so that the callback knows which page it came from.
You can also access the request that produced a response via response.request; however, using meta is generally a more robust solution.