Being a good citizen and web-scraping
I have a two-part question.
First, I'm writing a web-scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I'm wondering is what methods exist to prevent my spider from overloading the site. Is there possibly a way to do things incrementally, or to put a pause in between different requests?
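The setup described here, a CrawlSpider that starts on one page, follows links a couple of levels down, and parses only the record pages, would look roughly like the following minimal sketch. The domain, URL patterns, and selectors are placeholders, and the import paths are for current Scrapy releases rather than the 0.14-era ones discussed in the answers below.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RecordSpider(CrawlSpider):
    """Hypothetical spider: start on one page, follow index links a few
    levels down, and parse only the pages that look like records."""

    name = "records"
    allowed_domains = ["example.com"]              # placeholder domain
    start_urls = ["https://example.com/records/"]  # placeholder start page

    rules = (
        # Follow intermediate listing/index pages without parsing them.
        Rule(LinkExtractor(allow=r"/records/page/")),
        # Pages matching the record pattern get handed to parse_record().
        Rule(LinkExtractor(allow=r"/records/\d+"), callback="parse_record"),
    )

    def parse_record(self, response):
        # Placeholder selectors; the real ones depend on the site's HTML.
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }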
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
2 Answers
I'm using Scrapy's caching ability to scrape the site incrementally:
HTTPCACHE_ENABLED = True
Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
Check these settings:
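Presumably the settings meant here are the caching, throttling, and crawl-limit ones; a sketch for settings.py with purely illustrative values:

# settings.py -- illustrative values only, tune them for the target site

HTTPCACHE_ENABLED = True            # cache responses so repeat runs don't re-hit the site

DOWNLOAD_DELAY = 2                  # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # keep concurrency low to stay polite

AUTOTHROTTLE_ENABLED = True         # adapt the delay to how fast the server responds

# Handy while testing: stop the crawl early instead of killing the process.
CLOSESPIDER_PAGECOUNT = 10          # close the spider after this many responses
CLOSESPIDER_ITEMCOUNT = 5           # or after this many scraped items

JOBDIR = "crawls/records-1"         # persist state so a crawl can be paused and resumed

With JOBDIR set (or passed on the command line as scrapy crawl records -s JOBDIR=crawls/records-1, using the hypothetical spider name from the sketch above), pressing Ctrl-C once lets the spider finish its in-flight requests and save its state, and running the same command again resumes where it left off.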
You can try debugging your code in the Scrapy shell.
Also, you can call scrapy.shell.inspect_response at any time in your spider.
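For example (a sketch against current Scrapy, where inspect_response takes both the response and the spider; the URL is a placeholder):

import scrapy
from scrapy.shell import inspect_response


class DebugSpider(scrapy.Spider):
    """Hypothetical throwaway spider that fetches a single page for inspection."""

    name = "debug"
    start_urls = ["https://example.com/records/123"]  # placeholder record URL

    def parse(self, response):
        # Drops into an interactive shell with `response` loaded, so selectors
        # can be tried out before writing the real parsing code.
        inspect_response(response, self)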
The Scrapy documentation is the best resource.
You have to start crawling and log everything. In case you get banned, you can add a sleep() before page requests.
Changing the User-Agent is good practice, too (http://www.user-agents.org/, http://www.useragentstring.com/).
If you get banned by IP, use a proxy to bypass it. Cheers.
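In Scrapy terms, that advice might look something like the sketch below: the USER_AGENT setting (here via custom_settings) changes the identifying header, and the default HttpProxyMiddleware routes a request through whatever proxy is given in meta["proxy"]. The user-agent string, URLs, and proxy address are all placeholders.

import scrapy


class PoliteSpider(scrapy.Spider):
    """Hypothetical spider showing a custom User-Agent and a per-request proxy."""

    name = "polite"
    # Identify the crawler honestly; this string is only a placeholder.
    custom_settings = {
        "USER_AGENT": "records-crawler (+https://example.com/contact)",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/records/",           # placeholder start URL
            # HttpProxyMiddleware (enabled by default) sends the request
            # through the proxy named in meta["proxy"].
            meta={"proxy": "http://127.0.0.1:8080"},  # placeholder proxy address
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)

For the pause itself, DOWNLOAD_DELAY (shown in the settings sketch in the first answer) is the Scrapy-native alternative to an explicit sleep().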