What is the standard for requests per second when scraping a website?
This was the closest question to mine, and in my opinion it wasn't really answered very well:
I'm looking for the answer to #1:
How many requests/second should you be doing to scrape?
Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second between requests. I request gzip compression to save bandwidth. A rough sketch of this setup is below.
Are there standards for this? Surely all the big search engines follow some set of guidelines in this regard.
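For reference, a minimal sketch of the kind of setup I described above, assuming Python with the requests library; the URL, queue contents, and the handle_page helper are just placeholders:

    import queue
    import threading
    import time

    import requests  # assumed HTTP client; it decompresses gzip responses automatically

    def handle_page(html):
        pass  # placeholder for whatever parsing happens on each page

    def scrape_site(link_queue):
        """Worker for a single site: one request per second, gzip-compressed responses."""
        session = requests.Session()
        # requests sends an Accept-Encoding header by default; made explicit here
        session.headers["Accept-Encoding"] = "gzip"
        while True:
            try:
                url = link_queue.get(timeout=5)
            except queue.Empty:
                return
            response = session.get(url, timeout=10)
            handle_page(response.text)
            link_queue.task_done()
            time.sleep(1)  # fixed 1-second delay between requests to the same site

    # one queue and one thread per site being scraped
    site_queue = queue.Queue()
    site_queue.put("https://example.com/page1")
    threading.Thread(target=scrape_site, args=(site_queue,)).start()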
3 Answers
The Wikipedia article on web crawling has some info about what others are doing.

I generally try 5 seconds with a bit of randomness so it looks less suspicious.
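Something along these lines, as a Python sketch; the 5-second base and the jitter range are just the numbers I tend to use, not a rule:

    import random
    import time

    def polite_sleep(base=5.0, jitter=2.0):
        """Sleep roughly `base` seconds, plus or minus up to `jitter` seconds."""
        time.sleep(base + random.uniform(-jitter, jitter))

    # between requests:
    polite_sleep()  # waits somewhere between 3 and 7 seconds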
There is no set standard for this; it depends on how much load your scraping puts on the site. As long as you aren't noticeably affecting the site's speed for other users, it should be an acceptable scraping rate.

Since the number of users and the load on a website fluctuate constantly, it's a good idea to adjust your scraping speed dynamically.

Monitor the latency of downloading each page, and if latency starts to increase, back off your scraping speed. Essentially, your scraping speed should be inversely proportional to the site's load/latency.
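One way to sketch that idea in Python (the multiplier, floor, and ceiling here are arbitrary starting points, not a standard):

    import time

    import requests  # assumed HTTP client

    def fetch_with_adaptive_delay(urls, min_delay=0.5, max_delay=30.0):
        """Scale the pause between requests with the observed response latency."""
        session = requests.Session()
        for url in urls:
            start = time.monotonic()
            response = session.get(url, timeout=30)
            latency = time.monotonic() - start
            yield response
            # Slow down when the site slows down: make the delay a multiple of the
            # observed latency, clamped to a sane range. The factor of 10 is arbitrary.
            delay = min(max_delay, max(min_delay, latency * 10))
            time.sleep(delay)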
When my clients/boss ask me to do something like this, I usually look for a public API before resorting to scraping the public site. Also, contacting the site owner or a technical contact and asking for permission will keep the "cease and desist" letters to a minimum.