What is the standard for requests per second when scraping a website?

Posted on 2024-09-03 14:55:37

This was the closest question to my question and it wasn't really answered very well imo:

Web scraping etiquette

I'm looking for the answer to #1:

How many requests/second should you be doing to scrape?

Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second between requests. I ask for gzip compression to save bandwidth.

Are there standards for this? Surely all the big search engines have some set of guidelines they follow in regards to this.
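
For concreteness, here is a minimal sketch of the setup described above, assuming Python with the `requests` library; the per-site queue layout and the `handle_page` hook are hypothetical placeholders rather than part of the original description.

```python
import queue
import threading
import time

import requests


def scrape_site(link_queue: "queue.Queue[str]", delay: float = 1.0) -> None:
    """Worker for one site: fetch links one at a time with a fixed pause between requests."""
    session = requests.Session()
    # Ask the server for compressed responses to save bandwidth.
    session.headers.update({"Accept-Encoding": "gzip"})
    while True:
        try:
            url = link_queue.get_nowait()
        except queue.Empty:
            return
        response = session.get(url, timeout=30)
        handle_page(url, response.text)  # hypothetical processing hook
        time.sleep(delay)                # politeness delay between requests


def handle_page(url: str, html: str) -> None:
    print(f"fetched {url}: {len(html)} characters")


# One thread per site, each pulling from its own queue of links.
site_queues = {"example.com": queue.Queue()}
site_queues["example.com"].put("https://example.com/")
threads = [threading.Thread(target=scrape_site, args=(q,)) for q in site_queues.values()]
for t in threads:
    t.start()
for t in threads:
    t.join()
```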

Comments (3)

许久 2024-09-10 14:55:37

The Wikipedia article on web crawling has some info about what others are doing:

Cho [22] uses 10 seconds as an interval for accesses, and the WIRE crawler [28] uses 15 seconds as the default. The MercatorWeb crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. [29] Dill et al. [30] use 1 second.

I generally try 5 seconds with a bit of randomness so it looks less suspicious.
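
To make the quoted policies concrete, here is a small sketch, assuming Python and the `requests` library: the Mercator-style adaptive wait (10t seconds after a download that took t seconds) and a roughly 5-second delay with a bit of randomness. The function names, the configurable factor, and the jitter range are illustrative assumptions, not part of the quoted sources.

```python
import random
import time

import requests


def fetch_with_adaptive_politeness(url: str, factor: float = 10.0) -> str:
    """Mercator-style rule: if the download took t seconds, wait factor * t before the next one."""
    start = time.monotonic()
    response = requests.get(url, timeout=30)
    elapsed = time.monotonic() - start
    time.sleep(factor * elapsed)
    return response.text


def jittered_delay(base: float = 5.0, jitter: float = 2.0) -> None:
    """Roughly `base` seconds, randomized so the request pattern looks less regular."""
    time.sleep(base + random.uniform(-jitter, jitter))
```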

可可 2024-09-10 14:55:37

There is no set standard for this, it depends on how much load the web scraping causes. As long as you aren't noticeably affecting the speed of the site for other users, it should be an acceptable scraping speed.

Since the amount of users and load on a website fluctuates constantly, it'd be a good idea to dynamically adjust your scraping speed.

Monitor the latency of downloading each page, and if the latency is starting to increase, start to decrease your scraping speed. Essentially, the website's load/latency should be inversely proportional to your scraping speed.
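
One possible shape for this, as a sketch only and assuming Python with the `requests` library: keep a moving average of recent download latencies and scale the sleep between requests with it, so the request rate drops as the site slows down. The `LatencyThrottle` name, the `target_ratio` factor, and the window size are illustrative assumptions, not a prescribed algorithm.

```python
import time
from collections import deque

import requests


class LatencyThrottle:
    """Keep the inter-request delay roughly proportional to recent page latency."""

    def __init__(self, target_ratio: float = 5.0, min_delay: float = 1.0, window: int = 10):
        self.target_ratio = target_ratio       # delay = ratio * average latency
        self.min_delay = min_delay
        self.latencies = deque(maxlen=window)  # recent download times, in seconds

    def fetch(self, url: str) -> str:
        start = time.monotonic()
        response = requests.get(url, timeout=30)
        self.latencies.append(time.monotonic() - start)
        avg_latency = sum(self.latencies) / len(self.latencies)
        # Sleep longer when the site is responding slowly, never less than min_delay.
        time.sleep(max(self.min_delay, self.target_ratio * avg_latency))
        return response.text


# Usage sketch:
# throttle = LatencyThrottle()
# html = throttle.fetch("https://example.com/")
```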

花心好男孩 2024-09-10 14:55:37

When my clients/boss ask me to do something like this, I usually look for a public API before I resort to scraping the public site. Also, contacting the site owner or technical contact and asking permission will keep the "cease and desist" letters to a minimum.
