I'm working with a librarian to re-structure his organization's digital photography archive.
I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted into a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, with many more to come as the project progresses.
I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there isn't I'll give my robot delays so it behaves politely toward the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is: about how quickly can I make HTTP requests before hitting a built-in rate limit?
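In case it matters, this is roughly the shape of the throttled fetch loop I'm planning; it's only a sketch, and the delay value, user-agent string, and example URLs are placeholders:

```python
import time

import mechanize
from bs4 import BeautifulSoup

DELAY_SECONDS = 5  # placeholder politeness delay between requests

br = mechanize.Browser()
br.set_handle_robots(True)  # honor the site's robots.txt
br.addheaders = [("User-agent", "archive-restructure-bot")]  # placeholder UA string

def fetch(url):
    """Fetch one page, parse it, then pause before the next request."""
    response = br.open(url)
    soup = BeautifulSoup(response.read(), "html.parser")
    time.sleep(DELAY_SECONDS)
    return soup

# Hypothetical list of record URLs built from the search dictionary.
record_urls = ["http://example.org/photo/%d" % i for i in range(1, 10)]
records = [fetch(url) for url in record_urls]
```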
I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.
Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.
They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.
There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for a Crawl-delay directive in the site's robots.txt.
HTTP does have a limit of two concurrent connections per server, but browsers have already started ignoring it, and efforts are underway to revise that part of the standard as it is quite outdated.
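If the site does publish a Crawl-delay, you can read it with the standard library instead of parsing robots.txt by hand. A minimal sketch (the robots.txt URL is a placeholder; crawl_delay() requires Python 3.6+):

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL; substitute the archive's actual robots.txt.
rp = RobotFileParser("https://example.org/robots.txt")
rp.read()

# crawl_delay() returns the suggested delay in seconds for the given
# user agent, or None if the directive is absent.
delay = rp.crawl_delay("*")
print(delay if delay is not None else "No Crawl-delay directive found")
```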