Python:避免在爬行时下载未更改页面的最佳算法
我正在编写一个爬虫,它定期检查新闻网站列表中的新文章。 我读过有关避免不必要的页面下载的不同方法,基本上确定了 5 个标头元素,可用于确定页面是否已更改:
- HTTP Status
- ETAG
- Last_modified (与 If-Modified-Since 请求结合)
- Expires
- Content-Length 。
优秀的 FeedParser.org 似乎实现了其中一些方法。
我正在寻找Python(或任何类似语言)中做出此类决定的最佳代码。 请记住,标头信息始终由服务器提供。
这可能是这样的:
def shouldDonwload(url,prev_etag,prev_lastmod,prev_expires, prev_content_length):
#retrieve the headers, do the magic here and return the decision
return decision
I am writing a crawler which regularly inspects a list of news websites for new articles.
I have read about different approaches for avoiding unnecessary pages downloads, basically identified 5 header elements that could be useful to determine if the page has changed or not:
- HTTP Status
- ETAG
- Last_modified (to combine with If-Modified-Since request)
- Expires
- Content-Length.
The excellent FeedParser.org seems to implement some of these approaches.
I am looking for an optimal code in Python (or any similar language) that makes this kind of decision.
Keep in mind that header info is always provided by the server.
That could be something like :
def shouldDonwload(url,prev_etag,prev_lastmod,prev_expires, prev_content_length):
#retrieve the headers, do the magic here and return the decision
return decision
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在发出请求之前您唯一需要检查的是
Expires
。If-Modified-Since
不是服务器发送给您的内容,而是您发送给服务器的内容。您想要做的是一个带有
If-Modified-Since
标头的 HTTPGET
标头,指示您上次检索资源的时间。如果您返回状态代码304
而不是通常的200
,则该资源自那时以来尚未被修改,您应该使用您存储的副本(新副本将不会被修改)发送)。此外,您应该保留上次检索文档时的
Expires
标头,如果您存储的文档副本尚未过期,则根本不要发出GET
。将其转换为 Python 留作练习,但向请求添加
If-Modified-Since
标头以存储响应中的Expires
标头应该很简单,并检查响应中的状态代码。The only thing you need to check before making the request is
Expires
.If-Modified-Since
is not something the server sends you, but something you send the server.What you want to do is an HTTP
GET
with anIf-Modified-Since
header indicating when you last retrieved the resource. If you get back status code304
rather than the usual200
, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).Additionally, you should retain the
Expires
header from the last time you retrieved the document, and not issue theGET
at all if your stored copy of the document has not expired.Translating this into Python is left as an exercise, but it should be straightforward to add an
If-Modified-Since
header to a request, to store theExpires
header from the response, and to check the status code from the response.您需要将标头字典传递给
shouldDownload
(或urlopen
的结果):在打开 URL 时执行此操作:
You would need to pass in a dict of headers to
shouldDownload
(or the result ofurlopen
):Do that when you open the URL: