Python: best algorithm to avoid downloading unchanged pages while crawling

Posted on 2024-12-07 02:00:03


I am writing a crawler which regularly inspects a list of news websites for new articles.
I have read about different approaches for avoiding unnecessary page downloads, and have identified five header elements that can help determine whether a page has changed:

  1. HTTP status code
  2. ETag
  3. Last-Modified (combined with an If-Modified-Since request)
  4. Expires
  5. Content-Length

The excellent FeedParser.org seems to implement some of these approaches.

I am looking for optimal code in Python (or any similar language) that makes this kind of decision.
Keep in mind that the header info is not always provided by the server.

That could be something like:

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    return decision
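As a sketch of that decision (the helper names are my own, not from the question), one way to get the headers without transferring the body is a HEAD request, followed by a plain comparison against the stored values:

```python
import urllib.request

def get_response_headers(url):
    """Fetch only the response headers via a HEAD request (no body transferred)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return dict(resp.headers)

def should_download(headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    """Re-download when any stored validator differs from the current headers."""
    return (headers.get("ETag") != prev_etag
            or headers.get("Last-Modified") != prev_lastmod
            or headers.get("Expires") != prev_expires
            or headers.get("Content-Length") != prev_content_length)
```

Note that a missing header compares as `None`, so a server that stops sending a validator counts as a change, which errs on the side of re-downloading.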

Answers (2)

时光暖心i 2024-12-14 02:00:03


The only thing you need to check before making the request is Expires. If-Modified-Since is not something the server sends you, but something you send the server.

What you want to do is an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).

Additionally, you should retain the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not expired.

Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to a request, to store the Expires header from the response, and to check the status code from the response.
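The answer leaves the translation as an exercise; a minimal standard-library sketch under its assumptions might look like this (the helper names and the UTC default are mine, not from the answer):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request

def is_still_fresh(expires_header, now=None):
    """True if a stored Expires header says the cached copy has not expired yet."""
    if not expires_header:
        return False
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False  # unparseable date: treat the stored copy as stale
    return (now or datetime.now(timezone.utc)) < expires

def fetch_if_modified(url, last_fetched_http_date):
    """Conditional GET; returns the body, or None when the server answers 304."""
    req = urllib.request.Request(
        url, headers={"If-Modified-Since": last_fetched_http_date})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since our last fetch
            return None
        raise
```

The crawler would call `is_still_fresh` first and skip the request entirely while the stored copy is fresh, then fall back to the conditional GET.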

何处潇湘 2024-12-14 02:00:03


You would need to pass in a dict of headers to shouldDownload (or the result of urlopen):

def shouldDownload(url, headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    return (prev_content_length != headers.get("Content-Length")
            or prev_lastmod != headers.get("Last-Modified")
            or prev_expires != headers.get("Expires")
            or prev_etag != headers.get("ETag"))
    # or the optimistic way:
    # return (prev_content_length == headers.get("Content-Length")
    #         and prev_lastmod == headers.get("Last-Modified")
    #         and prev_expires == headers.get("Expires")
    #         and prev_etag == headers.get("ETag"))

Do that when you open the URL:

# `urlopen()` doesn't read the whole body until `.read()` is called,
# and the response headers are available immediately via `.headers`.
s = urllib2.urlopen(MYURL)
try:
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod,
                      prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source
# except urllib2.HTTPError, etc. if you need to
finally:
    s.close()
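The snippet above compares headers after the response arrives; a present-day variant with `urllib.request` can instead send the stored validators and let the server decide, using ETag via If-None-Match (the helper names are my own assumptions):

```python
import urllib.error
import urllib.request

def build_conditional_request(url, prev_etag=None, prev_lastmod=None):
    """Attach the stored validators so the server can answer 304 Not Modified."""
    headers = {}
    if prev_etag:
        headers["If-None-Match"] = prev_etag
    if prev_lastmod:
        headers["If-Modified-Since"] = prev_lastmod
    return urllib.request.Request(url, headers=headers)

def fetch_if_changed(url, prev_etag=None, prev_lastmod=None):
    """Return (body, etag, last_modified); body is None when nothing changed."""
    req = build_conditional_request(url, prev_etag, prev_lastmod)
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:  # unchanged: reuse the stored copy and validators
            return None, prev_etag, prev_lastmod
        raise
```

This pushes the "has it changed?" decision to the server, which saves the body transfer whenever the server honours conditional requests.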