Python: best algorithm to avoid downloading unchanged pages while crawling

Posted on 2024-12-07 02:00:03


I am writing a crawler which regularly inspects a list of news websites for new articles.
I have read about different approaches for avoiding unnecessary page downloads, and have identified five header elements that can help determine whether a page has changed:

  1. HTTP status code
  2. ETag
  3. Last-Modified (combined with an If-Modified-Since request)
  4. Expires
  5. Content-Length

The excellent FeedParser.org seems to implement some of these approaches.

I am looking for optimal code in Python (or any similar language) that makes this kind of decision.
Keep in mind that the header info is not always provided by the server.

That could be something like:

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    return decision
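As a sketch of that decision (the helper names are my own, not from the question), one way to get the headers without transferring the body is a HEAD request, followed by a plain comparison against the stored values:

```python
import urllib.request

def get_response_headers(url):
    """Fetch only the response headers via a HEAD request (no body transferred)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return dict(resp.headers)

def should_download(headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    """Re-download when any stored validator differs from the current headers."""
    return (headers.get("ETag") != prev_etag
            or headers.get("Last-Modified") != prev_lastmod
            or headers.get("Expires") != prev_expires
            or headers.get("Content-Length") != prev_content_length)
```

Note that a missing header compares as `None`, so a server that stops sending a validator counts as a change, which errs on the side of re-downloading.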

Answers (2)

时光暖心i 2024-12-14 02:00:03


The only thing you need to check before making the request is Expires. If-Modified-Since is not something the server sends you, but something you send the server.

What you want to do is an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 rather than the usual 200, the resource has not been modified since then, and you should use your stored copy (a new copy will not be sent).

Additionally, you should retain the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy of the document has not expired.

Translating this into Python is left as an exercise, but it should be straightforward to add an If-Modified-Since header to a request, to store the Expires header from the response, and to check the status code from the response.
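The answer leaves the translation as an exercise; a minimal standard-library sketch under its assumptions might look like this (the helper names and the UTC default are mine, not from the answer):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request

def is_still_fresh(expires_header, now=None):
    """True if a stored Expires header says the cached copy has not expired yet."""
    if not expires_header:
        return False
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False  # unparseable date: treat the stored copy as stale
    return (now or datetime.now(timezone.utc)) < expires

def fetch_if_modified(url, last_fetched_http_date):
    """Conditional GET; returns the body, or None when the server answers 304."""
    req = urllib.request.Request(
        url, headers={"If-Modified-Since": last_fetched_http_date})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since our last fetch
            return None
        raise
```

The crawler would call `is_still_fresh` first and skip the request entirely while the stored copy is fresh, then fall back to the conditional GET.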

何处潇湘 2024-12-14 02:00:03


You would need to pass in a dict of headers to shouldDownload (or the result of urlopen):

def shouldDownload(url, headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    return (prev_content_length != headers.get("Content-Length")
            or prev_lastmod != headers.get("Last-Modified")
            or prev_expires != headers.get("Expires")
            or prev_etag != headers.get("ETag"))
    # or the optimistic way:
    # return (prev_content_length == headers.get("Content-Length")
    #         and prev_lastmod == headers.get("Last-Modified")
    #         and prev_expires == headers.get("Expires")
    #         and prev_etag == headers.get("ETag"))

Do that when you open the URL:

# `urlopen()` doesn't read the whole body until `.read()` is called,
# and the response headers are available immediately via `.headers`.
s = urllib2.urlopen(MYURL)
try:
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod,
                      prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source
# except urllib2.HTTPError, etc. if you need to
finally:
    s.close()
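The snippet above compares headers after the response arrives; a present-day variant with `urllib.request` can instead send the stored validators and let the server decide, using ETag via If-None-Match (the helper names are my own assumptions):

```python
import urllib.error
import urllib.request

def build_conditional_request(url, prev_etag=None, prev_lastmod=None):
    """Attach the stored validators so the server can answer 304 Not Modified."""
    headers = {}
    if prev_etag:
        headers["If-None-Match"] = prev_etag
    if prev_lastmod:
        headers["If-Modified-Since"] = prev_lastmod
    return urllib.request.Request(url, headers=headers)

def fetch_if_changed(url, prev_etag=None, prev_lastmod=None):
    """Return (body, etag, last_modified); body is None when nothing changed."""
    req = build_conditional_request(url, prev_etag, prev_lastmod)
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:  # unchanged: reuse the stored copy and validators
            return None, prev_etag, prev_lastmod
        raise
```

This pushes the "has it changed?" decision to the server, which saves the body transfer whenever the server honours conditional requests.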