How to tell whether a page on the server has been modified

Posted 2024-10-15 07:33:19

I need to check whether a server has modified the contents of a page so that I can retrieve that page again. I tried using the "Last-Modified" and "ETag" response headers via HttpClient, but on many pages these values are missing. Is there any other way to handle this in Java code, or an open-source tool that does this?

Thanks in advance

Comments (3)

白日梦 2024-10-22 07:33:19

The only way to know for sure is to retrieve the page and compare the newer version with the older one yourself. The Last-Modified header is unreliable: it may not be present, or it could be deliberately spoofed by sites that do not want dynamic content re-indexed (for whatever reason). The Content-Length header may also be absent, so you can't necessarily rely on it either.
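One way to do the "compare the newer with the older version" step without storing whole pages is to keep a digest of the last body you fetched. The following is only a sketch under that assumption; the class and method names (`PageDigest`, `sha256Hex`, `hasChanged`) are made up for illustration, and fetching the body is left to whatever HTTP client you already use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: remembers a page by the SHA-256 digest of its body.
class PageDigest {

    // Returns the SHA-256 digest of the body as a lowercase hex string.
    static String sha256Hex(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True if there is no stored digest yet, or the fresh body hashes differently.
    static boolean hasChanged(String storedDigest, byte[] freshBody) {
        return storedDigest == null || !storedDigest.equals(sha256Hex(freshBody));
    }
}
```

Any byte-level difference changes the digest, so this reports every change, including ones in dynamic parts of the page.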

When you compare page content, you must decide whether you are interested in all changes or only in changes to the relevant content areas of the page, i.e. excluding dynamic elements such as menus, date-times, etc.
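As a rough illustration of ignoring irrelevant parts, the sketch below strips scripts, styles and markup and collapses whitespace before comparing. It is an assumption-laden simplification: `ContentNormalizer` is a hypothetical name, the regular expressions are not a real HTML parser, and menus or timestamps would still need site-specific rules (an HTML parser such as jsoup would be a more robust basis).

```java
import java.util.regex.Pattern;

// Hypothetical helper: reduces an HTML page to its visible text before comparing.
class ContentNormalizer {
    // Whole <script>...</script> and <style>...</style> elements, case-insensitive.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes scripts, styles and tags, then collapses runs of whitespace.
    static String normalize(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Two versions count as the same if their normalized text is identical.
    static boolean sameContent(String oldHtml, String newHtml) {
        return normalize(oldHtml).equals(normalize(newHtml));
    }
}
```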

If you do the comparison yourself, you could probably just check the lengths of the respective documents as a quick test for absolute sameness, or else extract the relevant content areas of the page and do a text comparison. To compare similar pages you could also use a "sim-hash", whereby the hash values for similar data are close to each other (as opposed to usual hashing, where similar inputs produce unrelated values).
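To make the sim-hash idea concrete, here is a minimal 64-bit sketch. The choices in it are assumptions, not something the answer specifies: tokens are lowercase words split on non-word characters (so it is ASCII-oriented), every token has weight 1, and FNV-1a is used as the per-token hash. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical 64-bit SimHash over word tokens, each with weight 1.
class SimHash {

    // 64-bit FNV-1a hash of a single token.
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Each token votes +1 or -1 on every bit; a bit is set if its votes are positive.
    static long fingerprint(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) result |= (1L << bit);
        }
        return result;
    }

    // Hamming distance: the number of differing bits between two fingerprints.
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Two versions of a page would then be treated as "the same" when `distance(fingerprint(oldText), fingerprint(newText))` is below a threshold you choose by experiment; the right threshold depends on the pages and is not something this sketch can fix in advance.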

时光匆匆的小流年 2024-10-22 07:33:19

If you send an If-Modified-Since header with the request, the server returns HTTP 304 (Not Modified) if the entity has not been modified since the date specified in the header, and returns the new entity if it has.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25
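A conditional request along these lines might look like the sketch below, which assumes Java 11+ and its built-in `java.net.http` client. `ConditionalGet`, `buildRequest` and `isUnchanged` are hypothetical names, and using the time of your own last fetch as the date is an assumption for pages that never sent a Last-Modified value.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for conditional GET requests.
class ConditionalGet {
    // HTTP date format, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
    static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    // Builds a GET request asking for the page only if it changed after lastFetch.
    static HttpRequest buildRequest(String url, Instant lastFetch) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("If-Modified-Since", HTTP_DATE.format(lastFetch))
                .GET()
                .build();
    }

    // 304 Not Modified means the stored copy is still current.
    static boolean isUnchanged(int statusCode) {
        return statusCode == 304;
    }
}
```

The request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` and the result checked with `isUnchanged(response.statusCode())`. Servers that do not track modification times may ignore the header and always answer 200, so a 200 response does not by itself prove the page changed.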

独留℉清风醉 2024-10-22 07:33:19

Compare two Content-Length headers? It is quite likely that the length will not be exactly the same if the page has been modified in some way. Not a perfect solution, but good enough for practical purposes.
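A sketch of this check, again assuming Java 11+ and `java.net.http`, could use a HEAD request so that only the headers are transferred. The names `ContentLengthCheck` and `Verdict` are hypothetical; the three-way result reflects that the header may be missing and that an equal length does not prove equal content.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.OptionalLong;

// Hypothetical helper: compares Content-Length values from two fetches.
class ContentLengthCheck {
    enum Verdict { CHANGED, POSSIBLY_UNCHANGED, UNKNOWN }

    // Builds a HEAD request so the server returns headers without the body.
    static HttpRequest headRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // A differing length means the page changed; an equal length is only a hint.
    static Verdict compare(OptionalLong previous, OptionalLong current) {
        if (previous.isEmpty() || current.isEmpty()) {
            return Verdict.UNKNOWN; // header missing on at least one response
        }
        return previous.getAsLong() == current.getAsLong()
                ? Verdict.POSSIBLY_UNCHANGED
                : Verdict.CHANGED;
    }
}
```

After sending the request, the current value can be read with `response.headers().firstValueAsLong("Content-Length")`, which returns an empty `OptionalLong` when the header is absent.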
