How to tell whether a page on a server has been modified
I need to check whether the server has modified the contents of a page so that I can retrieve that page again. I tried using the "Last-Modified" and "ETag" response headers with HttpClient, but on many pages these values are missing. Is there another way to handle this in Java code, or an open-source tool that does this?
Thanks in advance
Comments (3)
The only way to know for sure is to retrieve the page and compare the newer version with the older one yourself. The Last-Modified header is unreliable: it may not be present, or it could be deliberately spoofed by sites that do not want dynamic content re-indexed (for whatever reason). The Content-Length header may also be absent, so you can't necessarily rely on it either.
When you compare page content you must decide whether you are interested in all changes, or only in changes to the relevant content areas of the page, e.g. excluding dynamic elements such as menus, date-times etc.
If you do the comparison yourself, you could probably just compare the lengths of the two documents as a quick check for exact sameness (equal lengths do not prove identical content, but different lengths prove a change), or else extract the relevant content areas of the page and do a text comparison. To compare similar pages you could also use a "sim-hash", whereby the hash values for similar data are close to each other (as opposed to ordinary hashing, where a small change in the input gives an unrelated hash value).
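As a rough illustration of the two comparison styles described above, here is a minimal Java sketch. The class name PageFingerprint and its methods are hypothetical and not part of any library. It assumes you have already extracted the relevant text from each page version, uses SHA-256 for the exact check, and builds a simple 64-bit sim-hash from whitespace-separated words hashed with FNV-1a. Real sim-hash implementations usually weight tokens or use word n-grams, so treat this only as a starting point.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not taken from any library.
class PageFingerprint {

    /** Exact fingerprint: identical text always yields an identical digest. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** 64-bit FNV-1a hash of a single token. */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    /** Simplified sim-hash: each word votes on each of the 64 output bits. */
    static long simHash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= (1L << bit);
            }
        }
        return result;
    }

    /** Number of differing bits; small values suggest similar text. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

To use it, store sha256Hex(text) (and optionally simHash(text)) for each page after a crawl. On the next crawl, a different SHA-256 digest means the extracted text changed. The Hamming distance between the old and new sim-hash gives a rough idea of how much it changed, and the threshold for "different enough" is something you would have to tune for your own pages.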
If you send the If-Modified-Since header with your request, the server will return HTTP 304 if the entity has not been modified, and will return the new entity if it has been modified since the date specified in the header.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.25
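A conditional request of this kind could look roughly like the sketch below. It uses the JDK's built-in HttpURLConnection rather than the HttpClient library mentioned in the question, purely to keep the example free of dependencies. The class name ConditionalFetch, the method fetchIfModified, and the convention of returning null for "not modified" are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration; names are not from any library.
class ConditionalFetch {

    /**
     * Fetches the page only if the server reports a change since lastFetchMillis.
     * Returns the response body, or null when the server answers 304 Not Modified.
     * Pass 0 for lastFetchMillis to fetch unconditionally.
     */
    static byte[] fetchIfModified(String url, long lastFetchMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        if (lastFetchMillis > 0) {
            // Adds the If-Modified-Since request header with this timestamp.
            conn.setIfModifiedSince(lastFetchMillis);
        }
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null; // server says nothing changed
            }
            // Note: getInputStream() throws IOException for 4xx/5xx responses.
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Keep in mind that this only helps with servers that actually honour the header. A server that ignores If-Modified-Since will simply answer 200 with the full body every time, in which case you are back to comparing the content yourself.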
Compare the two Content-Length headers? If the page has been modified in some way, the length will quite likely not be exactly the same. Not a perfect solution, but good enough for practical purposes.