What is unique about an HTML page?
My question is about verification more than anything else. What can be used to determine what is unique in an HTML document? (The document can have a degree of being dynamic.)
What is able to be used, or generated to recognize that a page is the correct page to an accuracy of say 99%, taking into consideration you can store a "fingerprint" of sorts of the page you are verifying?
For clarity, this is an added extra to encryption/https etc. This page can and will change with dynamic content according to specific users, however so can the fingerprint, but a single fingerprint cannot 100% match 100% of users due to the nature of dynamic content. Therefore a hash cannot work here, at least not in a simplistic form.
A unique fingerprint of an HTML page is easy to calculate. Build a hash from the following:
Optionally, some headers:
Server
Content-Type (this is important)
Content-Encoding (this probably too)
This assumes you're not POSTing any data to the pages.
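The header-plus-body hashing described above can be sketched as follows. This is a minimal illustration, not the answerer's implementation; the choice of which headers to fold into the digest is an assumption taken from the list above.

```python
import hashlib

def fingerprint(headers: dict, body: bytes) -> str:
    """Build a fingerprint from selected response headers plus the body.

    `headers` is assumed to be a dict of HTTP response headers. The
    header names included here follow the list in the answer above.
    """
    h = hashlib.sha256()
    # Hash the optional headers in a fixed order so the digest is stable.
    for name in ("Server", "Content-Type", "Content-Encoding"):
        h.update(name.encode())
        h.update(headers.get(name, "").encode())
    h.update(body)
    return h.hexdigest()

fp = fingerprint({"Server": "nginx", "Content-Type": "text/html"},
                 b"<html>...</html>")
print(fp)  # a 64-character hex digest
```

Note that any dynamic content in the body will change the digest, which is exactly the limitation the question raises.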
You can't be even 1% sure unless you check the IP of the host.
Next comes encryption. (Without it you can fall victim to ARP poisoning, though only on LAN networks.)
The key in HTTPS has to stay the same the whole time.
If it changes, it means either someone is cheating or the key was updated (keys have an expiration date).
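The host-IP check suggested above could be sketched like this. The function name and the warning-only policy are my assumptions, not part of the answer; many legitimate sites resolve to multiple, changing IPs (CDNs, DNS round-robin), so a mismatch is a signal to investigate rather than proof of tampering.

```python
import socket

def host_ip_matches(hostname: str, expected_ip: str) -> bool:
    """Resolve the hostname and compare against a previously stored IP.

    Returns False if the name does not resolve at all. Treat a mismatch
    as a warning, not proof of an attack: CDNs and round-robin DNS mean
    a host can legitimately answer from several addresses.
    """
    try:
        ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
    except socket.gaierror:
        return False
    return expected_ip in ips
```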
The fingerprint of the page is the host name, port, and path. That is the only thing guaranteed to be unique across the web. I suppose you could also include the cache headers (Last-Modified) to see whether the page has changed.
On top of this, if you hashed the HTML you could see whether the content changed regardless of what the Last-Modified header says.
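A fingerprint combining the location identity (host, port, path), the optional Last-Modified value, and a content hash, as this answer describes, might look like the sketch below; the tuple layout is my own choice.

```python
import hashlib
from urllib.parse import urlsplit

def page_fingerprint(url: str, html: str, last_modified: str = "") -> tuple:
    """Combine the location-based identity (host, port, path) with an
    optional cache-header value and a hash of the HTML.

    `last_modified` is the Last-Modified response header, if the server
    sent one; an empty string otherwise.
    """
    parts = urlsplit(url)
    # Fill in the default port when the URL does not state one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    content_hash = hashlib.sha256(html.encode()).hexdigest()
    return (parts.hostname, port, parts.path, last_modified, content_hash)

fp = page_fingerprint("https://example.com/index.html", "<html></html>")
```

Comparing two such tuples tells you both whether it is the same location and whether the content byte-for-byte matches.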
Assuming for a minute that you want to store a 'fingerprint' of an HTML page so you can recognise it later if it matches exactly, just use a simple hash digest of the HTML page.
Unless you clarify the question further, I can see no reason why it should matter that the document is HTML or which browser it is in.
This won't tell you whether the page is at the same location, however. For that you would need to store additional details such as the host/IP and path.
If you can get the text versions of the two pages, you could diff them. You could determine a maximum acceptable range of differences between the pages.
There is a Unix utility called diff; win32 versions of the tool are also floating around the net. Wikipedia has an article on diff: http://en.wikipedia.org/wiki/Diff.
The wiki article lists free file-comparison tools, and the "See also" section has links to other articles that discuss file-comparison tools and delta encoding.
The "Levenshtein distance" metric may also be an interesting approach.
There is a decent C# difference engine on CodeProject. I can't post another link due to my low points, but the article title is: "A Generic, Reusable Diff Algorithm in C#".
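The "close enough" comparison this answer suggests can be sketched with Python's standard-library `difflib`. Its ratio is not the Levenshtein metric, but it serves the same purpose: accept a page when most of it matches the stored copy. The 0.90 threshold is an arbitrary example, not something from the answer.

```python
import difflib

def similarity(stored_html: str, served_html: str) -> float:
    """Return a similarity ratio in [0, 1] between two page versions.

    SequenceMatcher.ratio() measures how much of the two strings match;
    1.0 means identical, lower values mean more differences.
    """
    return difflib.SequenceMatcher(None, stored_html, served_html).ratio()

stored = "<html><body>Hello, Alice! Today is Monday.</body></html>"
served = "<html><body>Hello, Bob! Today is Monday.</body></html>"
score = similarity(stored, served)
# Accept the page if, say, at least 90% of the text matches; the
# threshold is the "maximum acceptable range" the answer mentions.
is_same_page = score >= 0.90
```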
Even if you had the exact hostname, port, and path, the content could still differ if an app server is serving the web pages or the web server is inserting ad content.
If you could reliably identify the parts of the HTML that are dynamic (such as ads or timestamps that keep updating), I would normalize the data first: strip out all whitespace characters (spaces, tabs, newlines), then hash that content.
I would not include the hostname, port, and path in the hash, because that wouldn't add anything to the "fingerprint". (That information is useful later, when you have to re-query the web server to compare the HTML.)
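The normalize-then-hash step described above could look like this minimal sketch. Stripping the site-specific dynamic regions (ads, timestamps) is left out because, as the answer notes, identifying them reliably depends on the site.

```python
import hashlib
import re

def normalized_hash(html: str) -> str:
    """Strip all whitespace (spaces, tabs, newlines), then hash,
    so formatting-only changes do not alter the digest.

    A real implementation would also remove the known dynamic regions
    (ads, timestamps) before hashing; that part is site-specific.
    """
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha256(normalized.encode()).hexdigest()

a = normalized_hash("<html>\n  <body>Hi</body>\n</html>")
b = normalized_hash("<html><body>Hi</body></html>")
# a == b: whitespace-only differences are ignored
```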