How to efficiently determine whether a webpage comes from a given website
I have some unknown webpages and I want to determine which websites they come from. I have example webpages from each website and I assume each website has a distinctive template.
I do not need complete certainty, and I don't want to spend too many resources matching each webpage. So crawling each website to look for the webpage is out of the question.
I imagine the best way is to compare the tree structure of each webpage's DOM. Are there any libraries that will do this?
Ideally I am after a Python based solution, but if there is an algorithm I can understand and implement then I would be interested in that too.
Thanks
Comments (2)
You could do this via Bayes classification. Feed a few pages from each site into the classifier first, then future pages can be tested against them to see how closely they match.
Bayes classifier library available here: reverend (LGPL)
Simplified example:
For better results, tokenise the HTML first. But that might not be necessary.
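To make the idea concrete, here is a minimal sketch of Bayes classification over tokenised HTML. It does not use reverend; it is a small hand-rolled multinomial naive Bayes with add-one smoothing, written only against the standard library. The names `tokenise`, `SiteClassifier`, `train` and `guess` are illustrative assumptions, and the choice of tokens (tag names plus `class`/`id` values) is one guess at what captures a site's template.

```python
import math
import re
from collections import Counter, defaultdict

# Matches opening tags only; closing tags such as </div> are skipped.
TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
ATTR_RE = re.compile(r'\b(class|id)\s*=\s*"([^"]*)"')


def tokenise(html):
    """Turn HTML into tokens: tag names plus class/id attribute values."""
    tokens = []
    for name, attrs in TAG_RE.findall(html):
        tokens.append(name.lower())
        for key, value in ATTR_RE.findall(attrs):
            tokens.extend(f"{key}={part}" for part in value.split())
    return tokens


class SiteClassifier:
    """Hypothetical helper: multinomial naive Bayes, one class per site."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # site -> token frequencies
        self.page_counts = Counter()              # site -> pages trained
        self.vocabulary = set()

    def train(self, site, html):
        tokens = tokenise(html)
        self.token_counts[site].update(tokens)
        self.page_counts[site] += 1
        self.vocabulary.update(tokens)

    def guess(self, html):
        """Return (site, log-score) pairs, best match first."""
        tokens = tokenise(html)
        total_pages = sum(self.page_counts.values())
        vocab_size = len(self.vocabulary) or 1
        scores = {}
        for site, counts in self.token_counts.items():
            total = sum(counts.values())
            score = math.log(self.page_counts[site] / total_pages)
            for token in tokens:
                # Add-one smoothing so unseen tokens do not zero the score.
                score += math.log((counts[token] + 1) / (total + vocab_size))
            scores[site] = score
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Usage would be to call `train(site_name, page_html)` for a few example pages per site, then `guess(unknown_html)` and take the first entry. The scores are log-probabilities, so only their relative order is meaningful.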
A quick and dirty approach is to split the HTML source into its tags, then compare the resulting collections of strings. You should end up with a collection of tags and content, say:
I think a regex can do this in just about every language.
Some content other than tags will be the same across pages (menus and so on). I think a numeric comparison of occurrences should be enough. You can refine it by awarding "points" when the same tag or content appears at the same position. A "combo" of a decent number of matching collection items can probably give you enough confidence.
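The splitting and scoring described above could look roughly like the sketch below. The function names, the equal 50/50 weighting of the two scores, and the idea of keeping one example page per site in a dict are all assumptions made for illustration, not something the answer specifies.

```python
import re
from collections import Counter


def split_html(source):
    """Split HTML into an ordered list of tags and non-empty text pieces."""
    pieces = re.split(r"(<[^>]+>)", source)
    return [piece.strip() for piece in pieces if piece.strip()]


def occurrence_score(a, b):
    """Share of items common to both collections, counting repeats (0..1)."""
    if not a and not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / (len(a) + len(b))


def position_score(a, b):
    """Share of positions where both collections hold the same item (0..1)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def best_site(unknown_html, examples):
    """Score the unknown page against one example page per site.

    `examples` maps a site name to the HTML of an example page.
    Returns the best-scoring site name and the full score table.
    """
    unknown = split_html(unknown_html)
    scores = {}
    for site, html in examples.items():
        pieces = split_html(html)
        # Equal weighting is an arbitrary starting point, not a tuned value.
        scores[site] = (0.5 * occurrence_score(unknown, pieces)
                        + 0.5 * position_score(unknown, pieces))
    return max(scores, key=scores.get), scores
```

The positional score is strict: one extra element near the top of a page shifts everything after it, so the occurrence score is likely to carry most of the signal on real pages.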