How to efficiently determine which website a webpage comes from

Posted 2024-08-03 03:28:59


I have some unknown webpages and I want to determine which websites they come from. I have example webpages from each website and I assume each website has a distinctive template.
I do not need complete certainty, and I don't want to use too many resources matching each webpage. So crawling each website for the webpage is out of the question.

I imagine the best way is to compare the tree structure of each webpage's DOM. Are there any libraries that will do this?

Ideally I am after a Python-based solution, but if there is an algorithm I can understand and implement then I would be interested in that too.
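
Roughly, what I have in mind is something like the sketch below (just an illustration of the idea, not code I already have; it assumes lxml is installed, and the helper names are made up): reduce each page to the multiset of root-to-element tag paths and score pages by how much those multisets overlap, so shared template structure dominates and page-specific text is ignored.

from collections import Counter
import lxml.html

def structure_signature(html_source):
    """Illustrative helper: reduce a page to a Counter of root-to-element tag paths."""
    root = lxml.html.fromstring(html_source)
    paths = Counter()
    for el in root.iter():
        if not isinstance(el.tag, str):  # skip comments and processing instructions
            continue
        parts = []
        node = el
        while node is not None:
            parts.append(node.tag)
            node = node.getparent()
        paths['/'.join(reversed(parts))] += 1
    return paths

def structure_similarity(page_a, page_b):
    """Jaccard similarity between the two pages' path multisets."""
    a, b = structure_signature(page_a), structure_signature(page_b)
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0

# An unknown page would then be assigned to the known site whose sample
# page gives the highest structure_similarity score.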

Thanks


2 Answers

假情假意假温柔 2024-08-10 03:28:59


You could do this via Bayes classification. Feed a few pages from each site into the classifier first, then future pages can be tested against them to see how closely they match.

Bayes classifier library available here: reverend (LGPL)

Simplified example:

# initialisation
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('site one', site_one_page_one_data)
guesser.train('site one', site_one_page_two_data)
# ...etc...
guesser.train('site two', site_two_page_one_data)
guesser.train('site two', site_two_page_two_data)
# ...etc...
guesser.save()

# run time
guesser.load()
results = guesser.guess(page_I_want_to_classify)

For better results, tokenise the HTML first. But that might not be necessary.
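
If you do tokenise, a rough sketch of one way to do it (using the standard-library HTMLParser; the TagTokenizer/tokenize_html names are just illustrative, and it assumes the classifier accepts plain whitespace-separated strings as in the example above) is to keep only the tag names, so the classifier sees the template structure rather than the page content:

from html.parser import HTMLParser

class TagTokenizer(HTMLParser):
    """Illustrative helper: collect tag names only, discarding text."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)

    def handle_endtag(self, tag):
        self.tokens.append('/' + tag)

def tokenize_html(html_source):
    parser = TagTokenizer()
    parser.feed(html_source)
    return ' '.join(parser.tokens)

# e.g. guesser.train('site one', tokenize_html(site_one_page_one_data))
#      results = guesser.guess(tokenize_html(page_I_want_to_classify))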

终陌 2024-08-10 03:28:59


A quick and dirty way you can try is to split the HTML source into its HTML tags, then compare the resulting collections of strings. You should end up with a collection of tags and content, say:

item[n] ="<p>"
item[n+2] ="This is some content"
item[n+2] ="</p>"

I think a regex can do this in just about every language.

Some content other than the tags will also be the same across pages from one site (menus and so on), so a numeric comparison of occurrence counts should be enough. You can improve on this by awarding extra "points" when the same tag/content appears at the same position. A match across a decent number of collection items should probably give you enough certainty.
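
For example, a rough sketch of this in Python (the regex and the scoring are just illustrative choices, not the only way to do it):

import re
from collections import Counter

TAG_RE = re.compile(r'<[^>]+>')

def tokenize(html_source):
    """Split HTML into a flat list of tags and the text between them."""
    tokens = []
    pos = 0
    for match in TAG_RE.finditer(html_source):
        text = html_source[pos:match.start()].strip()
        if text:
            tokens.append(text)
        tokens.append(match.group(0))
        pos = match.end()
    return tokens

def similarity(page_a, page_b):
    """Numeric comparison of occurrences: shared tokens divided by the larger collection."""
    counts_a = Counter(tokenize(page_a))
    counts_b = Counter(tokenize(page_b))
    shared = sum((counts_a & counts_b).values())
    return shared / max(sum(counts_a.values()), sum(counts_b.values()), 1)

# Pick the known site whose sample page gives the highest similarity(unknown_page, sample_page).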
