Classifying website type from its web pages
Are there any reliable, deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its web pages?
For example: forums, blogs, press release sites, news, e-commerce, etc.
I am looking for well-defined characteristics (static rules) from which the type can be determined. If there are none, I hope a machine learning model may help.
Suggestions or ideas?
Comments (2)
If you approach this from a machine learning standpoint, a Naive Bayes classifier probably has the best work/payoff ratio. A version of it is used in Winnow to categorize news articles.
You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.
Dr. Dobb's has an article on implementing Naive Bayes.
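The steps in this answer (labeled pages, word features, Naive Bayes) can be sketched roughly as follows. This is a minimal illustration, not the Winnow or Dr. Dobb's implementation: the class name `NaiveBayes`, the `tokenize` helper, the category labels, and the tiny training set are all hypothetical, and a real classifier would need many labeled pages per category plus HTML stripping before tokenizing.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of pages
        self.vocab = set()                       # every word seen in training

    def train(self, pages):
        """pages is an iterable of (page_text, category) pairs."""
        for text, category in pages:
            self.doc_counts[category] += 1
            for word in tokenize(text):
                self.word_counts[category][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            # Start from the log prior P(category).
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[category].values()) + len(self.vocab)
            # Add the smoothed log likelihood of each known word.
            for word in words:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical toy training data: text already extracted from each page.
TRAINING_PAGES = [
    ("reply quote thread posted by member joined posts", "forum"),
    ("thread reply moderator posts topic member", "forum"),
    ("add to cart price checkout shipping buy now", "ecommerce"),
    ("price cart discount checkout free shipping", "ecommerce"),
    ("posted in leave a comment archives older posts author", "blog"),
    ("comment archives tags about the author subscribe", "blog"),
]

model = NaiveBayes()
model.train(TRAINING_PAGES)
print(model.classify("add to cart and checkout with free shipping"))
print(model.classify("reply to this thread moderator"))
```

Words that never appeared in training are skipped at classification time, and add-one smoothing keeps a single unseen word from zeroing out a category's score. To classify a whole site rather than a page, one option is to classify several of its pages and take the most common label.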
If you're interested in pursuing the naïve Bayes approach (there are other machine learning options, after all), then I suggest the following document, which follows the coverage of this subject in "Data Mining: Practical Machine Learning Tools and Techniques", by Witten and Frank:
http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf