What is the best way to classify a list of websites?
I have a list of X sites that I need to classify in some way. Is a given site about cars, health, or products, or is it about everything (wikiHow, about.com, etc.)? What are some of the better ways to classify sites like this? Should I get the keywords that bring traffic to the site and use those? Should I read the content of some random pages and judge based on that?
Comments (4)
Well, if the site is well designed, there will be meta tags (such as keywords and description) in the page's head section specifically for this.
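As a rough illustration of the meta-tag idea, here is a minimal sketch that pulls the keywords and description meta tags out of a page using Python's standard html.parser module. The sample HTML and the names MetaTagParser and extract_meta are made up for this example; a real page may omit these tags or fill them with unreliable values.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects the content of <meta name="keywords"> and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        if name in ("keywords", "description"):
            self.meta[name] = attr_map.get("content", "").strip()


def extract_meta(html):
    """Return a dict with whichever of 'keywords' / 'description' the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical page source, used only for illustration.
SAMPLE_HTML = """
<html><head>
<title>Example</title>
<meta name="keywords" content="motorcycles, helmets, touring">
<meta name="description" content="Reviews of motorcycles and riding gear.">
</head><body></body></html>
"""

meta = extract_meta(SAMPLE_HTML)
keywords = [k.strip() for k in meta.get("keywords", "").split(",") if k.strip()]
print(keywords)
print(meta.get("description"))
```

The extracted keywords could then be fed into whatever category scheme you settle on.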
Yahoo has an API to extract terms: http://developer.yahoo.com/search/content/V2/termExtraction.html
"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."
Maybe I'm a bit biased (disclaimer: I have a degree in library science, and this topic is one of the reasons I got the degree), but the easiest answer is that there is no best way.
Consider this like you would database design -- once you have your system populated, what sort of questions are you going to ask of it?
Is the fact that the site is run by the government significant? Or that it uses Flash? Or that the pages are blue? Or that it's a hobbyist site? Or that the intended audience is children?
Then we get to the question of whether to have a hierarchical category for any of the facets we're concerned with -- if a site is about both cars and motorcycles, should we use the term 'vehicles' instead? And if we do that, will we use keyword expansion so that 'motorcycle' matches the broader terms (i.e., vehicles) as well?
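To make the keyword-expansion idea concrete, here is a small sketch, assuming a hand-built map from each term to its broader term. The BROADER table and the function names expand and matches are hypothetical; a real taxonomy would be far larger and might allow more than one parent per term.

```python
# Hypothetical broader-term map: each term points to its parent, or None at the top.
BROADER = {
    "motorcycle": "vehicles",
    "car": "vehicles",
    "vehicles": None,
}


def expand(term):
    """Return the term followed by each of its broader terms, narrowest first."""
    chain = []
    seen = set()
    while term is not None and term not in seen:  # 'seen' guards against cycles
        chain.append(term)
        seen.add(term)
        term = BROADER.get(term)
    return chain


def matches(site_terms, query):
    """True if any of the site's terms, once expanded, includes the query term."""
    return any(query in expand(term) for term in site_terms)


print(expand("motorcycle"))
print(matches(["motorcycle"], "vehicles"))
print(matches(["motorcycle"], "car"))
```

With this table, a site tagged only 'motorcycle' is found by a search for 'vehicles' but not by a search for 'car', which is the behavior the expansion is meant to give.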
So ... the point is ... figure out what your needs are, and work towards that. 'Best' will never come, even with years of refinement (if anything, it gets more difficult as terms start changing meaning; remember when 'weblog' was related to web server metrics?).
This is a tough question to answer. Consider:
As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."
For instance, given the following categories:
Spidering the following blocks of text from a site:
and:
We get the following scores:
And we can thus categorize the site as being related mostly to "Motorcycles".
Note that I said "mutations thereof" with regard to category names, so that "motorcycle" and "car" are both detected. We can see from this that you should also perhaps consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should've seen "modern bikes", too.
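The hit-counting approach described above might look something like the following sketch. The category lists, related words, and page text here are hypothetical, made up for illustration, and the names score_text and classify are not from any library. Plurals and related words are listed by hand; a stemmer could replace that.

```python
import re
from collections import Counter

# Hypothetical categories, each with its name, mutations, and related words.
CATEGORY_TERMS = {
    "Motorcycles": ["motorcycle", "motorcycles", "motorcyclist", "motorcyclists",
                    "bike", "bikes"],
    "Cars": ["car", "cars"],
    "Health": ["health", "healthy"],
}


def score_text(text, category_terms=CATEGORY_TERMS):
    """Count how many words in the text hit each category's term list."""
    word_counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: sum(word_counts[term] for term in terms)
        for category, terms in category_terms.items()
    }


def classify(pages, category_terms=CATEGORY_TERMS):
    """Sum the hits over all spidered pages and return (best_category, totals)."""
    totals = Counter()
    for page in pages:
        totals.update(score_text(page, category_terms))
    best, hits = totals.most_common(1)[0]
    return (best if hits > 0 else None), dict(totals)


# Hypothetical text blocks standing in for spidered pages.
PAGES = [
    "Motorcyclists love modern bikes, and a motorcycle is cheaper to run than a car.",
    "Our shop sells motorcycles and a few cars.",
]

best, totals = classify(PAGES)
print(totals)
print(best)
```

Matching whole words rather than substrings avoids counting "car" inside words like "care", at the cost of needing each mutation listed explicitly.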
You could also save those hits, perhaps combine them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.
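One common way to apply Bayesian probability to text is a naive Bayes classifier; the answer does not say which method was intended, so treat the following as one possible reading. It is a minimal sketch using only the standard library, with add-one smoothing and a tiny hypothetical training set of already-labeled page text. The NaiveBayes class and its method names are invented for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocab = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        # Words never seen in training carry no evidence, so they are skipped.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category in self.doc_counts:
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[category] / total_docs)  # prior
            for word in words:
                score += math.log((counts[word] + 1) / denominator)   # likelihood
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labeled examples; real training data would come from sites
# you have already categorized by hand.
nb = NaiveBayes()
nb.train("Reviews of motorcycles, helmets and touring bikes", "Motorcycles")
nb.train("Motorcycle maintenance tips for riders", "Motorcycles")
nb.train("Sedans, hatchbacks and used car prices", "Cars")
nb.train("Car insurance and car loans explained", "Cars")

print(nb.classify("used car loans"))
print(nb.classify("touring helmets for riders"))
```

Unlike raw hit counts, this weighs every word on the page against what was seen in each category, so it needs labeled examples up front.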