What is the best way to categorize a list of websites?

Posted 2024-08-03 05:02:46

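I have a list of X sites that I need to categorize in some way. Is a given site about cars, health, or products, or is it about everything (wikihow, about.com, etc.)? What is a good way to classify sites like these? Should I take the keywords that bring traffic to the site and use those? Or should I read the content of a few random pages and decide from that?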

旧瑾黎汐 2024-08-10 05:02:46

If the site is well designed, its header will contain meta tags meant for exactly this purpose.
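As a minimal sketch of reading those tags, the snippet below uses Python's standard `html.parser` to collect the `keywords` and `description` meta tags from a page's HTML. The class name `MetaTagParser`, the helper `extract_meta`, and the sample markup are made up for illustration, and many real sites leave these tags empty or stuff them with noise, so treat the result as one signal among several.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="keywords"> and <meta name="description"> values."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        name = (attr.get("name") or "").lower()
        if name in ("keywords", "description") and attr.get("content"):
            self.meta[name] = attr["content"]


def extract_meta(html):
    """Return a dict with whichever of the two meta tags the page declares."""
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical page used only to show the call.
sample = (
    "<html><head><title>Example</title>"
    '<meta name="keywords" content="motorcycles, maintenance">'
    '<meta name="description" content="A site about keeping bikes running.">'
    "</head><body></body></html>"
)
print(extract_meta(sample))
```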

桃扇骨 2024-08-10 05:02:46

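Yahoo has an API for term extraction: http://developer.yahoo.com/search/content/V2/termExtraction.html

"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."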
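To give a feel for what term extraction produces, here is a rough local stand-in. It is not the Yahoo service and does not call it; it simply ranks words by frequency after dropping a small, hand-picked stopword list. The function name `extract_terms` and the stopword set are assumptions made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "it", "of", "on", "the", "to"}


def extract_terms(text, top_n=5):
    """Return the top_n most frequent non-stopword terms, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(top_n)]


text = (
    "The motorcycle needs maintenance. "
    "Motorcycle maintenance is important for the motorcycle."
)
print(extract_terms(text, top_n=2))
```

The extracted terms could then be compared against whatever category vocabulary you settle on.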

回复收藏 0 原文

等待我真够勒 2024-08-10 05:02:46

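Maybe I'm a little biased (disclaimer: I have a degree in library science, and this topic is one of the reasons I went for it), but the simple answer is that there is no best way.

Think of it like database design: once your system is populated, what kinds of questions are you going to ask of it?

Does it matter that the site is run by a government? Or that it uses Flash? Or that the pages are blue? Or that it is a hobbyist site? Or that the target audience is children?

Then there is the question of whether to build a hierarchy of categories for whichever facets we care about. If a site deals with both cars and motorcycles, should we use the term "vehicles" instead? And if we do, do we use keyword expansion so that "motorcycles" also matches the broader term (i.e., vehicles)?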
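The keyword-expansion idea can be sketched with a simple narrower-to-broader mapping. The `BROADER` table and the function names below are hypothetical, and a real scheme might allow several broader terms per entry; this version assumes each term has at most one parent.

```python
# Hypothetical hierarchy: each term maps to its single broader term (or None).
BROADER = {
    "motorcycles": "vehicles",
    "cars": "vehicles",
    "vehicles": None,
}


def expand(term):
    """Return the term followed by every broader term above it."""
    chain = []
    current = term
    # The membership check guards against accidental cycles in the table.
    while current is not None and current not in chain:
        chain.append(current)
        current = BROADER.get(current)
    return chain


def matches(site_terms, query):
    """True if the query equals any site term or one of its broader terms."""
    return any(query in expand(term) for term in site_terms)


print(expand("motorcycles"))
print(matches(["motorcycles"], "vehicles"))
print(matches(["vehicles"], "motorcycles"))
```

Expansion here runs only upward: a site tagged "motorcycles" is found by a search for "vehicles", but a site tagged only "vehicles" is not returned for "motorcycles". Whether that asymmetry is what you want depends on the questions you plan to ask.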

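So ... the point is ... figure out what your needs are, and work towards that goal. "Best" will never arrive, even after years of refinement. If anything, it gets harder, because terms start to change meaning. (Remember when "weblog" had to do with web server metrics?)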

暮凉 2024-08-10 05:02:46

This is a tough question to answer. Consider:

How granular do you want your classification to be?
Do you want to classify sites based on your own criteria or the criteria provided by the sites? In other words, if a site classifies itself as "a premier source for motorcycle maintenance", do you want to create a "motorcycle maintenance" category just for that site? This, of course, will cause your list to become inconsistent. However, if you pigeon-hole the sites to follow your own classification scheme, there is a loss of information, and a risk that the site will not match any of the categories you've defined.
Do you allow subcategories? The problem becomes much more complicated if so.
Can a site belong to more than one category? If so, is there an ordering or a weight (i.e., primary category, secondary categories, etc.), or do you follow a scheme similar to SO's tags?
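One possible way to leave the multi-category question open is to store a weight per category and derive the primary category on demand. This is only a sketch; the `SiteRecord` class, its field names, and the example URL are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    url: str
    # category name -> weight (for example a hit count or a probability)
    weights: dict = field(default_factory=dict)

    def ranked(self):
        """Categories from highest to lowest weight, ties broken by name."""
        return sorted(self.weights, key=lambda c: (-self.weights[c], c))

    def primary(self):
        """The highest-weighted category, or None if the site is uncategorized."""
        ranked = self.ranked()
        return ranked[0] if ranked else None


site = SiteRecord("http://example.com", {"Cars": 3, "Motorcycles": 4})
print(site.primary())
print(site.ranked())
```

Dropping the weights and keeping only the keys gives a flat tag scheme, so the same record can serve either design.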

As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."

For instance, given the following categories:

{ "Cars", "Motorcycles", "Video Games" }

Spidering the following blocks of text from a site:

The title is an incongruous play on the title of the book Zen in the Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains that, despite its title, "it should in no way be associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either."

and:

Most motorcycles made since 1980 are pretty reliable if properly maintained but that's a big if. To some extent the high reliability of today's motorcycles has worked to the disadvantage of many riders. Some riders have been lulled into believing that motorcycles are like modern cars and require essentially no maintenance. This is not the case (even with cars). Modern bikes require less maintenance than they did in the 60's and 70's but they still need a lot more maintenance than a car. This higher reliability also means that there are a whole bunch of motorcyclists out there who haven't a clue how to work on their bikes or what really needs to be done to ensure reliability.

We get the following scores:

{ "Cars" : 3, "Motorcycles" : 4, "Video Games" : 0 }

And we can thus categorize the site as being related mostly to "Motorcycles".
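The counting above can be reproduced with a short script. This is a minimal sketch, assuming that a "mutation" means only the singular or plural form of the category name, matched as a whole word and ignoring case. The function names are made up for illustration, the spidering step is replaced by two hard-coded passages, and the inner quotation marks of the first passage are dropped for simplicity.

```python
import re

CATEGORIES = ["Cars", "Motorcycles", "Video Games"]


def name_pattern(category):
    """Whole-word, case-insensitive pattern for the singular or plural name."""
    # Crude stemming: strip one trailing "s", then allow an optional "s".
    stem = category[:-1] if category.lower().endswith("s") else category
    return re.compile(r"\b" + re.escape(stem) + r"s?\b", re.IGNORECASE)


def score_text(text, categories=CATEGORIES):
    """Count how often each category name occurs in the text."""
    return {c: len(name_pattern(c).findall(text)) for c in categories}


def best_category(scores):
    """Pick the category with the most hits."""
    return max(scores, key=scores.get)


# Stand-ins for text fetched by a spider.
PAGES = [
    (
        "The title is an incongruous play on the title of the book Zen in the "
        "Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains "
        "that, despite its title, it should in no way be associated with that "
        "great body of factual information relating to orthodox Zen Buddhist "
        "practice. It's not very factual on motorcycles, either."
    ),
    (
        "Most motorcycles made since 1980 are pretty reliable if properly "
        "maintained but that's a big if. To some extent the high reliability of "
        "today's motorcycles has worked to the disadvantage of many riders. Some "
        "riders have been lulled into believing that motorcycles are like modern "
        "cars and require essentially no maintenance. This is not the case (even "
        "with cars). Modern bikes require less maintenance than they did in the "
        "60's and 70's but they still need a lot more maintenance than a car. "
        "This higher reliability also means that there are a whole bunch of "
        "motorcyclists out there who haven't a clue how to work on their bikes "
        "or what really needs to be done to ensure reliability."
    ),
]

scores = score_text(" ".join(PAGES))
print(scores)                 # {'Cars': 3, 'Motorcycles': 4, 'Video Games': 0}
print(best_category(scores))  # Motorcycles
```

The margin is narrow (4 to 3), which suggests raw hit counts alone are a weak signal on small samples.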

Note that I said "mutations thereof" with regard to category names, so both "motorcycle" and "car" are detected. This also suggests that you should consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should have caught "modern bikes", too.
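A related-words list might look like the sketch below. The `RELATED` table is invented for illustration and would need curating for real data; note that "bike" is ambiguous, since it can just as well refer to bicycles.

```python
import re

# Hypothetical related-term lists, one per category (singular forms).
RELATED = {
    "Cars": ["car", "automobile"],
    "Motorcycles": ["motorcycle", "motorcyclist", "bike"],
    "Video Games": ["video game", "console"],
}


def score_with_related(text, related=RELATED):
    """Count whole-word hits for any related term, singular or plural."""
    scores = {}
    for category, terms in related.items():
        alternatives = "|".join(re.escape(t) for t in terms)
        pattern = re.compile(r"\b(?:" + alternatives + r")s?\b", re.IGNORECASE)
        scores[category] = len(pattern.findall(text))
    return scores


text = (
    "Modern bikes need care, and most motorcyclists maintain "
    "their motorcycles better than their cars."
)
print(score_with_related(text))
```

With related terms included, "bikes" and "motorcyclists" now count towards "Motorcycles", while "care" is still not mistaken for "car" because matches are restricted to whole words.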

You could also save those hits, perhaps combine them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.
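One common way to apply Bayesian probability here is a multinomial naive Bayes classifier with add-one (Laplace) smoothing. That is a choice made for this sketch, not something the answer prescribes. The class name, the tokenizer, and the tiny training strings are all hypothetical; a real classifier would need many labelled pages per category.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> training docs seen
        self.vocab = set()

    def train(self, category, text):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Log of prior times likelihood for each category (unnormalized)."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for category in self.doc_counts:
            counts = self.word_counts[category]
            # Add-one smoothing keeps unseen words from zeroing the product.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[category] / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            scores[category] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training snippets, one per category.
model = NaiveBayes()
model.train("Cars", "sedan engine tires car dealership")
model.train("Motorcycles", "motorcycle helmet riders bike chain")
model.train("Video Games", "console controller game levels players")

print(model.classify("new helmet for bike riders"))
```

Working in log space avoids numeric underflow when pages are long, and the saved hit counts from the spidering step could be fed in as the training data.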
