Cosine similarity measure: multiple results
My program uses clustering to produce subsets of similar items and then uses the cosine similarity measure to determine how similar the clusters are. For instance, if user 1 has 3 clusters and user 2 has 3 clusters, then every cluster of one user is compared against every cluster of the other, producing 9 cosine similarity results, e.g. [0.3, 0.1, 0.4, 0.12, 0.0, 0.6, 0.8, 1.0, 0.22]
My problem is: based on these results, how can I turn these values into a tangible result that shows how similar these two users are?
A simple method I came up with was to add all the values together and divide by the number of comparisons (i.e. take their arithmetic mean) to get a single value, but this is quite a crude approach.
Thanks,
AS
What I am trying to achieve, basically, is to determine how similar two users of the social bookmarking web service Delicious.com are, based on their bookmarks and tags.
Thus far I have created clusters from the tags of a user's bookmarks and the co-occurrences of each tag. For instance, one cluster could be:
fruit: (apple, 15), (orange, 9), (kiwi, 2)
and another user may have a similar cluster produced from their tags:
fruit: (apple, 12), (strawberry, 7), (orange, 3)
Each number represents how many times that tag co-occurred, in a saved bookmark, with the cluster's tag ("fruit" in this example).
I have used the cosine similarity measure to compare these clusters to determine how similar they are. As described in my initial question, this gives many cluster comparison results (each of one user's clusters compared against each of another user's clusters), and I am unsure how to aggregate them to produce a meaningful result.
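A minimal sketch of one such cluster comparison, assuming each cluster is treated as a sparse vector of co-occurrence counts keyed by tag. The function name `cosine_similarity` and the dict representation are illustrative assumptions, not taken from the original program.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {tag: count} dicts."""
    # Only tags present in both clusters contribute to the dot product.
    dot = sum(count * b.get(tag, 0) for tag, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty clusters
    return dot / (norm_a * norm_b)

# The two example "fruit" clusters from the question
user1_fruit = {"apple": 15, "orange": 9, "kiwi": 2}
user2_fruit = {"apple": 12, "strawberry": 7, "orange": 3}

similarity = cosine_similarity(user1_fruit, user2_fruit)
print(round(similarity, 3))
```

For these two clusters the shared tags are apple and orange, and the similarity comes out at roughly 0.83.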
It's very possible that I have been using the cosine similarity improperly,
Comments (2)
The problem is poorly defined... With more details it may be possible to offer commentary about the validity of the approach, in general (that of using Cosine Similarity, of the way it is calculated etc.) as well as the validity of the approach used in aggregating the final result.
Essentially, you are averaging the Cosine Similarity values computed for each pair of clusters (Ca, Cb) where Ca is a cluster which user A "has" and Cb a cluster which B "has".
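As a small illustration of that plain averaging, here is a sketch using the nine example values from the question. The variable names are illustrative only.

```python
# The nine pairwise cosine similarities given as an example in the question
scores = [0.3, 0.1, 0.4, 0.12, 0.0, 0.6, 0.8, 1.0, 0.22]

# Unweighted mean: every cluster pair counts equally
mean_similarity = sum(scores) / len(scores)
print(round(mean_similarity, 3))
```

Note that in this scheme a single perfect match (1.0) is diluted by the many low-scoring pairs, which is one reason an unweighted mean may understate how similar two users are.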
I'm guessing this could be greatly improved by using a weighted average which would take into account the amount of "having" of a cluster that a user can exhibit.
Maybe this "having" relationship is purely Boolean: either a user has or doesn't have a particular cluster. But odds are good that his/her "having" can be qualified with either an [ordered] categorical attribute or even a numerical value (be it relative, say the share of a user's "having" that goes to a given cluster compared to the other clusters he/she has, or be it absolute).
Because each Cosine Similarity is based on a cluster which user "A" has and a cluster which user "B" has, it could be possible, if the "having" measures are properly normalized, to take their product as a coefficient applied to the corresponding Cosine Similarity term in the average computation. In this fashion, if two users are effectively similar but one of them happens to have an extra cluster or two with very low "having" factors, the aggregate result won't suffer much from this.
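The weighting idea could be sketched as follows. This is only an illustration under stated assumptions: the "having" weights below are made up, the nine example values are assumed to be laid out row by row (one row per cluster of user A), and `weighted_similarity` is a hypothetical helper name.

```python
def weighted_similarity(sim, weights_a, weights_b):
    """Weighted average of pairwise cluster similarities.

    sim[i][j]    : cosine similarity between A's cluster i and B's cluster j
    weights_a[i] : how strongly user A "has" cluster i (any non-negative number)
    weights_b[j] : how strongly user B "has" cluster j
    """
    total_a, total_b = sum(weights_a), sum(weights_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    score = 0.0
    for i, wa in enumerate(weights_a):
        for j, wb in enumerate(weights_b):
            # Coefficient = product of the two normalized "having" weights,
            # so all coefficients together sum to 1.
            score += (wa / total_a) * (wb / total_b) * sim[i][j]
    return score

# Example values from the question, assumed row-major (rows: A's clusters)
sim = [[0.3, 0.1, 0.4],
       [0.12, 0.0, 0.6],
       [0.8, 1.0, 0.22]]

# Equal weights reduce to the plain average
uniform = weighted_similarity(sim, [1, 1, 1], [1, 1, 1])

# Hypothetical weights: A mostly "has" cluster 3, B mostly "has" cluster 2
skewed = weighted_similarity(sim, [0.1, 0.1, 0.8], [0.1, 0.8, 0.1])
print(round(uniform, 3), round(skewed, 3))
```

With equal weights the result is the same as the unweighted mean. With the hypothetical skewed weights, the strongly matching pair dominates and the aggregate rises, which is the behaviour described above.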
Generally, distance computations (such as Cosine Similarity) as well as aggregation formulas (such as the average or weighted average) are very sensitive to the scale of the individual dimensions (and to their relative "importance"). For this reason it is often hard to provide anything but generic advice such as the above. Theory matters very much with classification problems, but one needs to be mindful of not applying formulas "blindly": it's easy to lose sight of the forest for the trees ;-)
To help improve the question, here's what I generally understand. Please complement and correct the question to provide a better "feel" for what it is you are trying to achieve and what the characteristics of the system are, so that you may receive better suggestions.
We have items, which we assume are vector-like objects and which are assigned to clusters. The word "subsets" hints that each item probably belongs to one and only one cluster (or possibly to no cluster at all), but it would be good to confirm that this is the case.
Also, it would be good to know whether the dimensions of the vectors are somehow normalized (lest a relatively unimportant characteristic of the items, but one with a relatively big range of values, skew the Cosine Similarity or other distance measurements).
We have users, who can "have" several clusters. It would be good to know (in broad outline) how a given user comes to "have" clusters, and whether having a cluster is only a boolean property (to have or not to have) or whether there is some categorical or even numerical measure of the "having" (User X has cluster 1 with a coefficient of 0.3 and cluster 8 with a coefficient of 0.2, etc.).
The way the Cosine Similarity between two clusters is measured could also be better defined (is it the similarity between the two "centers" of the clusters, or is it something else?).
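To make the "centers" reading concrete, here is a minimal sketch, assuming each item is a dense numeric vector of equal length and that a cluster's center is the component-wise mean of its items. The helper names and the toy data are hypothetical, and the question does not say this is how its similarities are computed.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length item vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy clusters of 2-dimensional items (made-up data)
cluster_a = [[1.0, 0.0], [3.0, 0.0]]   # centroid is [2.0, 0.0]
cluster_b = [[0.0, 2.0], [4.0, 2.0]]   # centroid is [2.0, 2.0]

center_similarity = cosine(centroid(cluster_a), centroid(cluster_b))
print(round(center_similarity, 3))
```

Other definitions are possible, for example averaging the similarities of all item pairs across the two clusters, and they will generally give different numbers.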
There are many methods for comparing sets and clusters. Pair-counting F-Measures, Rand index, ... Most of these have solved the problem of summarizing individual similarities to a single overall similarity.
See this for some pointers:
http://en.wikipedia.org/wiki/Cluster_analysis#Evaluation_of_Clustering_Results
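As one example of such a measure, the Rand index can be sketched in a few lines. Note the assumption: it compares two clusterings of the same set of items, so applying it to two users would first require a shared item set (for instance a common tag vocabulary), which the question does not establish. The function name and the toy labels are illustrative.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings of the same items agree.

    labels_a[k] and labels_b[k] are the cluster labels of item k in each clustering.
    A pair "agrees" if both clusterings put it together, or both keep it apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Four items; the second clustering splits the last cluster in two
labels_a = [0, 0, 1, 1]
labels_b = [0, 0, 1, 2]
ri = rand_index(labels_a, labels_b)
print(round(ri, 3))
```

Here the two clusterings disagree on only one of the six item pairs.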
You must understand that, while it is a human desire to summarize everything in a single score, this is not always adequate. This is why there are so many metrics. They all have their pros and cons.