A method for computing similarity

Posted 2024-09-04 00:50:11

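I'm building a community website and need to compute the similarity between any two users. Each user is described by the following attributes:

age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junky), and so on.

Can anyone tell me how to approach this problem or point me to some resources?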

且行且努力 2024-09-11 00:50:11

Another way is to compute (in R) all the pairwise dissimilarities (distances) between observations in the data set with daisy() from the cluster package. The original variables may be of mixed types. Nominal, ordinal, and (a)symmetric binary data are handled with Gower's general dissimilarity coefficient (Gower, J. C. (1971) A general coefficient of similarity and some of its properties, Biometrics 27, 857–874). For more detail, see page 47 of the linked document. If x contains any columns of these data types, Gower's coefficient is used as the metric.

For example:

# Age stays numeric; the categorical attributes are factors.
x <- data.frame(
  age       = c(10, 12, 25, 14, 29),
  skin      = factor(c("oily", "dry", "dry", "dry", "oily")),
  hair      = factor(c("medium", "short", "medium", "medium", "long")),
  lifestyle = factor(c("active outdoor lover", "TV junky", "TV junky", "active outdoor lover", "TV junky"))
)

library(cluster)
# A data frame keeps the column types; cbind() on factors would yield a matrix of integer codes.
daisy(x, metric = "gower")

You'll get Gower dissimilarities between 0 (identical) and 1 (maximally different):

Dissimilarities :
          1         2         3         4
2 0.7763158                              
3 0.6973684 0.4210526                    
4 0.3026316 0.5263158 0.3947368          
5 0.7500000 0.7236842 0.5526316 0.9473684
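
To make the Gower coefficient concrete, here is a minimal Python sketch of the same idea: numeric attributes contribute their absolute difference divided by the attribute's range, nominal attributes contribute 0 or 1, and the contributions are averaged. The function name gower, the dict-based record layout, and the numeric_ranges argument are assumptions made for illustration, not part of any library.

```python
# Minimal Gower dissimilarity for mixed numeric/nominal records.
# The record layout and helper name are illustrative assumptions.

def gower(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records (dicts).

    numeric_ranges maps each numeric attribute to its (max - min) over the
    whole data set; every other attribute is treated as nominal.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric: absolute difference scaled by the attribute's range.
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Nominal: 0 if the categories match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


users = [
    {"age": 10, "skin": "oily", "hair": "medium", "lifestyle": "active outdoor lover"},
    {"age": 12, "skin": "dry", "hair": "short", "lifestyle": "TV junky"},
    {"age": 25, "skin": "dry", "hair": "medium", "lifestyle": "TV junky"},
    {"age": 14, "skin": "dry", "hair": "medium", "lifestyle": "active outdoor lover"},
    {"age": 29, "skin": "oily", "hair": "long", "lifestyle": "TV junky"},
]
ages = [u["age"] for u in users]
ranges = {"age": max(ages) - min(ages)}

d_1_2 = gower(users[0], users[1], ranges)
d_1_4 = gower(users[0], users[3], ranges)
print(round(d_1_2, 4), round(d_1_4, 4))  # 0.7763 0.3026
```

The sketch assumes every numeric attribute has a non-zero range; an attribute with the same value for all users would need to be skipped to avoid dividing by zero.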

If you are interested in a method for dimensionality reduction of categorical data (which is also a way to arrange variables into homogeneous clusters), check this


梦里°也失望 2024-09-11 00:50:11

Give each attribute an appropriate weight, and add the differences between values.

enum SkinType
    Dry, Medium, Oily

enum HairLength
    Bald, Short, Medium, Long

UserDifference(user1, user2)
    total := 0
    total += abs(user1.Age - user2.Age) * 0.1
    total += abs((int)user1.Skin - (int)user2.Skin) * 0.5
    total += abs((int)user1.Hair - (int)user2.Hair) * 0.8
    # etc...
    return total

If you really need similarity rather than difference, use 1 / (1 + UserDifference(a, b)); the extra 1 avoids division by zero when two users are identical.
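
The pseudocode could be written in Python roughly as follows. This is only a sketch: the dict-based user records, the WEIGHTS table, and the user_similarity helper are assumptions made for illustration, and the weights are simply the ones from the pseudocode.

```python
from enum import IntEnum


class SkinType(IntEnum):
    DRY = 0
    MEDIUM = 1
    OILY = 2


class HairLength(IntEnum):
    BALD = 0
    SHORT = 1
    MEDIUM = 2
    LONG = 3


# Weights copied from the pseudocode; tune them to your own data.
WEIGHTS = {"age": 0.1, "skin": 0.5, "hair": 0.8}


def user_difference(u1, u2):
    """Weighted sum of absolute attribute differences (0 means identical)."""
    return sum(w * abs(u1[k] - u2[k]) for k, w in WEIGHTS.items())


def user_similarity(u1, u2):
    """Map the difference into (0, 1]; the +1 avoids dividing by zero."""
    return 1.0 / (1.0 + user_difference(u1, u2))


alice = {"age": 30, "skin": SkinType.DRY, "hair": HairLength.LONG}
bob = {"age": 40, "skin": SkinType.OILY, "hair": HairLength.SHORT}

print(user_difference(alice, bob))
print(user_similarity(alice, alice))  # 1.0 for identical users
```

Using IntEnum keeps the ordering of the categories explicit, which matters here because the difference treats skin type and hair length as ordered scales.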


披肩女神 2024-09-11 00:50:11

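You should probably take a look at

Data Mining and Data Warehousing (essential)
Machine Learning (extra)
Artificial Neural Networks (especially SOM)
Pattern Recognition (related)

These topics will let your program recognize similarities and clusters in your collection of users and try to adapt to them...

You can then discover hidden common groups of related users... (e.g. users with green hair usually don't like watching TV...)

As a piece of advice, try to use ready-made tools for this feature instead of implementing it yourself...
Take a look at Open Directory Data Mining Projects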


梦旅人picnic 2024-09-11 00:50:11

Three steps to achieve a simple subjective metric for the difference between two data points, which might work fine in your case:

1. Capture each of your variables in a representative numeric variable, for example: skin type (oily=-1, dry=1), hair type (long=2, short=0, medium=1), lifestyle (active outdoor lover=1, TV junky=-1); age is already a number.
2. Scale all numeric ranges so that they fit the relative importance you give them for indicating difference. For example: an age difference of 10 years is about as different as the difference between long and medium hair, and the difference between oily and dry skin. So 10 on the age scale counts as much as 1 on the hair scale and as 2 on the skin scale; scale the difference in age by 0.1, that in hair by 1, and that in skin by 0.5.
3. Use an appropriate distance metric (http://en.wikipedia.org/wiki/Metric_%28mathematics%29) to combine the differences between two people on the various scales into one overall difference. The smaller this number, the more similar they are. I'd suggest simple quadratic difference (Euclidean distance) as a first attempt at your distance function.

Then the difference between two people could be calculated with (I assume Person.age, .skin, .hair, etc. have already gone through step 1 and are numeric):

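#include <cmath>

// Attributes are assumed to be numerically encoded already (step 1).
struct Person {
    double age, skin, hair, lifestyle;
};

double Difference(const Person& p1, const Person& p2) {
    // Per-attribute scale factors from step 2.
    const double agescale = 0.1;
    const double skinscale = 0.5;
    const double hairscale = 1;
    const double lifestylescale = 1;

    double agediff = (p1.age - p2.age) * agescale;
    double skindiff = (p1.skin - p2.skin) * skinscale;
    double hairdiff = (p1.hair - p2.hair) * hairscale;
    double lifestylediff = (p1.lifestyle - p2.lifestyle) * lifestylescale;

    // Squares are written as products: ^ is XOR in C-family languages, not a power operator.
    double diff = std::sqrt(agediff * agediff + skindiff * skindiff
                            + hairdiff * hairdiff + lifestylediff * lifestylediff);
    return diff;
}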

Note that diff in this example is not on a convenient scale such as (0..1). Its value can range from 0 (no difference) to something large (high difference). Also, this method is almost completely unscientific; it is only meant to quickly give you a working difference metric.
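
As a rough illustration of all three steps together, including the encoding step that the function above takes for granted, here is a Python sketch. The CODES and SCALES tables reuse the numbers from the answer; the similarity helper, which squashes the distance into (0, 1] with exp(-d), is a hypothetical addition and only one of several possible choices.

```python
import math

# Step 1: numeric codes for each category, as listed in the answer.
CODES = {
    "skin": {"oily": -1, "dry": 1},
    "hair": {"short": 0, "medium": 1, "long": 2},
    "lifestyle": {"active outdoor lover": 1, "TV junky": -1},
}
# Step 2: scale factors, the same values as in the answer's function.
SCALES = {"age": 0.1, "skin": 0.5, "hair": 1.0, "lifestyle": 1.0}


def encode(person):
    """Replace categorical values with their numeric codes; age stays as is."""
    return {k: CODES[k][v] if k in CODES else v for k, v in person.items()}


def difference(p1, p2):
    """Step 3: Euclidean distance over the scaled attribute differences."""
    e1, e2 = encode(p1), encode(p2)
    return math.sqrt(sum(((e1[k] - e2[k]) * s) ** 2 for k, s in SCALES.items()))


def similarity(p1, p2):
    """Hypothetical helper: squash the distance into (0, 1], 1 meaning identical."""
    return math.exp(-difference(p1, p2))


a = {"age": 30, "skin": "oily", "hair": "long", "lifestyle": "TV junky"}
b = {"age": 40, "skin": "dry", "hair": "medium", "lifestyle": "TV junky"}

print(difference(a, b))   # sqrt(1 + 1 + 1 + 0), about 1.732
print(similarity(a, a))   # 1.0
```

Here the 10-year age gap, the oily/dry skin gap, and the long/medium hair gap each contribute exactly 1 after scaling, which is the equivalence that step 2 set out to express.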
