Apache Mahout + Euclidean distance: unexpected results
I'm using Mahout's EuclideanDistanceSimilarity class to rank the similarity of several users given the following data set of user preferences. The range for preferences is currently all integers from 1 to 5 inclusive. However I have control over the scale, so that can change if it would help.
User Preferences:
Item 1 Item 2 Item 3 Item 4 Item 5 Item 6
1 2 4 3 5 1 2
2 5 1 5 1 5 1
3 1 5 1 5 1 5
4 2 4 3 5 1 2
5 3 3 4 5 2 2
I'm getting unexpected results when I run the following test code, which I added to the Test class found here: http://www.massapi.com/source/mahout-distribution-0.4/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarityTest.java.html
@Test
public void testSimple2() throws Exception {
  DataModel dataModel = getDataModel(
      new long[] {1, 2, 3, 4, 5},
      new Double[][] {
          {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
          {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
          {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
          {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
          {3.0, 3.0, 4.0, 5.0, 2.0, 2.0},
      });
  // Build the similarity once rather than once per pair; results are identical.
  UserSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
  for (int i = 1; i <= 5; i++) {
    for (int j = 1; j <= 5; j++) {
      System.out.println(i + "," + j + ": " + similarity.userSimilarity(i, j));
    }
  }
}
It produces the following results:
1,1: 1.0
1,2: 0.7129109430106292
1,3: 1.0
1,4: 1.0
1,5: 1.0
2,1: 0.7129109430106292
2,2: 1.0
2,3: 0.5556605665978556
2,4: 0.7129109430106292
2,5: 0.8675434911352263
3,1: 1.0
3,2: 0.5556605665978556
3,3: 1.0
3,4: 1.0
3,5: 0.9683428667784535
4,1: 1.0
4,2: 0.7129109430106292
4,3: 1.0
4,4: 1.0
4,5: 1.0
5,1: 1.0
5,2: 0.8675434911352263
5,3: 0.9683428667784535
5,4: 1.0
5,5: 1.0
Would someone please help me understand what I'm doing wrong here? Clearly, user 1's preferences are not identical to users 3 & 5, so why do I get 1.0 for the similarity?
I'm open to using a different algorithm if Euclidean won't work; however, Pearson doesn't work for me because I need to handle users who submit identical preferences for every item, and I do not want to correct for "grade inflation."
It is a little weird but I can explain what's happening.
The Euclidean distance d can't be used as a similarity metric directly since it gets bigger with "less similarity". You could use 1/d, but then perfect matches result in infinity, not 1. You can use 1/(1+d).
The problem is that the distance can only be calculated over dimensions that both users have in common. More dimensions typically means more distance. So it's penalizing overlap, the opposite of what you'd expect.
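To see that penalty concretely, here is a small hypothetical illustration (not Mahout code; the class name is made up): two users who disagree by exactly one rating point on every item they share, so each shared item adds 1 to the squared distance. Under a plain 1/(1+d) similarity, adding more shared items lowers the score even though the per-item agreement is unchanged:

```java
// Hypothetical illustration: per-item disagreement is held constant (1 point),
// yet raw 1/(1+d) similarity drops as the number of shared items grows.
public class OverlapPenaltyDemo {
    // Plain 1/(1+d) similarity over n shared dimensions,
    // where each dimension differs by exactly 1 rating point.
    static double rawSimilarity(int n) {
        double d = Math.sqrt(n); // n squared differences of 1 each
        return 1.0 / (1.0 + d);
    }

    public static void main(String[] args) {
        System.out.println(rawSimilarity(2)); // ~0.414 with 2 shared items
        System.out.println(rawSimilarity(6)); // ~0.290 with 6 shared items: more overlap, lower score
    }
}
```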
So the formula is really n/(1+d), where n is the number of dimensions of overlap. In some cases that produces a similarity greater than 1, which is then capped back to 1.0.
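Applying n/(1+d) by hand to the data in the question reproduces the surprising output. Every user rated all six items, so n = 6 for every pair. The sketch below (my own class, not Mahout source) recomputes two pairs: users 1 and 2 give 6/(1+√55) ≈ 0.7129, matching the printed value, while users 1 and 3 give 6/(1+√15) ≈ 1.23, which the cap pulls back to exactly 1.0:

```java
// Sketch of the n/(1+d) similarity described above, capped at 1.0,
// applied to the rating vectors from the question.
public class EuclideanSimilaritySketch {
    // n/(1+d), where d is the Euclidean distance between the two
    // preference vectors and n is the number of shared dimensions.
    static double similarity(double[] a, double[] b) {
        double sumSq = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sumSq += diff * diff;
        }
        double d = Math.sqrt(sumSq);
        double s = a.length / (1.0 + d);
        return Math.min(s, 1.0); // cap similarities above 1 back to 1.0
    }

    public static void main(String[] args) {
        double[] user1 = {2, 4, 3, 5, 1, 2};
        double[] user2 = {5, 1, 5, 1, 5, 1};
        double[] user3 = {1, 5, 1, 5, 1, 5};
        System.out.println(similarity(user1, user2)); // ~0.7129, matches the output above
        System.out.println(similarity(user1, user3)); // 6/(1+sqrt(15)) ~ 1.23, capped to 1.0
    }
}
```

This is why user 1 shows a similarity of 1.0 against users 3 and 5 despite clearly different preferences: their distances are small enough that 6/(1+d) exceeds 1 before the cap is applied.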
n is not the right factor; it's an old, simple kludge. I will ask on the mailing list about a more correct expression. For large data sets, though, this tends to work OK.