Apache Mahout + Euclidean distance: unexpected results
I'm using Mahout's EuclideanDistanceSimilarity class to rank the similarity of several users given the following data set of user preferences. The range for preferences is currently all integers from 1 to 5 inclusive. However I have control over the scale, so that can change if it would help.
User Preferences:
Item 1 Item 2 Item 3 Item 4 Item 5 Item 6
1 2 4 3 5 1 2
2 5 1 5 1 5 1
3 1 5 1 5 1 5
4 2 4 3 5 1 2
5 3 3 4 5 2 2
I'm getting unexpected results when I run the following test code, which I added to the Test class found here: http://www.massapi.com/source/mahout-distribution-0.4/core/src/test/java/org/apache/mahout/cf/taste/impl/similarity/EuclideanDistanceSimilarityTest.java.html
@Test
public void testSimple2() throws Exception {
  DataModel dataModel = getDataModel(
      new long[] {1, 2, 3, 4, 5},
      new Double[][] {
          {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
          {5.0, 1.0, 5.0, 1.0, 5.0, 1.0},
          {1.0, 5.0, 1.0, 5.0, 1.0, 5.0},
          {2.0, 4.0, 3.0, 5.0, 1.0, 2.0},
          {3.0, 3.0, 4.0, 5.0, 2.0, 2.0},
      });
  // Build the similarity once rather than once per pair; results are identical.
  UserSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
  for (int i = 1; i <= 5; i++) {
    for (int j = 1; j <= 5; j++) {
      System.out.println(i + "," + j + ": " + similarity.userSimilarity(i, j));
    }
  }
}
It produces the following results:
1,1: 1.0
1,2: 0.7129109430106292
1,3: 1.0
1,4: 1.0
1,5: 1.0
2,1: 0.7129109430106292
2,2: 1.0
2,3: 0.5556605665978556
2,4: 0.7129109430106292
2,5: 0.8675434911352263
3,1: 1.0
3,2: 0.5556605665978556
3,3: 1.0
3,4: 1.0
3,5: 0.9683428667784535
4,1: 1.0
4,2: 0.7129109430106292
4,3: 1.0
4,4: 1.0
4,5: 1.0
5,1: 1.0
5,2: 0.8675434911352263
5,3: 0.9683428667784535
5,4: 1.0
5,5: 1.0
Would someone please help me understand what I'm doing wrong here? Clearly, user 1's preferences are not identical to users 3 & 5, so why do I get 1.0 for the similarity?
I'm open to using a different algorithm if Euclidean won't work; however, Pearson doesn't work for me because I need to handle users who submit identical preferences for every item, and I do not want to correct for "grade inflation."
It is a little weird but I can explain what's happening.
The Euclidean distance d can't be used as a similarity metric directly since it gets bigger with "less similarity". You could use 1/d, but then perfect matches result in infinity, not 1. You can use 1/(1+d).
The problem is that the distance can only be calculated over dimensions that both users have in common. More dimensions typically means more distance. So it's penalizing overlap, the opposite of what you'd expect.
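To see that penalty concretely, here is a small hypothetical illustration (not Mahout code; the class name is made up): two users who disagree by exactly one rating point on every item they share, so each shared item adds 1 to the squared distance. Under a plain 1/(1+d) similarity, adding more shared items lowers the score even though the per-item agreement is unchanged:

```java
// Hypothetical illustration: per-item disagreement is held constant (1 point),
// yet raw 1/(1+d) similarity drops as the number of shared items grows.
public class OverlapPenaltyDemo {
    // Plain 1/(1+d) similarity over n shared dimensions,
    // where each dimension differs by exactly 1 rating point.
    static double rawSimilarity(int n) {
        double d = Math.sqrt(n); // n squared differences of 1 each
        return 1.0 / (1.0 + d);
    }

    public static void main(String[] args) {
        System.out.println(rawSimilarity(2)); // ~0.414 with 2 shared items
        System.out.println(rawSimilarity(6)); // ~0.290 with 6 shared items: more overlap, lower score
    }
}
```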
So the formula is really n/(1+d), where n is the number of dimensions of overlap. In some cases that produces a similarity greater than 1, which is then capped back to 1.0.
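Applying n/(1+d) by hand to the data in the question reproduces the surprising output. Every user rated all six items, so n = 6 for every pair. The sketch below (my own class, not Mahout source) recomputes two pairs: users 1 and 2 give 6/(1+√55) ≈ 0.7129, matching the printed value, while users 1 and 3 give 6/(1+√15) ≈ 1.23, which the cap pulls back to exactly 1.0:

```java
// Sketch of the n/(1+d) similarity described above, capped at 1.0,
// applied to the rating vectors from the question.
public class EuclideanSimilaritySketch {
    // n/(1+d), where d is the Euclidean distance between the two
    // preference vectors and n is the number of shared dimensions.
    static double similarity(double[] a, double[] b) {
        double sumSq = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sumSq += diff * diff;
        }
        double d = Math.sqrt(sumSq);
        double s = a.length / (1.0 + d);
        return Math.min(s, 1.0); // cap similarities above 1 back to 1.0
    }

    public static void main(String[] args) {
        double[] user1 = {2, 4, 3, 5, 1, 2};
        double[] user2 = {5, 1, 5, 1, 5, 1};
        double[] user3 = {1, 5, 1, 5, 1, 5};
        System.out.println(similarity(user1, user2)); // ~0.7129, matches the output above
        System.out.println(similarity(user1, user3)); // 6/(1+sqrt(15)) ~ 1.23, capped to 1.0
    }
}
```

This is why user 1 shows a similarity of 1.0 against users 3 and 5 despite clearly different preferences: their distances are small enough that 6/(1+d) exceeds 1 before the cap is applied.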
n is not the right factor; it's an old, simple kludge. I will ask on the mailing list about a more correct expression. For large data sets, though, this tends to work OK.