Mahout collaborative filtering: input format for a binary dataset

Posted on 2024-12-07 18:58:30

I am new to Mahout.

I have already used Mahout's item-based algorithm with a log-likelihood similarity measure. I read in past threads that it is better to use log-likelihood similarity when the recommender handles binary values (like or dislike). I also read that Mahout uses three values (like, dislike, nonexistent), so I am a little confused about the format of the input dataset file.

Does the input file format have to be like this?

 userId, itemID

where the preference by default is 1?

I would like to know if there is a way to put the dislike info in the dataset.

I would expect the input dataset file to look something like this, for example:

 userid, itemid, binaryPreference
 1, 15, 1.0
 2, 35, 0
 1, 25, 1.0
 ......

Please help me!
Thanks in advance!

Comments (1)

音盲 2024-12-14 18:58:30

I am not sure where you read that, but it's wrong. There is no three-state "boolean" preference in Mahout. You either have ratings in your data, or you don't, in which case you have boolean preferences, which either exist or do not exist. There is no third state.

As strange as it may seem, I'd encourage you to try treating "like" and "dislike" as the same thing, to start. It might work well.
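To make that first suggestion concrete, here is a minimal sketch in plain Java (no Mahout classes involved) that drops the value column from rows like the ones in the question, so that a like and a dislike both become a bare user/item association. The class name `BooleanPrefRows` and the method `toBooleanRows` are made up for illustration, and the sketch assumes comma-separated rows with no header line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BooleanPrefRows {

    // Hypothetical helper: keeps only "userID,itemID" from each row,
    // discarding any preference value that follows.
    public static List<String> toBooleanRows(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split(",");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            out.add(parts[0].trim() + "," + parts[1].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample rows taken from the question.
        List<String> rows = Arrays.asList("1, 15, 1.0", "2, 35, 0", "1, 25, 1.0");
        for (String line : toBooleanRows(rows)) {
            System.out.println(line);
        }
    }
}
```

The resulting two-column rows are the kind of value-free input the question asks about; whether this works better than keeping the values is something to check against your own data.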

You can later try incorporating artificial ratings on a scale of -1 to 1 or something to represent like, dislike and shades in between. You could then try other similarity metrics like Euclidean distance to see how it does.
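As a rough illustration of the artificial-ratings idea, the sketch below maps like to 1.0 and dislike to -1.0 and compares two items by the Euclidean distance between the ratings of users who rated both. It is plain Java written for this post, not Mahout's implementation: the class `SignedRatingSketch`, its methods, and the 1 / (1 + distance) conversion to a similarity are all assumptions chosen for illustration, and Mahout's own Euclidean similarity may scale the distance differently.

```java
import java.util.HashMap;
import java.util.Map;

public class SignedRatingSketch {

    // Hypothetical mapping: like -> 1.0, dislike -> -1.0.
    public static double toRating(boolean liked) {
        return liked ? 1.0 : -1.0;
    }

    // Each map holds userID -> rating for one item.
    // Only users present in both maps contribute to the distance.
    // Returns 0.0 when the two items share no users.
    public static double similarity(Map<Long, Double> itemA, Map<Long, Double> itemB) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> entry : itemA.entrySet()) {
            Double other = itemB.get(entry.getKey());
            if (other == null) {
                continue; // user did not rate the second item
            }
            double diff = entry.getValue() - other;
            sumOfSquares += diff * diff;
            shared++;
        }
        if (shared == 0) {
            return 0.0;
        }
        // Assumed convention: distance 0 gives similarity 1, larger distances approach 0.
        return 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    public static void main(String[] args) {
        Map<Long, Double> item15 = new HashMap<>();
        item15.put(1L, toRating(true));
        item15.put(2L, toRating(false));

        Map<Long, Double> item35 = new HashMap<>();
        item35.put(1L, toRating(true));
        item35.put(2L, toRating(true));

        System.out.println(similarity(item15, item35));
    }
}
```

With signed ratings, two users who disagree about an item push the distance up, which is exactly the information a value-free boolean model throws away.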

A third possibility is to build two recommenders: one whose data model holds the "like" associations and another whose data model holds the "dislike" associations. You could use the output of the "like" recommender, and filter or modify it using the results of the "dislike" recommender. This would require some coding, but isn't hard.
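The filtering step could look roughly like the sketch below. It assumes each recommender's output has already been reduced to a list of item IDs; the class `LikeDislikeFilter` and its `filter` method are hypothetical names for illustration, and it shows only the simplest policy (drop anything the "dislike" recommender also returned) rather than re-scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LikeDislikeFilter {

    // Keeps the order of the "like" recommendations and removes every
    // item ID that also appears in the "dislike" recommendations.
    public static List<Long> filter(List<Long> likeRecs, Collection<Long> dislikeRecs) {
        Set<Long> blocked = new HashSet<>(dislikeRecs);
        List<Long> kept = new ArrayList<>();
        for (Long itemId : likeRecs) {
            if (!blocked.contains(itemId)) {
                kept.add(itemId);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative item IDs only, not real recommender output.
        List<Long> likeRecs = Arrays.asList(35L, 15L, 25L);
        List<Long> dislikeRecs = Arrays.asList(15L);
        System.out.println(filter(likeRecs, dislikeRecs));
    }
}
```

A softer variant would lower an item's rank instead of removing it, which matches the "filter or modify" wording above, but that needs the recommenders' scores as well as the IDs.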

[email protected] would be a good place to follow up on this.
