Is KNN worthwhile if most ratings are 5? / Recommendations for passive filtering
I've been looking at building a "people who like x also like y" type of recommendation system and was considering Vogoo, but after reading through its code it seems to rely heavily on ratings-based nearest-neighbor calculations.
Over the last few weeks I've seen a few articles stating that most people either don't rate at all or rate a 5: http://youtube-global.blogspot.com/2009/09/five-stars-dominate-ratings.html
I don't currently have a ratings system implemented, and I don't really see the need to implement one if the ratings it would collect barely vary.
Does this mean that KNN isn't really valuable?
Does anybody have any advice on building a system that recommends similar items based on past viewing history (passive filtering)?
The data I'm working with is event based. If you've looked at men's doubles tennis, Blue Jays baseball, college women's basketball, and so on, I'd recommend other events currently in your area that were also viewed by people who looked at similar events anywhere in the system.
I mostly work with PHP, but have been starting to learn Python (and probably need to learn Java, if that helps).
Comments (3)
Well, the short answer to your first question would be no. If you have no variation in your data (as with YouTube star ratings), it's difficult to make a recommendation.
What I might suggest is trying to expand the amount of data you have. For the YouTube example, instead of just looking at the star ratings, also consider the percentage of the video that was watched. Lots of pausing, seeking, rewinding might mean that the user liked the video and wanted to see parts more often, so it should get a boost from that.
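As a rough illustration of that idea, here is a minimal Python sketch that turns playback behaviour into a score. The function name `engagement_score`, its parameters, and the bonus weights are all assumptions made up for this example; nothing here comes from YouTube or any particular library, and the weights would need tuning against real data.

```python
def engagement_score(watched_seconds, duration_seconds, rewinds=0,
                     rewind_bonus=0.05, max_bonus=0.25):
    """Score one viewing from implicit signals (all weights are arbitrary).

    The base score is the fraction of the video that was watched, capped at 1.
    Each rewind adds a small bonus, capped so that heavy seeking cannot
    dominate the score.
    """
    if duration_seconds <= 0:
        return 0.0
    fraction = min(watched_seconds / duration_seconds, 1.0)
    bonus = min(rewinds * rewind_bonus, max_bonus)
    return fraction + bonus


if __name__ == "__main__":
    # Half of a one-minute clip, with two rewinds.
    print(engagement_score(30, 60, rewinds=2))
```

A score like this can stand in for an explicit star rating wherever a ratings-based algorithm expects one.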
The standard way of doing recommendation, at least in the music world, is to come up with a distance metric that gives you a distance between any two pieces of music. Then, once you know what kind of music a user likes, you can pick songs that match their tastes by choosing ones that are "close" according to the distance metric. The same information is often stored as a similarity matrix, where two items that are far apart have low similarity.
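A minimal Python sketch of picking "close" items, assuming the pairwise distances already exist. The `DISTANCES` table holds invented numbers, and `1 / (1 + d)` is just one common way to turn a distance into a similarity, not the only one.

```python
# Hypothetical pairwise distances between items (smaller means more alike).
DISTANCES = {
    ("tennis", "badminton"): 0.2,
    ("tennis", "baseball"): 0.7,
    ("tennis", "basketball"): 0.6,
    ("badminton", "baseball"): 0.8,
    ("badminton", "basketball"): 0.65,
    ("baseball", "basketball"): 0.4,
}


def distance(a, b):
    """Look up the symmetric distance between two items."""
    if a == b:
        return 0.0
    key = (a, b) if (a, b) in DISTANCES else (b, a)
    return DISTANCES[key]


def similarity(a, b):
    """Convert a distance into a similarity in (0, 1]; far apart means low."""
    return 1.0 / (1.0 + distance(a, b))


def closest(liked, candidates, k=2):
    """Rank unseen candidates by their smallest distance to any liked item."""
    scored = [
        (min(distance(c, item) for item in liked), c)
        for c in candidates
        if c not in liked
    ]
    return [c for _, c in sorted(scored)[:k]]


if __name__ == "__main__":
    print(closest(["tennis"], ["badminton", "baseball", "basketball"]))
```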
So the question comes down to how you generate these similarities. One way would be to count how many people who watched show A also watched show B. If you do this for every pair of events, you'll be able to make recommendations from the corpus you've analyzed. Unfortunately, this doesn't extend well to events for which you don't yet know how many people watched them (live events rather than recorded ones).
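The counting approach can be sketched in a few lines of Python. This is an illustration under assumptions, not a tuned implementation: the `views` structure (user to set of event ids), the function names, and the optional `available` filter (standing in for "events currently in the user's area") are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(views):
    """Count, for every pair of events, how many users viewed both.

    views: dict mapping user id -> set of event ids that user viewed.
    Returns a nested dict so that counts[a][b] == counts[b][a].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for events in views.values():
        for a, b in combinations(sorted(events), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, views, counts, available=None, k=3):
    """Suggest unseen events that co-occur most with what the user viewed."""
    seen = views[user]
    scores = defaultdict(int)
    for event in seen:
        for other, n in counts[event].items():
            if other in seen:
                continue
            if available is not None and other not in available:
                continue
            scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda e: (-scores[e], e))[:k]


if __name__ == "__main__":
    views = {
        "ann": {"tennis", "baseball"},
        "bob": {"tennis", "baseball", "basketball"},
        "cat": {"tennis", "basketball"},
        "dan": {"tennis"},
    }
    counts = cooccurrence(views)
    print(recommend("ann", views, counts))
```

Because the counts depend only on the viewing history, they can be computed offline in a batch job and merely looked up when a recommendation is requested.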
This is at least a start though.
After Andrew's great response, I've decided to explain what I've done in the hope that it helps others (though it may be specific to my implementation).
Keep in mind that I have data on lots of events and where those events take place.
The script I used to build recommendations was this one.
http://www.codediesel.com/php/item-based-collaborative-filtering-php/
However, since there were no ratings in the system, and given the "questionable" value of user-based ratings, I created ratings from the similarities I already had in the data set.
I basically structured it like this
This actually worked much better than I had expected, given the sample size I was working with.
However, it is painfully slow to run.
Not sure how I'll progress from here.
It's true that most people only rate stuff that they really like. You are in luck with your time data because you get an honest, decision-utility-based "rating" for free, based on how long the user watched the sport.
I'd take the log of how long they watched the programme as the user's "rating". Your case is especially easy because you get decimal places of accuracy!
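A small Python sketch of that suggestion, assuming watch time is available in seconds. The `watch_log` tuples and function names are hypothetical. Note one swap: the sketch uses `math.log1p` (the log of 1 + x) rather than a plain log, so that zero seconds maps to a rating of 0 instead of being undefined.

```python
import math
from collections import defaultdict


def implicit_rating(seconds_watched):
    """Damp raw watch time with log(1 + x) so long sessions don't dominate."""
    return math.log1p(max(seconds_watched, 0.0))


def build_ratings(watch_log):
    """Turn (user, event, seconds) records into ratings[user][event].

    Seconds are summed per user/event pair before the log is applied.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for user, event, seconds in watch_log:
        totals[user][event] += seconds
    return {
        user: {event: implicit_rating(s) for event, s in events.items()}
        for user, events in totals.items()
    }


if __name__ == "__main__":
    watch_log = [
        ("ann", "tennis", 1800),
        ("ann", "tennis", 1800),
        ("ann", "baseball", 60),
    ]
    print(build_ratings(watch_log))
```

The resulting dictionary has the same shape as a table of explicit ratings, so it can be fed to a ratings-based collaborative filter unchanged.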