Combining different similarities to build one final similarity
I'm pretty much new to data mining and recommendation systems, and I'm now trying to build some kind of recommendation system for users that have the following parameters:
- city
- education
- interests
To calculate the similarity between users, I'm going to apply cosine similarity and a discrete similarity. For example:
- city: if x = y then d(x,y) = 0. Otherwise, d(x,y) = 1.
- education: here I will use cosine similarity, as words appear in the name of the department or bachelor's degree.
- interests: there will be a hardcoded set of interests the user can choose from, and cosine similarity will be calculated based on two vectors like this:
1 0 0 1 0 0 ... n
1 1 1 0 1 0 ... n
where 1 means the presence of the interest and n is the total number of all interests.
My question is:
How do I combine those 3 similarities in an appropriate way? I mean, just summing them doesn't sound very smart, does it? Also, I would like to hear comments on my "newbie similarity system", hah.
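To make the setup concrete, here is a minimal sketch of the three per-attribute similarities described above. The example profiles, degree names, and interest list are made up for illustration, not part of the question:

    import math

    # Hardcoded interest list (made up for this example).
    ALL_INTERESTS = ["music", "sports", "movies", "travel", "cooking", "gaming"]

    def cosine(u, v):
        # Cosine similarity between two numeric vectors.
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    def city_similarity(a, b):
        # Discrete similarity: the question defines the distance d(x,y);
        # as a similarity, same city -> 1, different city -> 0.
        return 1.0 if a == b else 0.0

    def education_similarity(a, b):
        # Bag-of-words vectors over the two degree names, then cosine.
        ta, tb = a.lower().split(), b.lower().split()
        vocab = sorted(set(ta) | set(tb))
        return cosine([ta.count(w) for w in vocab], [tb.count(w) for w in vocab])

    def interest_vector(interests):
        # 0/1 presence vector over the hardcoded interest list.
        return [1 if i in interests else 0 for i in ALL_INTERESTS]

    u1 = {"city": "Berlin", "education": "bachelor of engineering",
          "interests": {"music", "travel"}}
    u2 = {"city": "Berlin", "education": "bachelors degree in engineering",
          "interests": {"music", "sports", "movies", "cooking"}}

    print(city_similarity(u1["city"], u2["city"]))                 # 1.0
    print(education_similarity(u1["education"], u2["education"]))  # ~0.29
    print(cosine(interest_vector(u1["interests"]),
                 interest_vector(u2["interests"])))                # ~0.35

Note that the two interest sets above produce exactly the vectors 1 0 0 1 0 0 and 1 1 1 0 1 0 from the question.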
There are no hard-and-fast answers, since the answers here depend greatly on your input and problem domain. A lot of the work of machine learning is, for this reason, the art (not science) of preparing your input. I can give you some general ideas to think about. You have two issues: making a meaningful similarity out of each of these items, and then combining them.
The city similarity sounds reasonable but really depends on your domain. Is it really the case that being in the same city means everything, and being in neighboring cities means nothing? For example, does being in similarly-sized cities count for anything? In the same state? If they do, your similarity should reflect that.
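For instance, a graded city similarity might give partial credit for being in the same state. The tiers and the 0.5 value below are pure assumptions to illustrate the idea, not recommendations:

    # Hypothetical graded city similarity: full credit for the same city,
    # partial (assumed 0.5) for the same state, zero otherwise.
    def city_similarity(a, b):
        if a["city"] == b["city"]:
            return 1.0
        if a["state"] == b["state"]:
            return 0.5  # assumed partial credit; tune for your domain
        return 0.0

    print(city_similarity({"city": "Austin", "state": "TX"},
                          {"city": "Dallas", "state": "TX"}))  # 0.5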
Education: I understand why you might use cosine similarity but that is not going to address the real problem here, which is handling different tokens that mean the same thing. You need "eng" and "engineering" to match, and "ba" and "bachelors", things like that. Once you prepare the tokens that way it might give good results.
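For example, a tiny normalization pass with a hand-made synonym map could run before computing cosine similarity. The map entries below are assumptions for illustration, not a real resource:

    # Map abbreviated tokens to canonical forms before any similarity is computed.
    SYNONYMS = {"eng": "engineering", "ba": "bachelors", "bsc": "bachelors"}

    def normalize_tokens(text):
        tokens = text.lower().replace(".", " ").split()
        return [SYNONYMS.get(t, t) for t in tokens]

    print(normalize_tokens("BA Eng"))                 # ['bachelors', 'engineering']
    print(normalize_tokens("bachelors engineering"))  # ['bachelors', 'engineering']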
Interest: I don't think cosine will be the best choice here; try a simple Tanimoto coefficient similarity (just the size of the intersection over the size of the union).
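A minimal sketch of that coefficient on interest sets:

    # Tanimoto (Jaccard) coefficient: |intersection| / |union|.
    def tanimoto(a, b):
        a, b = set(a), set(b)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    print(tanimoto({"music", "travel"},
                   {"music", "sports", "movies", "cooking"}))  # 1/5 = 0.2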
You can't just sum them, as I assume you still want a value in the range [0,1]. You could average them. That makes the assumption that the outputs of each of these are directly comparable, that they're the same "units" if you will. They aren't here; for example, it's not as if they are probabilities.
It might still work OK in practice to average them, perhaps with weights. Note that a plain average says being in the same city is exactly as important as having exactly the same interests. Is that true, or should it be less important?
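A sketch of such a weighted average; the weights below are placeholders to tune against your data, not recommendations:

    # Weighted average of per-attribute similarities. The weights sum to 1,
    # so the result stays in [0, 1] as long as each component does.
    WEIGHTS = {"city": 0.2, "education": 0.3, "interests": 0.5}

    def combined_similarity(sims):
        return sum(WEIGHTS[k] * sims[k] for k in WEIGHTS)

    print(combined_similarity({"city": 1.0, "education": 0.3, "interests": 0.2}))  # 0.39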
You can try and test different variations and weights; hopefully you have some scheme for testing against historical data. I would point you at our project, Mahout, as it has a complete framework for recommenders and evaluation.
However all these sorts of solutions are hacky and heuristic. I think you might want to take a more formal approach to feature encoding and similarities. If you're willing to buy a book and like Mahout, Mahout in Action has good coverage in the clustering chapters on how to select and encode features and then how to make one similarity out of them.
Here's the usual trick in machine learning.
Regarding the city rule (d(x,y) = 0 if x = y, otherwise 1): I take this to mean you use a one-of-K coding. That's good.
Regarding education: you can also use a one-of-K coding here, to produce a vector of size |V| where V is the vocabulary, i.e. all words in your training data.
If you now normalize the interest counts so that they always fall in the range [0,1], then you can use ordinary L1 (Manhattan) or L2 (Euclidean) distance metrics between your final vectors. The latter corresponds to the cosine similarity metric of information retrieval: for length-normalized vectors, ||x - y||^2 = 2(1 - cos(x, y)), so ranking by L2 distance is the same as ranking by cosine similarity.
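A sketch of that encoding plus the two distance metrics; the city vocabulary and the example interest vectors below are made up:

    import math

    CITIES = ["Berlin", "Hamburg", "Munich"]  # assumed city vocabulary

    def one_of_k(value, vocabulary):
        # One-of-K (one-hot) coding: a 0/1 vector with a single 1 at the value's index.
        return [1.0 if v == value else 0.0 for v in vocabulary]

    def l1(u, v):
        # Manhattan distance.
        return sum(abs(x - y) for x, y in zip(u, v))

    def l2(u, v):
        # Euclidean distance.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    # Final per-user vector: concatenate the encoded features.
    u1 = one_of_k("Berlin", CITIES) + [1, 0, 0, 1, 0, 0]  # city + interests
    u2 = one_of_k("Berlin", CITIES) + [1, 1, 1, 0, 1, 0]

    print(l1(u1, u2))  # 4.0
    print(l2(u1, u2))  # 2.0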
Experiment with L1 and L2 to decide which is best.