How to compute similarity between lists of features?
I have users and resources. Each resource is described by a set of features, and each user is related to a different set of resources. In my particular case, the resources are web pages, and the features are information about the location of the visit, the time of the visit, the number of visits, etc., which are tied to a specific user each time.
I want to get a similarity measure between my users regarding those features but I can't find a way to aggregate the resource features together. I've done it with text features, as it is possible to add the documents together and then extract features (say TF-IDF), but I don't know how to proceed with this configuration.
To be as clear as possible, here is what I have:
>>> len(user_features)
13 # that's my number of users
>>> user_features[0].shape
(2374, 17) # 2374 documents for this user, and 17 features
I'm able to get a similarity matrix of the documents using euclidean distances for instance:
>>> euclidean_distance(user_features[0], user_features[0])
But I don't know how to compare the users against each other. I should somehow aggregate the features together to end up with an N_Users x N_Features matrix, but I don't know how.
Any hints on how to proceed?
Some more information about the features I'm using:
The features I have here are not completely fixed. What I've got so far is 13 different features, already aggregated from "views". What I have is the standard deviation, mean, etc. for each of the views, in order to have something "flat" that can be compared. One of the features I have is: was the location changed since the last view? And what about one hour ago? Two hours ago?
Comments (3)
If each user is represented as a set of document-interaction vectors, you can define the similarity of a pair of users as the similarity of the two sets of document-interaction vectors that represent them.
You say you can get a similarity matrix of the documents. Then assume that user U1 visited documents D1, D2, D3, and user U2 visited documents D1, D3, D4. You would have two sets of vectors: S1 = {U1(D1), U1(D2), U1(D3)} for user 1 and S2 = {U2(D1), U2(D3), U2(D4)} for user 2. Note that because each user's interaction with a document is different, U1(D1) and U2(D1) are separate vectors. If I understand correctly, the elements of these sets correspond to the respective rows in the matrix of each user.
The similarity between these two sets can be computed in many different ways. One option is the average pair-wise similarity: iterate over all pairings of one element from each set, compute the document similarity of the pair, and average over all pairs.
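The average pair-wise idea can be sketched as follows. This is only an illustration under some assumptions: it uses Euclidean distance (so the result is a dissimilarity, smaller meaning more alike) because that is what the question already uses, and the function names `mean_pairwise_distance` and `user_distance_matrix` are made up for this example. `user_features` is assumed to be a list with one (n_documents, n_features) array per user, as in the question.

```python
import numpy as np

def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all (row of a, row of b) pairs."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared distances via ||x||^2 + ||y||^2 - 2 x.y, which avoids
    # building an (n_a, n_b, n_features) intermediate array.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(sq, 0.0)).mean()

def user_distance_matrix(user_features):
    """Symmetric (n_users, n_users) matrix of average pair-wise distances."""
    n = len(user_features)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = mean_pairwise_distance(user_features[i], user_features[j])
            out[i, j] = out[j, i] = d
    return out
```

Note that with this measure the diagonal is generally not zero: a user's average distance to their own documents reflects how spread out those documents are.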
You could use the mean of the features in each user's set of resources, which seems a natural way to summarize a user. numpy.mean with an appropriate axis argument should get you the mean; then compute the Euclidean distance between the resulting "user vectors" (of length n_features) as you did before between document vectors.
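A minimal sketch of the averaging approach, assuming `user_features` is a list with one (n_documents, n_features) array per user as in the question. The helper names `user_vectors` and `pairwise_euclidean` are invented for this example; the only library call relied on is `numpy.mean` with `axis=0`.

```python
import numpy as np

def user_vectors(user_features):
    """Stack one mean feature vector per user: shape (n_users, n_features)."""
    # axis=0 averages over the documents (rows), leaving one value per feature.
    return np.vstack([np.asarray(f, dtype=float).mean(axis=0) for f in user_features])

def pairwise_euclidean(x):
    """Euclidean distance between every pair of rows of x."""
    # diffs[i, j] is x[i] - x[j]; fine here because n_users is small.
    diffs = x[:, None, :] - x[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))
```

With 13 users and 17 features, `user_vectors(user_features)` would have shape (13, 17), and `pairwise_euclidean` of that would be a 13 x 13 matrix of user-to-user distances.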
I would look at creating multiple dimensions from the documents: for example, take the documents that are visited at certain times of day, divide them up by morning and night, and then plot which users are night owls and which are early birds.
With any number of such dimensions you can create a matrix of users and use the distance between users to compare them.
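One way to picture the morning/night split is the sketch below. Everything specific in it is an assumption made for illustration: that the hour of each visit (0 to 23) is available per user, the cut-offs used for "morning" and "night", and the function name `time_of_day_profile`.

```python
import numpy as np

def time_of_day_profile(visit_hours):
    """Return [fraction of morning visits, fraction of night visits]."""
    h = np.asarray(visit_hours)
    # Hypothetical cut-offs: morning is 05:00-11:59, night is 22:00-04:59.
    morning = (h >= 5) & (h < 12)
    night = (h >= 22) | (h < 5)
    return np.array([morning.mean(), night.mean()])

def profile_matrix(hours_per_user):
    """Stack the profiles into an (n_users, 2) matrix, one row per user."""
    return np.vstack([time_of_day_profile(h) for h in hours_per_user])
```

Each row of the resulting matrix is a point that can be plotted (morning share against night share), and distances between rows give a rough user-to-user comparison along these two dimensions.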