How to compute similarity between lists of features?

Posted on 2024-11-29 01:51:42

I have users and resources. Each resource is described by a set of features, and each user is related to a different set of resources. In my particular case, the resources are web pages, and the features are information about the location of the visit, the time of the visit, the number of visits, etc., which are tied to a specific user each time.

I want to get a similarity measure between my users regarding those features but I can't find a way to aggregate the resource features together. I've done it with text features, as it is possible to add the documents together and then extract features (say TF-IDF), but I don't know how to proceed with this configuration.

To be as clear as possible, here is what I have:

>>> len(user_features)
13 # that's my number of users
>>> user_features[0].shape
(2374, 17) # 2374 documents for this user, and 17 features

I'm able to get a similarity matrix of the documents using Euclidean distances, for instance:

>>> euclidean_distance(user_features[0], user_features[0])

But I don't know how to compare the users against each other. I should somehow aggregate the features together to end up with an N_Users x N_Features matrix, but I don't know how.

Any hints on how to proceed?

Some more information about the features I'm using:

The features I have here are not completely fixed. What I've got so far is 13 different features, already aggregated from "views". What I have is the standard deviation, mean, etc. for each of the views, in order to have something "flat" that can be compared. One of the features I have is: was the location changed since the last view? And what about one hour ago? Two hours ago?

不知所踪 2024-12-06 01:51:42

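If each user is represented as a set of document-interaction vectors, you can define the similarity of a pair of users as the similarity of the two sets of document-interaction vectors that represent them.

You say that you are able to get a similarity matrix for documents. Suppose, then, that user U1 visited documents D1, D2, D3 and user U2 visited documents D1, D3, D4. You would have two sets of vectors, S1 = {U1(D1), U1(D2), U1(D3)} and S2 = {U2(D1), U2(D3), U2(D4)}. The notation U1(D1) versus U2(D1) is deliberate: each user interacts with a document differently, so the same document yields a different vector for each user. If I understood correctly, the elements of these sets correspond to the rows of each user's matrix.

The similarity between these two sets can be computed in many different ways. One option is the average pairwise similarity: iterate over all pairings of one element from each set, compute the document similarity for the pair, and average over all pairs.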
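The average pairwise option can be sketched as below. This is only an illustration under assumptions, not a definitive implementation: it assumes Euclidean distance as the document-level measure (so smaller values mean more similar users), the helper names `pairwise_euclidean`, `average_pairwise_distance` and `user_distance_matrix` are made up for this example, and the data is random, shaped like the `user_features` list in the question.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distance between every row of a and every row of b."""
    # Uses ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y to avoid building
    # a (rows_a, rows_b, n_features) intermediate array.
    sq_a = (a ** 2).sum(axis=1)[:, None]
    sq_b = (b ** 2).sum(axis=1)[None, :]
    sq_dist = sq_a + sq_b - 2.0 * (a @ b.T)
    # Rounding can leave tiny negative values; clip them before the sqrt.
    return np.sqrt(np.maximum(sq_dist, 0.0))


def average_pairwise_distance(a, b):
    """Mean distance over all pairings of one row from a and one row from b."""
    return pairwise_euclidean(a, b).mean()


def user_distance_matrix(user_features):
    """Symmetric N_users x N_users matrix of average pairwise distances."""
    n_users = len(user_features)
    out = np.zeros((n_users, n_users))
    for i in range(n_users):
        for j in range(i, n_users):
            d = average_pairwise_distance(user_features[i], user_features[j])
            out[i, j] = out[j, i] = d
    return out


# Hypothetical data: 3 users with different numbers of documents, 17 features each.
rng = np.random.default_rng(0)
user_features = [rng.normal(size=(n_docs, 17)) for n_docs in (5, 8, 3)]

distances = user_distance_matrix(user_features)
print(distances.shape)
```

Two caveats with this sketch: the diagonal is not zero, because a user's own documents are generally not identical to one another, and the features probably need to be put on a common scale first so that no single feature dominates the distance.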
