Find the cosine similarity between two arrays

Posted 2024-08-26 21:37:08

I'm wondering if there is a built in function in R that can find the cosine similarity (or cosine distance) between two arrays?

Currently, I implemented my own function, but I can't help but think that R should already come with one.
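
For context, a hand-rolled version of the kind the question mentions could look like the sketch below in base R. The name cosine_sim is hypothetical (it is not from the original post), and the sketch assumes two numeric vectors of equal length with nonzero norm.

```r
# Cosine similarity: dot product divided by the product of the Euclidean norms.
cosine_sim <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

x <- c(1, 2, 3)
y <- c(2, 4, 6)   # same direction as x
z <- c(-3, 0, 1)  # orthogonal to x

sim_xy <- cosine_sim(x, y)
sim_xz <- cosine_sim(x, z)

# Cosine distance is commonly taken as 1 minus the similarity.
cos_dist_xy <- 1 - sim_xy
```

Parallel vectors give a similarity of 1 and orthogonal vectors give 0, which makes a quick sanity check for any packaged replacement.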

孤独患者 2024-09-02 21:37:08

Questions like this come up all the time (for me, and, judging by the r-tagged SO question list, for others as well):

Is there a function, either in R core or in any R package, that does x? And if so,

where can I find it among the 2000+ R packages on CRAN?

Short answer: give the sos package a try when questions like this come up.

One of the earlier answers gave cosine along with a link to its help page. This is probably exactly what the OP wants. When you look at the linked-to page you see that this function is in the lsa package.

But how would you find this function if you didn't already know which Package to look for it in?

You can always try the standard R help functions (">" below just means the R command line):

> ?<some_name>

> ??<some_name>

> apropos("<some_name>")
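
As a rough illustration of why these built-in tools can come up empty, here is a small sketch using apropos() and exists(). It only inspects what is on the current search path, so a function that lives in a package you have not loaded (such as cosine in lsa) will normally not show up. The variable names are made up for the example.

```r
# apropos() matches a regular expression against the names of visible objects.
hits <- apropos("^cos")
print(hits)

# exists() checks whether a function of a given name is currently visible.
has_cos <- exists("cos", mode = "function")        # base trigonometric function
has_cosine <- exists("cosine", mode = "function")  # visible only if a package providing it is attached
print(c(cos = has_cos, cosine = has_cosine))

# `??cosine` is shorthand for help.search("cosine"), which searches the
# documentation of installed packages only, not all of CRAN.
```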

If these fail, install and load the sos package, then use

findFn

findFn is also aliased to "???", though I don't often use that because I don't think you can pass in arguments other than the function name.

For the question here, try this:

> library(sos)

> findFn("cosine", maxPages=2, sortby="MaxScore")

The additional arguments passed in (maxPages=2 and sortby="MaxScore") just limit the number of results returned and specify how the results are ranked, respectively; i.e., "find a function named 'cosine' or that has the term 'cosine' in the function description, only return two pages of results, and order them by descending relevance score".

The findFn call above returns a data frame with nine columns and the results as rows, rendered as HTML.

Scanning the last column, Description and Link, at item (row) 21 you find:

Cosine Measures (Matrices)

This text is also a link; clicking on it takes you to the help page for that function in the package that contains it. In other words,

using findFn, you can pretty quickly find the function you want even though you have no idea which package it's in.

风居住的梦幻卍 2024-09-02 21:37:08

It looks like a few options are already available, but I just stumbled across an idiomatic solution I like, so I thought I'd add it to the list.

install.packages('proxy') # Let's be honest, you've never heard of this before.
library('proxy') # Library of similarity/dissimilarity measures for 'dist()'
dist(m, method="cosine") # m: a numeric matrix; distances are computed between its rows
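
A small usage sketch, assuming the proxy package is installed. The matrix m is made-up example data, and the calls are written as proxy::dist and proxy::simil so it is clear which dist is meant. To my understanding, proxy treats cosine as a similarity and dist() reports 1 minus it; treat that as an assumption to check against the package documentation.

```r
# Made-up data: rows 1 and 3 point in the same direction.
m <- rbind(c(1, 0, 1),
           c(0, 1, 1),
           c(2, 0, 2))

if (requireNamespace("proxy", quietly = TRUE)) {
  d <- proxy::dist(m, method = "cosine")   # pairwise cosine distances between rows
  s <- proxy::simil(m, method = "cosine")  # pairwise cosine similarities between rows
  print(d)
  print(s)
} else {
  message("proxy is not installed; run install.packages('proxy') first")
}
```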
若沐 2024-09-02 21:37:08

Taking the comment from Jonathan Chang, I wrote this function to mimic dist. No extra packages to load.

cosineDist <- function(x){
  # x: numeric matrix; returns pairwise cosine distances between its rows as a "dist" object
  as.dist(1 - x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2))))
}
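
A quick usage sketch for the function above, on a made-up matrix whose rows are the vectors being compared. The function is repeated so the snippet runs on its own.

```r
cosineDist <- function(x) {
  as.dist(1 - x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2))))
}

# Made-up data: row 3 is a multiple of row 1, row 2 is orthogonal to both.
m <- rbind(c(1, 0),
           c(0, 1),
           c(2, 0))

d <- cosineDist(m)
print(d)

# Convert to a full symmetric matrix to look up individual pairs.
dm <- as.matrix(d)
```

Like stats::dist, the result stores only the lower triangle, so it can be passed directly to functions such as hclust.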
喜你已久 2024-09-02 21:37:08

You can also check the vegan package: http://cran.r-project.org/web/packages/vegan//index.html

The function vegdist in this package has a variety of dissimilarity (distance) functions, such as manhattan, euclidean, canberra, bray, kulczynski, jaccard, gower, altGower, morisita, horn, mountford, raup, binomial, chao or cao. Please check the .pdf in the package for definitions, or consult the references at https://stats.stackexchange.com/a/33001/12733.

唔猫 2024-09-02 21:37:08

If you have a dot product matrix, you can use this function to compute the cosine similarity matrix:

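get_cos = function(S){
  # S is the dot-product matrix dt %*% t(dt), so its diagonal holds the squared row norms
  doc_norm = sqrt(diag(S))
  divide_one_norm = S/doc_norm          # divide row i by the norm of vector i
  cosine = t(divide_one_norm)/doc_norm  # transpose, then divide by the other vector's norm
  return (cosine)
}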

Input S is the matrix of dot products: S = dt %*% t(dt), where dt is your dataset (one row per vector).

The function simply divides each dot product by the norms of the two vectors involved.
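
The same idea can be sketched directly from the dataset in base R: scale every row to unit length, then take the cross-product of the rows. The data and variable names below are made up for illustration.

```r
# Made-up dataset: one vector per row.
dt <- rbind(c(1, 0, 1),
            c(0, 1, 1),
            c(2, 0, 2))

row_norm <- sqrt(rowSums(dt^2))

# Route 1: normalise rows first (row i is divided by row_norm[i]),
# then tcrossprod() gives all pairwise dot products of the unit vectors.
dt_unit <- dt / row_norm
cos_sim <- tcrossprod(dt_unit)

# Route 2: start from the dot-product matrix and divide by the outer
# product of the norms, as described in the answer.
S <- dt %*% t(dt)
cos_from_S <- S / outer(row_norm, row_norm)

print(cos_sim)
```

Both routes should agree up to floating-point error; the first avoids building S when only the similarities are needed.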

忆悲凉 2024-09-02 21:37:08

Cosine similarity is not invariant to shift. Correlation similarity may be a better choice because it fixes this problem, and it is also connected to squared Euclidean distances (if the data are standardized).

If you have two objects described by p-dimensional feature vectors,
x1 and x2, both of dimension p, you can compute the correlation similarity with cor(x1, x2).
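
The shift argument can be checked numerically with a short sketch. The helper cosine_sim and the example vectors are made up here; the point is that adding a constant to both vectors changes the cosine but not the Pearson correlation, and that the correlation equals the cosine of the mean-centered vectors.

```r
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

x1 <- c(1, 2, 3, 4)
x2 <- c(2, 1, 4, 3)
shift <- 10

cos_before <- cosine_sim(x1, x2)
cos_after  <- cosine_sim(x1 + shift, x2 + shift)  # changes with the shift

cor_before <- cor(x1, x2)
cor_after  <- cor(x1 + shift, x2 + shift)         # unchanged by the shift

# Correlation is the cosine similarity of the centered vectors.
cos_centered <- cosine_sim(x1 - mean(x1), x2 - mean(x2))

print(c(cos_before = cos_before, cos_after = cos_after,
        cor_before = cor_before, cor_after = cor_after))
```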

Note that in statistics correlation is used as a scaled moment notion, so it is naturally thought of as correlation between random variables. The cor(dataset) function will compute correlations between the columns of the data matrix.

In a typical situation with an (n x p) data matrix X, with units (or objects) on its rows and variables (or features) on its columns, you can compute the correlation similarity matrix simply by computing cor on the transpose of X and giving the resulting object a dist class:

as.dist(cor(t(X)))

By the way, you can compute a correlation dissimilarity matrix the same way. The following takes into account both the size of the angle and the orientation between the objects' vectors:

1 - cor(t(X))

This one doesn't care about the orientation, only the size of the angle:

1 - abs(cor(t(X)))
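
A small worked sketch of the two dissimilarities on made-up data, with objects on the rows of X. Here obj3 is a perfectly reversed version of obj1, so the signed version places them as far apart as possible while the unsigned version treats them as identical.

```r
# Made-up data: 3 objects (rows) described by 4 features (columns).
X <- rbind(obj1 = c(1, 2, 3, 4),
           obj2 = c(2, 4, 6, 9),
           obj3 = c(4, 3, 2, 1))

R <- cor(t(X))  # correlations between objects, i.e. between rows of X

d_signed   <- as.dist(1 - R)       # ranges from 0 (same direction) to 2 (opposite)
d_unsigned <- as.dist(1 - abs(R))  # ranges from 0 to 1, ignores orientation

print(d_signed)
print(d_unsigned)
```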