Memory problems with the as.dist function in R/rpy2

Posted on 2024-10-24 08:44:46

I'm trying to perform hierarchical clustering using a custom distance measure. I perform all the calculations in Python and then pass the data structures to R to do the clustering:

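import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

r = robjects.r
stats = importr('stats')

# list_of_data: flat Python list holding size * size distances, row by row
m = r.matrix(robjects.FloatVector(list_of_data), ncol=size, byrow=True)
# Convert the full square matrix into the dist object that hclust expects
dist_mat = stats.as_dist(m)
hc = stats.hclust(dist_mat)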

So my distance measures are held in a Python list, converted to an R matrix, which is then converted into the dist object required for the clustering. This works to an extent. However, when the matrix becomes too big I get this error:

python(18944,0xb0081000) malloc: *** mmap(size=168898560) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 161.1 Mb

This occurs at the point where I convert to a dist object (as.dist). I haven't tested at exactly what size it falls apart, but it works with a 3000x3000 matrix and fails with a 6500x6500 matrix, so the limit is somewhere in between.
I'm using Python's del statement to try to remove any unnecessary objects from memory, but from what I've read this doesn't guarantee that the memory will become immediately available for use.

So, ultimately, is there a more memory-efficient way to get a dist object? Or is there perhaps an alternative method I could use? I've found some other methods in R's cluster library which do not use a dist object, but these methods use built-in distance metrics.

Thanks in advance!

Comments (1)

葵雨 2024-10-31 08:44:46

Using Python's del does not guarantee that the memory becomes immediately available for use. Calling the garbage collector explicitly helps. The answer to another question here (Clearing memory used by rpy2) points to the relevant section in the rpy2 documentation.
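As a rough illustration of calling the garbage collector explicitly, here is a minimal sketch. The helper name collect_python_and_r and its run_r_gc flag are made up for this example; the only library calls used are Python's gc.collect() and R's own gc(), evaluated through robjects.r. Whether this frees enough memory for a given workload is not something the sketch can promise.

```python
import gc


def collect_python_and_r(run_r_gc=True):
    """Force a Python garbage collection, then optionally an R one.

    Hypothetical helper for illustration only. Returns the number of
    unreachable objects found by Python's collector.
    """
    # Python side: objects released with `del` are only reclaimed once
    # nothing references them; gc.collect() also clears reference cycles.
    found = gc.collect()
    if run_r_gc:
        # R side: imported lazily so the Python part works without rpy2.
        import rpy2.robjects as robjects
        robjects.r('gc()')
    return found
```

A typical use would be to del the large Python list and the intermediate R matrix as soon as they are no longer needed, and then call collect_python_and_r() before the next large allocation.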

Regarding clustering algorithms: hierarchical clustering with hclust() does require a "distance" matrix, stored as a dist object of n * (n - 1) / 2 values (R saves some memory because the matrix is symmetric, so only the lower triangle without the diagonal is kept). Other clustering algorithms exist, and if you are keen on hierarchical clustering there are tricks to reduce the size of the starting matrix by creating initial blocks, but that is outside the scope of a programming-related question.
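To make the sizes concrete: R's dist object holds n * (n - 1) / 2 doubles, which for n = 6500 is about 161 MiB and is consistent with the "161.1 Mb" in the error message, while the full square matrix needs roughly twice that. One possible way to avoid ever building the square matrix is to compute the distances directly in the order a dist object uses and attach the dist attributes to the vector. The sketch below is an assumption-laden illustration, not a tested recipe: the function names are invented here, and to_r_dist in particular has not been run against R. It relies on hclust() accepting a numeric vector that carries a Size attribute and class "dist".

```python
def condensed_length(n):
    """Number of values in R's dist object for n observations."""
    return n * (n - 1) // 2


def condensed_distances(items, distance):
    """Yield pairwise distances in the order of R's dist object:
    lower triangle, column by column, diagonal excluded."""
    n = len(items)
    for j in range(n):
        for i in range(j + 1, n):
            yield distance(items[i], items[j])


def to_r_dist(values, n):
    """Untested sketch: wrap an already condensed sequence of distances
    as an R dist object without going through a square matrix."""
    import rpy2.robjects as robjects  # lazy import; needs rpy2 and R
    # Small R helper that sets the attributes a dist object carries.
    make_dist = robjects.r(
        'function(v, n) structure(v, Size = n, class = "dist")'
    )
    return make_dist(robjects.FloatVector(values), n)


# Memory for n = 6500, assuming 8 bytes per double
n = 6500
dist_mib = condensed_length(n) * 8 / 2**20
full_mib = n * n * 8 / 2**20
```

Under these assumptions the result of to_r_dist(values, n) would be passed to stats.hclust() in place of the output of as_dist. Note that a Python list of several million floats is itself large, so the Python side may still need attention.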
