Memory problem with the as.dist function in R/rpy2

Posted 2024-10-24 08:44:46

I'm trying to perform hierarchical clustering using a custom distance measure. I perform all the calculations in Python and then pass the data structures to R to do the clustering:

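import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

r = robjects.r
stats = importr('stats')

# list_of_data: flat Python list of size*size distances, row by row
m = r.matrix(robjects.FloatVector(list_of_data), ncol=size, byrow=True)
dist_mat = stats.as_dist(m)   # convert the full matrix to an R dist object
hc = stats.hclust(dist_mat)   # hierarchical clustering on the dist object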

So my distance measures are held in a Python list and converted to an R matrix, which is then converted into the dist object required for the clustering. This works to an extent. However, when the matrix becomes too big, I get this error:

python(18944,0xb0081000) malloc: *** mmap(size=168898560) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 161.1 Mb

This occurs at the point where I convert to a dist object (as.dist). I haven't tested exactly at what size it falls apart, but it works with a 3000x3000 matrix and fails with a 6500x6500 matrix, so the limit is somewhere in between.
I'm using Python's del statement to try to remove any unnecessary objects from memory, but from what I've read this doesn't guarantee that the memory will become immediately available for use.
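
As a rough check on the sizes involved, the following sketch estimates the raw storage for an n x n matrix of doubles and for the lower triangle that a dist object keeps. The helper names are made up for illustration, and the figures assume 8-byte doubles and ignore object headers and any temporary copies made during conversion.

```python
MIB = 1024 ** 2  # R's "Mb" in allocation errors is 1024 * 1024 bytes


def matrix_bytes(n, itemsize=8):
    """Raw storage for a full n x n matrix of doubles."""
    return n * n * itemsize


def dist_bytes(n, itemsize=8):
    """Raw storage for the n*(n-1)/2 entries of the lower triangle."""
    return n * (n - 1) // 2 * itemsize


if __name__ == "__main__":
    for n in (3000, 6500):
        print(n,
              round(matrix_bytes(n) / MIB, 1), "MiB full matrix,",
              round(dist_bytes(n) / MIB, 1), "MiB lower triangle")
```

For n = 6500 the lower triangle comes to about 161.1 MiB, which is in line with the "cannot allocate vector of size 161.1 Mb" message, on top of roughly 322 MiB for the full matrix that must exist at the same time.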

So, ultimately, is there a more memory-efficient way to get a dist object? Or is there perhaps an alternative method I could use? I've found some other methods in R's cluster library which do not use a dist object, but these methods use built-in distance metrics.
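
One direction that might be worth trying, sketched here under assumptions rather than as a confirmed fix: compute only the n(n-1)/2 pairwise distances in Python, in the order R's dist class stores them, and attach the dist attributes directly so the full matrix and the as.dist conversion are skipped. The names condensed_distances and to_r_dist are hypothetical, the distance function is a stand-in for the custom measure, and the rpy2 part assumes a working rpy2 and R installation and has not been checked against the setup described above.

```python
from array import array
from itertools import combinations


def condensed_distances(items, dist):
    """Return (values, n): pairwise distances in the order R's dist uses.

    R stores the lower triangle column by column: d(2,1), d(3,1), ..., d(n,n-1).
    For a symmetric measure this matches combinations(range(n), 2).
    """
    n = len(items)
    # array('d') holds raw 8-byte doubles instead of Python float objects
    values = array('d', (dist(items[i], items[j])
                         for i, j in combinations(range(n), 2)))
    return values, n


def to_r_dist(values, n):
    """Wrap the condensed values as an R 'dist' object (needs rpy2 and R)."""
    # Imported lazily so condensed_distances can be used without R installed
    import rpy2.robjects as robjects
    structure = robjects.r['structure']
    # 'class' is a reserved word in Python, so it is passed through a dict
    return structure(robjects.FloatVector(values), Size=n,
                     Diag=False, Upper=False, **{'class': 'dist'})


if __name__ == "__main__":
    # Toy data: points on a line with absolute difference as the distance
    points = [0.0, 1.0, 3.0, 6.0]
    values, n = condensed_distances(points, lambda a, b: abs(a - b))
    print(list(values), n)
```

With stats imported as in the code above, the result would be passed on as stats.hclust(to_r_dist(values, n)). The dist vector itself still needs n(n-1)/2 doubles, so this only avoids the full matrix and the intermediate Python list of float objects.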

Thanks in advance!
