Memory problems with the as.dist function in R/rpy2
I'm trying to perform hierarchical clustering using a custom distance measure. I perform all the calculations in Python and then pass the data structures to R to do the clustering:
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

r = robjects.r
stats = importr('stats')

# list_of_data holds the size x size distances in row-major order
m = r.matrix(robjects.FloatVector(list_of_data), ncol=size, byrow=True)
dist_mat = stats.as_dist(m)  # this is the step that runs out of memory
hc = stats.hclust(dist_mat)
So my distance measures are held in a Python list, converted to an R matrix, and then converted into the dist object required for the clustering. This works up to a point. However, when the matrix becomes too big, I get this error:
python(18944,0xb0081000) malloc: *** mmap(size=168898560) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 161.1 Mb
This occurs at the point where I convert to a dist object (as.dist). I haven't tested exactly at what size it falls apart, but it works with a 3000x3000 matrix and fails with a 6500x6500 matrix, so the limit is somewhere in between.
I'm using the del statement in Python to try to remove any unnecessary objects from memory, but from what I've read this doesn't guarantee that the memory will become immediately available for use.
So, ultimately, is there a more memory-efficient way to get a dist object? Or is there perhaps an alternative method I could use? I've found some other methods in R's cluster library that do not use a dist object, but these methods use built-in distance metrics.
Thanks in advance!
Comments (1)
Using Python's del statement does not guarantee that the memory becomes immediately available for use. Calling the garbage collector explicitly helps. The answer to another question here (Clearing memory used by rpy2) points to the relevant section in the rpy2 documentation.
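To illustrate the point about del, here is a minimal sketch in plain Python. The Node class is a hypothetical stand-in for a large intermediate object and is not part of rpy2. It shows that del only removes a name, and that an explicit gc.collect() is what reclaims an object that is still part of a reference cycle.

```python
import gc


class Node:
    """Hypothetical stand-in for a large intermediate object."""

    def __init__(self):
        self.payload = [0.0] * 1000
        self.me = self  # reference cycle: the refcount cannot drop to zero


gc.disable()   # keep automatic collection out of the way for the demo
gc.collect()   # clear any garbage that existed before the demo

node = Node()
del node       # removes the name only; the cyclic object is still in memory

unreachable = gc.collect()  # explicit collection finds and frees the cycle
gc.enable()

print(unreachable)  # number of unreachable objects found by the collector

# With rpy2, the same idea would apply to Python wrappers around R objects.
# Assumption, not executed here: after gc.collect(), asking R to run its own
# collector with robjects.r('gc()') lets R release the underlying memory.
```

Whether this is enough for the 6500x6500 case depends on how much memory the process can address in the first place, so treat it as a first thing to try rather than a guaranteed fix.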
Regarding clustering algorithms, hierarchical clustering with hclust() does require a "distance" matrix (of size n * (n - 1) / 2; R saves a bit of memory since the matrix is symmetrical and its diagonal is zero). There exist other clustering algorithms, and, if you are keen on hierarchical clustering, tricks to minimize the size of the starting matrix by creating initial blocks, but that's outside the scope of a programming-related question.
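As a rough check on the sizes involved, the sketch below computes how many values a dist object for n items holds (n * (n - 1) / 2, each an 8-byte double) and packs a row-major square list into the layout R documents for dist objects, namely the strict lower triangle stored column by column. The function names are made up for illustration. Turning the packed vector into an actual dist object from rpy2 (setting its Size and class attributes) is not shown, and any way of doing so would be an assumption on my part.

```python
def dist_length(n):
    """Number of values in a dist object for n items (strict lower triangle)."""
    return n * (n - 1) // 2


def dist_megabytes(n, bytes_per_value=8):
    """Approximate size of the dist vector in units of 2**20 bytes."""
    return dist_length(n) * bytes_per_value / 2**20


def pack_lower_triangle(flat, n):
    """Pack a row-major n*n list into the strict lower triangle, by columns."""
    return [flat[i * n + j] for j in range(n - 1) for i in range(j + 1, n)]


# For n = 6500 this gives 161.1, matching the failed allocation in the question.
print(round(dist_megabytes(6500), 1))

# Tiny 3x3 example: the distances d(1,0)=1, d(2,0)=2, d(2,1)=3.
square = [0, 1, 2,
          1, 0, 3,
          2, 3, 0]
print(pack_lower_triangle(square, 3))  # [1, 2, 3]
```

So the failing allocation is the dist vector itself, and it comes on top of the full 6500x6500 matrix (about twice that size) that is already held in memory. Building only the lower triangle on the Python side would avoid holding the full square matrix at the same time.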