How do I use a sparse matrix with Python hcluster?

Posted on 2024-10-06 07:32:26

I'm trying to use the hcluster library in Python. I don't know enough Python to use a sparse matrix with hcluster. Could anybody help me? Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO

data.dmp contains a matrix that looks like:

It contains only the upper-right part of the matrix. I don't know how to say this correctly in English :) I mean all the numbers above the main diagonal.
So data.dmp contains: 1 0 1, 0 1 , 0

f = file('data.dmp','r')  
s = StringIO(f.readline()).getvalue()
f.close()

matrix = numpy.asarray(eval("["+s+"]"))

For a reason unknown to me, hcluster uses inverted values; for example, I use 0 if A != C and 1 if A == D.

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Then I run linkage on Y:

Z = linkage(Y, method="complete")

So the matrix Z is what I need (assuming I used hcluster correctly?).

But I have the following problems:

I want to use a sparse matrix for the huge amount of input data, because generating the input data the way I do now is time-consuming. I need to import the data into Python from another language, which is why I need to read a text file. Python gurus, could you kindly suggest how to do this?

To people who have used Python hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC (hierarchical agglomerative clustering)?

Thank you for reading, I appreciate any help!


面犯桃花 2024-10-13 07:32:26

Represent each input as a dictionary mapping feature name to value. Zeros are not stored in the dictionary.
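As a rough illustration of that representation, here is a minimal sketch that builds such dictionaries. Both helper names (`to_sparse`, `parse_sparse_line`) and the `name:value` text format are hypothetical, chosen only for this example; the format your other program writes may well differ.

```python
def to_sparse(dense_row, names=None):
    """Turn a dense sequence of numbers into a {feature: value} dict, dropping zeros."""
    if names is None:
        # Assumption: with no feature names given, use the column index as the name.
        names = range(len(dense_row))
    return {name: float(v) for name, v in zip(names, dense_row) if v != 0}


def parse_sparse_line(line):
    """Parse one line of a hypothetical format: whitespace-separated 'name:value' pairs."""
    rep = {}
    for token in line.split():
        name, _, value = token.partition(":")
        value = float(value)
        if value != 0.0:  # zeros are simply not stored
            rep[name] = value
    return rep


row = to_sparse([1, 0, 0, 2.5])
parsed = parse_sparse_line("a:1 c:0 d:2.5")
print(row)
print(parsed)
```

Writing only the non-zero features from the other language keeps the text file small, which is the main point of the sparse form.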

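Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes a sparse squared error. If you L2-normalize all feature vectors, squared error is equivalent to cosine distance: for unit vectors it equals exactly twice the cosine distance, so both rank pairs of inputs the same way.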

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    Each repr is a dict from feature to feature value; missing features count as 0.
    """
    # Union of the features present in either input (works on Python 2 and 3).
    keys = frozenset(repr1) | frozenset(repr2)
    total = 0.
    for k in keys:
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
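The link between squared error and cosine distance after L2 normalization can be checked numerically. The sketch below is only an illustration: `l2_normalize`, `sparse_sqrerr` and `cosine_distance` are hypothetical helper names written for this example, not part of hcluster.

```python
import math


def l2_normalize(rep):
    """Return a copy of a {feature: value} dict scaled to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in rep.values()))
    if norm == 0.0:
        return dict(rep)  # an all-zero input cannot be normalized; leave it as is
    return {k: v / norm for k, v in rep.items()}


def sparse_sqrerr(rep1, rep2):
    """Squared Euclidean distance between two sparse dicts (missing keys count as 0)."""
    keys = set(rep1) | set(rep2)
    return sum((rep1.get(k, 0.0) - rep2.get(k, 0.0)) ** 2 for k in keys)


def cosine_distance(rep1, rep2):
    """1 - cos(angle); the dot product only needs features present in both dicts."""
    dot = sum(v * rep2[k] for k, v in rep1.items() if k in rep2)
    n1 = math.sqrt(sum(v * v for v in rep1.values()))
    n2 = math.sqrt(sum(v * v for v in rep2.values()))
    return 1.0 - dot / (n1 * n2)


a = {"x": 3.0, "y": 4.0}
b = {"y": 1.0, "z": 2.0}
sq = sparse_sqrerr(l2_normalize(a), l2_normalize(b))
cd = cosine_distance(a, b)
print(sq, 2.0 * cd)  # the two numbers agree up to floating-point rounding
```

The factor of two comes from expanding the squared distance of two unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 (1 - cos).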

You should call sqrerr for every Y[i,j] element you want to compute.

Make Y a square matrix, and make sure that Y[i,j] == Y[j,i] (the diagonal is zero, since sqrerr of an input with itself is 0). Use hcluster.squareform to convert Y to the condensed form that hcluster.linkage expects.
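To make the layout concrete, here is a pure-Python sketch of those two steps. The names `square_distances` and `condensed` are hypothetical and used only here; `condensed` imitates the flattening that squareform performs on a square matrix, as far as I understand it, and is not the library function itself.

```python
def sqrerr(rep1, rep2):
    """Sparse squared error between two {feature: value} dicts."""
    keys = set(rep1) | set(rep2)
    return sum((rep1.get(k, 0.0) - rep2.get(k, 0.0)) ** 2 for k in keys)


def square_distances(reps):
    """Build the full symmetric matrix Y with a zero diagonal."""
    n = len(reps)
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sqrerr(reps[i], reps[j])
            Y[i][j] = d  # compute each pair once ...
            Y[j][i] = d  # ... and mirror it so that Y[i][j] == Y[j][i]
    return Y


def condensed(Y):
    """Flatten the strict upper triangle row by row: (0,1), (0,2), ..., (1,2), ..."""
    n = len(Y)
    return [Y[i][j] for i in range(n) for j in range(i + 1, n)]


reps = [{"a": 1.0}, {"a": 1.0, "b": 2.0}, {"c": 3.0}]
Y = square_distances(reps)
y = condensed(Y)
print(y)
```

For n inputs the condensed vector has n * (n - 1) / 2 entries, so a few hundred rows stay small. With the library available, the corresponding calls would presumably be squareform(numpy.asarray(Y)) followed by linkage(..., method="complete").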
