How do I use a sparse matrix in Python hcluster?

Posted 2024-10-06 07:32:26

I'm trying to use the hcluster library in Python. I don't know enough Python to use a sparse matrix with hcluster. Can anybody help me? Here is what I'm doing:

import os.path
import numpy
import scipy
import scipy.io 
from hcluster import squareform, pdist, linkage, complete 
from hcluster.hierarchy import linkage, from_mlab_linkage 
from numpy import savetxt 
from StringIO import StringIO 

data.dmp contains a matrix that looks like:

  A B C D
A 0 1 0 1 
B 1 0 0 1 
C 0 0 0 0 
D 1 1 0 0 

It contains only the upper-right part of the matrix, that is, all the numbers above the main diagonal.
So data.dmp contains: 1 0 1, 0 1, 0

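# Read the first line of the file (the condensed upper-triangle values).
with open('data.dmp', 'r') as f:
    s = f.readline()

# Parse the numbers directly instead of calling eval() on file contents.
matrix = numpy.array(s.replace(',', ' ').split(), dtype=float)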
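
The upper-triangle layout described above is what squareform calls the condensed form. A minimal sketch of the round trip for the 4x4 example, assuming SciPy's scipy.spatial.distance.squareform (hcluster's functions were incorporated into SciPy) and with the values typed in by hand rather than read from data.dmp:

```python
import numpy as np
from scipy.spatial.distance import squareform

# Strict upper triangle of the 4x4 example, row by row:
# (A,B) (A,C) (A,D) (B,C) (B,D) (C,D)
condensed = np.array([1, 0, 1, 0, 1, 0], dtype=float)

# squareform expands a condensed vector into a symmetric matrix with a
# zero diagonal, and converts back when given such a square matrix.
square = squareform(condensed)
print(square)
```

The printed matrix should match the A-D table shown earlier, which is a quick way to confirm the file is in the order squareform expects.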

For a reason I don't understand, hcluster uses inverted values: for example, I use 0 if A != C and 1 if A == D.

sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")

Then I run linkage on Y:

Z = linkage(Y, method="complete")

So the matrix Z is what I need (if I used hcluster correctly?).
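
One possible reading of the "inverted values" remark is that linkage expects distances (0 means identical), while the file stores similarities (1 means related). The sketch below works under that assumption. It uses SciPy's scipy.cluster.hierarchy in place of hcluster, and the conversion distance = 1 - similarity is a hypothetical choice, not something the post specifies. It also passes the condensed vector straight to linkage instead of calling pdist on the square matrix, because pdist would treat each row as a feature vector and the all-zero row C has no defined cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed similarities from the example: 1 = related, 0 = unrelated.
similarity = np.array([1, 0, 1, 0, 1, 0], dtype=float)

# Assumed conversion: turn similarities into distances.
distance = 1.0 - similarity

# linkage accepts a condensed distance vector directly.
Z = linkage(distance, method="complete")

# Cut the tree at distance 0.5 to get flat cluster labels for A, B, C, D.
labels = fcluster(Z, t=0.5, criterion="distance")
print(Z)
print(labels)
```

Under these assumptions A, B and D end up in one cluster and C in its own.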

But I have the following problems:

  1. I want to use a sparse matrix for the huge amount of input data,
    because generating the input data the way I do now is time-consuming.
    I need to import the data into Python from another language, which is
    why I need to read a text file. Python gurus, could you please suggest
    how to do this?

  2. To people who have used Python hcluster: I need to process a huge
    amount of data, hundreds of rows. Is that possible in hcluster?
    Does this algorithm really produce a correct HAC?

Thank you for reading, I appreciate any help!

Comments (1)

面犯桃花 2024-10-13 07:32:26

Represent each input as a dictionary mapping feature name to value. Zeros are not stored in the dictionary.

Compute the Y matrix yourself instead of using hcluster.pdist. The following code computes a sparse squared error. Squared error is equivalent to cosine distance (up to a constant factor of 2) if you L2-normalize all feature vectors.

def sqrerr(repr1, repr2):
    """
    Compute the squared error between two reprs.
    Each repr is a dict mapping feature name to feature value;
    features missing from a dict count as 0.
    """
    # Union of feature names; works on both Python 2 and Python 3.
    keys = set(repr1) | set(repr2)
    total = 0.
    for k in keys:
        diff = repr1.get(k, 0.) - repr2.get(k, 0.)
        total += diff * diff
    return total
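
The link between squared error and cosine distance can be checked numerically. A minimal sketch with NumPy and two made-up vectors (the helper l2_normalize is hypothetical, introduced only for this check): for unit-length vectors, the squared error equals twice the cosine distance.

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit Euclidean length.
    return v / np.linalg.norm(v)

# Two arbitrary example vectors, normalized to unit length.
a = l2_normalize(np.array([1.0, 2.0, 0.0, 3.0]))
b = l2_normalize(np.array([0.0, 1.0, 4.0, 1.0]))

squared_error = float(np.sum((a - b) ** 2))
cosine_distance = 1.0 - float(np.dot(a, b))

# For unit vectors: |a - b|^2 = 2 - 2*cos = 2 * cosine_distance.
print(squared_error, 2.0 * cosine_distance)
```

Because the two differ only by a constant factor, they rank pairs identically, although the merge heights reported by linkage will be scaled by that factor.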

Call sqrerr for every Y[i,j] element you want to compute.

Make Y a square matrix with a zero diagonal, and make sure that Y[i,j] == Y[j,i]. Then use hcluster.squareform to convert Y to the condensed form that hcluster.linkage expects.
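
Put together, the steps above might look like the following sketch. It assumes SciPy's squareform and linkage as stand-ins for the hcluster functions of the same name, and the four input dictionaries are hypothetical sample data.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

def sqrerr(repr1, repr2):
    # Sparse squared error; features missing from a dict count as 0.
    keys = set(repr1) | set(repr2)
    return sum((repr1.get(k, 0.0) - repr2.get(k, 0.0)) ** 2 for k in keys)

# Hypothetical sparse inputs: only non-zero features are stored.
items = [
    {"x": 1.0, "y": 2.0},
    {"x": 1.0, "y": 2.5},
    {"z": 4.0},
    {"y": 0.5, "z": 3.5},
]

# Fill a symmetric square matrix with a zero diagonal.
n = len(items)
Y = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        Y[i, j] = Y[j, i] = sqrerr(items[i], items[j])

# Convert to condensed form and build the complete-linkage tree.
condensed = squareform(Y)
Z = linkage(condensed, method="complete")
print(Z)
```

Only the dictionaries are sparse here: Y itself is dense and holds n*n values, so memory grows quadratically with the number of rows, which is easily manageable for a few hundred inputs.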
