How to use a sparse matrix in Python hcluster?
I'm trying to use the hcluster library in Python, but I don't know enough Python to use a sparse matrix with hcluster. Could anybody help me? Here is what I'm doing:
import os.path
import numpy
import scipy
import scipy.io
from hcluster import squareform, pdist, linkage, complete
from hcluster.hierarchy import linkage, from_mlab_linkage
from numpy import savetxt
from StringIO import StringIO
data.dmp contains a matrix that looks like:
A B C D
A 0 1 0 1
B 1 0 0 1
C 0 0 0 0
D 1 1 0 0
The file stores only the upper-right part of the matrix, that is, all the numbers above the main diagonal.
So data.dmp contains: 1 0 1, 0 1, 0
f = file('data.dmp','r')
s = StringIO(f.readline()).getvalue()
f.close()
matrix = numpy.asarray(eval("["+s+"]"))
For a reason I don't understand, hcluster uses inverted values: for example, I use 0 if A != C and 1 if A == D.
sqfrm = squareform(matrix)
Y = pdist(sqfrm, metric="cosine")
Then I compute the linkage of Y:
Z = linkage(Y, method="complete")
So the matrix Z is what I need (if I have used hcluster correctly).
But I have the following problems:
I want to use a sparse matrix for the huge amount of input data, because generating the input data the way I do now is time-consuming. I need to import the data into Python from another language, which is why I have to read a text file. Could the Python gurus kindly suggest how to do this?
To people who have used Python hcluster: I need to process a huge amount of data, hundreds of rows. Is that possible in hcluster? Does this algorithm really produce a correct HAC?
Thank you for reading, I appreciate any help!
Comments (1)
Represent each input as a dictionary mapping feature names to values. Zeros are not stored in the dictionary.
Compute the Y matrix yourself instead of using hcluster.pdist. The following code does sparse squared error. Squared error is equivalent to cosine distance if you L2-normalize all feature vectors (for unit vectors, the squared error equals 2 * (1 - cosine similarity)). You should call sqrerr for every Y[i,j] element you want to compute.
Make Y a square matrix, and make sure that Y[i,j] == Y[j,i]. Use hcluster.squareform to convert Y to a form that is suitable for hcluster.linkage.
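As a rough illustration of the steps above, here is a minimal pure-Python sketch. It assumes each input row is a plain dict with zeros omitted. The `sqrerr` shown here is only an illustrative reconstruction of what such a function might look like, and `l2_normalize` and `condensed_distances` are hypothetical helper names, not part of hcluster.

```python
import math


def l2_normalize(vec):
    """Return a copy of a sparse vector (dict) scaled to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        # All-zero row: cosine distance is undefined, so leave it unscaled.
        return dict(vec)
    return {k: v / norm for k, v in vec.items()}


def sqrerr(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    total = 0.0
    # Features present in a (and possibly in b).
    for k, v in a.items():
        d = v - b.get(k, 0.0)
        total += d * d
    # Features present only in b.
    for k, v in b.items():
        if k not in a:
            total += v * v
    return total


def condensed_distances(vectors):
    """Distances for all pairs i < j, row by row (upper triangle only)."""
    n = len(vectors)
    return [sqrerr(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


# Rows of the example matrix from the question, zeros omitted.
rows = [
    {"B": 1, "D": 1},  # A
    {"A": 1, "D": 1},  # B
    {},                # C (all zeros)
    {"A": 1, "B": 1},  # D
]

Y = condensed_distances([l2_normalize(r) for r in rows])
```

Because only the pairs with i < j are computed, Y is symmetric by construction and is already laid out as the upper triangle, row by row. It should therefore be usable directly as the first argument of `linkage(Y, method="complete")` from the question, without building the full square matrix first; that call is left out so the sketch runs without hcluster installed.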