Scipy negative distance? What?

Published 2024-08-28 02:55:24

I have an input file which contains floating-point numbers to 4 decimal places:

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002    0.0003    0.0007    ... 

(the first is the id).
My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then int()s these numbers. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I am continually seeing the error, "Linkage Z contains negative values".

I seriously think this is a bug, because:

  1. I checked my values,
  2. the values are nowhere near small enough or big enough to approach the limits of floating-point numbers, and
  3. the formula that I used to derive the values in the file uses absolute value (my input is DEFINITELY right).

Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative distance error?

=====

import operator
import os
from os.path import join as pjoin
from collections import defaultdict
from scipy.cluster import hierarchy

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance; we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:
                vectors[l[0]] = map(lambda x: int(x*10000), scores)     #int might save space
            else:
                vectors[l[0]] = scores

    if assertAllPositive:
        #Assert that no vector contains a negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map(lambda x: x < 0, l)):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors

def main( self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vectors from a file and starts clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname))
    for threshold in map( lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """Function which you should call to vary the threshold.
    vectors:    { featureID: [ tfidf score, tfidf score, ... ] }
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        # Group feature IDs by the flat cluster index assigned to each vector
        for i, featureID in enumerate(vectors.keys()):
            clusters[results[i]].append(featureID)
    return clusters


撞了怀 2024-09-04 02:55:24


This is because of floating-point inaccuracy: some distances between your vectors, instead of being 0, come out as, for example, -0.000000000000000002. Use the numpy.clip() function to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr, 0, 1, dmatr) and you should be OK.
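A sketch of that suggestion (the data here is hypothetical): compute the condensed distance matrix with pdist, clip it in place, then build the linkage from it yourself.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical data: each row is one vector. Row 1 is parallel to row 0,
# so their cosine distance is 0 up to round-off (it can come out as a
# tiny negative number such as -2e-16).
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [1.0, 0.0, 0.0]])

dmatr = pdist(data, metric='cosine')   # condensed distance matrix
np.clip(dmatr, 0, 1, out=dmatr)        # clamp tiny negatives to 0 in place
Z = hierarchy.linkage(dmatr, method='average')
```

One caveat: cosine distance can legitimately reach 2 for opposed vectors, so clipping to [0, 1] implicitly assumes non-negative data; clipping only the lower bound, np.clip(dmatr, 0, None, out=dmatr), is the safer general fix.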

甜味超标? 2024-09-04 02:55:24


I'm pretty sure that this is because you are using the cosine metric when you are calling fclusterdata. Try using euclidean and see if the error goes away.

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure that the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, then you'll need to normalize your data such that the dot product of two vectors is never greater than 1. See the formula on this page to see what the cosine metric is defined as in Scipy.

Edit:

Well, from looking at the source code I think that the formula listed on that page isn't actually the formula that Scipy uses (which is good, because the source code looks like it is using the normal, correct cosine distance formula). However, by the time it gets to the linkage, there are clearly some negative values in the linkage for whatever reason. Try finding the distance between your vectors with scipy.spatial.distance.pdist() with metric='cosine' and check for negative values. If there aren't any, then it has to do with how the linkage is formed from the distance values.
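A quick way to run that check (data here is a hypothetical stand-in for your vectors):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical observation matrix: one row per vector.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 0.5]])

d = pdist(data, metric='cosine')   # condensed pairwise cosine distances
negatives = d[d < 0]               # any entries here are round-off artifacts
print(negatives)
```

If this prints a non-empty array, the negatives come from the distance computation itself; if it prints an empty array, look at the linkage step instead.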

唔猫 2024-09-04 02:55:24


"Linkage Z contains negative values". This error also occurs in scipy heirarchical clustering process when any linkage cluster index in linkage matrix is assigned -1.

As per my observations, any linkage cluster index gets assigned -1 during the combine processs, when the distance between all pairs of clusters or points to combine, comes out to be minus infinity. So linkage function combines clusters with even if linkage distance between them is -infinite. And assign one of the cluster or point negative index

summary
So the point is, if you are using cosine distance as metric and if the norm or magnitude of any data point is zero, then this error will occurs
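A minimal reproduction of that failure mode (the all-zero row is deliberate):

```python
import numpy as np
from scipy.spatial.distance import pdist

# The first row is the zero vector: it has no direction, so its cosine
# distance to anything involves dividing by a zero norm and comes out
# non-finite (typically nan), which then poisons the linkage step.
data = np.array([[0.0, 0.0],
                 [1.0, 2.0]])

with np.errstate(invalid='ignore', divide='ignore'):
    d = pdist(data, metric='cosine')

print(d)
```

Dropping or re-scaling zero-norm rows before clustering, or switching to a metric defined for them (e.g. euclidean), avoids the problem.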

默嘫て 2024-09-04 02:55:24


I had the same issue. What you can do is rewrite the cosine function.
For example:

from sklearn.metrics.pairwise import cosine_similarity
def mycosine(x1, x2):
    x1 = x1.reshape(1,-1)
    x2 = x2.reshape(1,-1)
    ans = 1 - cosine_similarity(x1, x2)
    return max(ans[0][0], 0)

...

clusters = hierarchy.fclusterdata(data, threshold, criterion='distance', metric=mycosine, method='average')

浅唱ヾ落雨殇 2024-09-04 02:55:24


I'm not able to improve on Justin's answer, but another point of note is your data handling.

You say you do something like int( float("0.0003") * 10000 ) to read the data. But if you do that you get not 3 but 2.9999999999999996, because the floating-point inaccuracy just gets multiplied.
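You can see the effect directly:

```python
# The multiplication amplifies the representation error in float("0.0003").
x = float("0.0003") * 10000
print(repr(x))          # 2.9999999999999996
print(int(x))           # truncation gives 2, not 3
print(int(round(x)))    # rounding instead of truncating recovers 3
```

So even before any clustering, the "inflated" integer vectors may be off by one in the last digit wherever truncation lands just below the intended value.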

A better, or at least more accurate, way would be to do the multiplication on the string.
That is, use string manipulation to get from 0.0003 to 3.0 and so forth.

Perhaps there is even a Python data type extension somewhere which can read in this kind of data without loss of precision, on which you can perform the multiplication before conversion. I'm not at home in SciPy/numerics, so I don't know.

EDIT

Justin commented that there is a decimal type built into Python. It can interpret strings, multiply by integers, and convert to floats (I tested that). That being the case, I would recommend updating your logic like:

import decimal

factor = 1
if inflate:
    factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

That would reduce your rounding problems a bit.
