Scipy negative distance? What?

Published 2024-08-28 02:55:24

I have an input file which contains floating-point numbers to 4 decimal places:

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002    0.0003    0.0007    ... 

(the first is the id).
My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then int()s these numbers. On top of that, I also loop through each vector to ensure that there are no negative values inside. However, when I perform _hclustering, I am continually seeing the error, "Linkage Z contains negative values".

I seriously think this is a bug, because:

  1. I checked my values,
  2. the values are nowhere near small enough or big enough to approach the limits of floating-point numbers, and
  3. the formula that I used to derive the values in the file uses absolute value (my input is DEFINITELY right).

Can someone enlighten me as to why I am seeing this weird error? What is going on that is causing this negative distance error?

=====

import operator
import os
from os.path import join as pjoin
from collections import defaultdict
from scipy.cluster import hierarchy

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance; we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open(loc) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:
                vectors[l[0]] = map(lambda x: int(x*10000), scores)     #int might save space
            else:
                vectors[l[0]] = scores

    if assertAllPositive:
        #Assert that no vector contains a negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map(lambda x: x < 0, l)):
                self.werror("Vector %s has negative values!" % dirID)
    return vectors

def main( self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vectors from a file and starts clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname))
    for threshold in map( lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """Function which you should call to vary the threshold.
    vectors:    { featureID: [ tfidf score, tfidf score, ... ] }
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata(vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        # Group feature IDs by the flat cluster index assigned to each vector
        for i, featureID in enumerate(vectors.keys()):
            clusters[results[i]].append(featureID)
    return clusters


撞了怀 2024-09-04 02:55:24


This is because of floating-point inaccuracy: some distances between your vectors, instead of being 0, come out as, for example, -0.000000000000000002. Use the numpy.clip() function to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr, 0, 1, dmatr) and you should be OK.
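A sketch of that suggestion (the data here is hypothetical): compute the condensed distance matrix with pdist, clip it in place, then build the linkage from it yourself.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Hypothetical data: each row is one vector. Row 1 is parallel to row 0,
# so their cosine distance is 0 up to round-off (it can come out as a
# tiny negative number such as -2e-16).
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [1.0, 0.0, 0.0]])

dmatr = pdist(data, metric='cosine')   # condensed distance matrix
np.clip(dmatr, 0, 1, out=dmatr)        # clamp tiny negatives to 0 in place
Z = hierarchy.linkage(dmatr, method='average')
```

One caveat: cosine distance can legitimately reach 2 for opposed vectors, so clipping to [0, 1] implicitly assumes non-negative data; clipping only the lower bound, np.clip(dmatr, 0, None, out=dmatr), is the safer general fix.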

甜味超标? 2024-09-04 02:55:24


I'm pretty sure that this is because you are using the cosine metric when you are calling fclusterdata. Try using euclidean and see if the error goes away.

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure that the dot products are greater than 1 a lot of the time in your data set. If you want to use the cosine metric, then you'll need to normalize your data such that the dot product of two vectors is never greater than 1. See the formula on this page to see what the cosine metric is defined as in Scipy.

Edit:

Well, from looking at the source code I think that the formula listed on that page isn't actually the formula that Scipy uses (which is good, because the source code looks like it is using the normal, correct cosine distance formula). However, by the time it gets to the linkage, there are clearly some negative values in the linkage for whatever reason. Try finding the distance between your vectors with scipy.spatial.distance.pdist() with metric='cosine' and check for negative values. If there aren't any, then it has to do with how the linkage is formed from the distance values.
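A quick way to run that check (data here is a hypothetical stand-in for your vectors):

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical observation matrix: one row per vector.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 0.5]])

d = pdist(data, metric='cosine')   # condensed pairwise cosine distances
negatives = d[d < 0]               # any entries here are round-off artifacts
print(negatives)
```

If this prints a non-empty array, the negatives come from the distance computation itself; if it prints an empty array, look at the linkage step instead.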

唔猫 2024-09-04 02:55:24


"Linkage Z contains negative values". This error also occurs in scipy heirarchical clustering process when any linkage cluster index in linkage matrix is assigned -1.

As per my observations, any linkage cluster index gets assigned -1 during the combine processs, when the distance between all pairs of clusters or points to combine, comes out to be minus infinity. So linkage function combines clusters with even if linkage distance between them is -infinite. And assign one of the cluster or point negative index

summary
So the point is, if you are using cosine distance as metric and if the norm or magnitude of any data point is zero, then this error will occurs
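A minimal reproduction of that failure mode (the all-zero row is deliberate):

```python
import numpy as np
from scipy.spatial.distance import pdist

# The first row is the zero vector: it has no direction, so its cosine
# distance to anything involves dividing by a zero norm and comes out
# non-finite (typically nan), which then poisons the linkage step.
data = np.array([[0.0, 0.0],
                 [1.0, 2.0]])

with np.errstate(invalid='ignore', divide='ignore'):
    d = pdist(data, metric='cosine')

print(d)
```

Dropping or re-scaling zero-norm rows before clustering, or switching to a metric defined for them (e.g. euclidean), avoids the problem.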

默嘫て 2024-09-04 02:55:24


I had the same issue. What you can do is rewrite the cosine function.
For example:

from sklearn.metrics.pairwise import cosine_similarity
def mycosine(x1, x2):
    x1 = x1.reshape(1,-1)
    x2 = x2.reshape(1,-1)
    ans = 1 - cosine_similarity(x1, x2)
    return max(ans[0][0], 0)

...

clusters = hierarchy.fclusterdata(data, threshold, criterion='distance', metric=mycosine, method='average')

浅唱ヾ落雨殇 2024-09-04 02:55:24


I'm not able to improve on Justin's answer, but another point of note is your data handling.

You say you do something like int( float("0.0003") * 10000 ) to read the data. But if you do that you get not 3 but 2.9999999999999996, because the floating-point inaccuracy just gets multiplied.
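You can see the effect directly:

```python
# The multiplication amplifies the representation error in float("0.0003").
x = float("0.0003") * 10000
print(repr(x))          # 2.9999999999999996
print(int(x))           # truncation gives 2, not 3
print(int(round(x)))    # rounding instead of truncating recovers 3
```

So even before any clustering, the "inflated" integer vectors may be off by one in the last digit wherever truncation lands just below the intended value.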

A better, or at least more accurate, way would be to do the multiplication on the string.
That is, use string manipulation to get from 0.0003 to 3.0 and so forth.

Perhaps there is even a Python data type extension somewhere which can read in this kind of data without loss of precision, on which you can perform the multiplication before conversion. I'm not at home in SciPy/numerics, so I don't know.

EDIT

Justin commented that there is a decimal type built into Python. It can interpret strings, multiply by integers, and convert to floats (I tested that). That being the case, I would recommend updating your logic like:

import decimal

factor = 1
if inflate:
    factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

That would reduce your rounding problems a bit.
