Continuous mutual information in Python



[Frontmatter] (skip this if you just want the question):

I'm currently looking at using Shannon-Weaver Mutual Information and normalized redundancy to measure the degree of information masking between bags of discrete and continuous feature values, organized by feature. Using this method, it is my goal to construct an algorithm that looks very similar to ID3, but instead of using Shannon entropy, the algorithm will seek (as a loop constraint) to maximize or minimize shared information between a single feature and a collection of features based on the complete input feature space, adding new features to the latter collection if (and only if) they increase or decrease mutual information, respectively. This, in effect, moves ID3's decision algorithm into pairspace, stapling an ensemble approach to it with all of the expected time and space complexities of both methods.
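
As a rough illustration only (the mutual_info helper and its set-valued second argument are hypothetical, not part of the original post), the loop constraint described above might be sketched as:

# Hypothetical sketch of the greedy loop described above; mutual_info
# is an assumed helper returning MI between one feature and a feature set.
def grow_feature_set(feature, candidates, mutual_info, maximize=True):
    chosen, current = [], 0.0
    for cand in candidates:
        trial = mutual_info(feature, chosen + [cand])
        # Keep the candidate only if it moves MI in the desired direction.
        if (trial > current) if maximize else (trial < current):
            chosen.append(cand)
            current = trial
    return chosen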

[/Frontmatter]

On to the question: I'm trying to get a continuous integrator working in Python using SciPy. Because I'm working with comparisons of discrete and continuous variables, my current strategy for each comparison for feature-feature pairs is as follows:

  • Discrete feature versus discrete feature: use the discrete form of mutual information. This results in a double summation of the probabilities, which my code handles without issue (a sketch of this case appears after this list).

  • All other cases (discrete versus continuous, the inverse, and continuous versus continuous): use the continuous form, using a Gaussian estimator to smooth out the probability density functions.
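
To make the first bullet concrete, here is a minimal sketch of the discrete double summation (assuming x and y are equal-length sequences of hashable labels; this helper is an illustration, not the original code):

from collections import Counter
from math import log

def discrete_mutual_info(x, y):
    n = float(len(x))
    p_x, p_y, p_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # MI(X;Y) = sum over (a, b) of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum((c / n) * log((c / n) / ((p_x[a] / n) * (p_y[b] / n)))
               for (a, b), c in p_xy.items())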

It is possible for me to perform some kind of discretization for the latter cases, but because my input data sets are not inherently linear, this is potentially needlessly complex.
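
For comparison, that discretization route would look roughly as follows; numpy.histogram2d and the bin count of 32 are assumptions for illustration, not something the post commits to:

import numpy as np

def binned_mutual_info(x, y, bins=32):
    # Bin jointly, then apply the discrete formula to the histogram.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # row marginal
    p_y = p_xy.sum(axis=0, keepdims=True)  # column marginal
    mask = p_xy > 0  # skip empty cells, since 0 * log 0 -> 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))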

Here's the salient code:

import math
import numpy
import scipy
from scipy.stats import gaussian_kde
from scipy.integrate import dblquad

# Constants
MIN_DOUBLE = 4.9406564584124654e-324 
                    # The minimum size of a Float64; used here to prevent the
                    #  logarithmic function from hitting its undefined region
                    #  at its asymptote of 0.
INF = float('inf')  # The floating-point representation for "infinity"

# x and y are previously defined as collections of 
# floating point values with the same length

# Kernel estimation
gkde_x = gaussian_kde(x)
gkde_y = gaussian_kde(y)

# Deliberately duplicate exactly one point so gaussian_kde never sees a
# degenerate sample (see the note below).
x.append(x[0])
y.append(y[0])

gkde_xy = gaussian_kde([x,y])
# Pointwise integrand p(a,b) * log(p(a,b) / (p(a)p(b))); gaussian_kde
# returns length-1 arrays, so [0] extracts the scalar. MIN_DOUBLE keeps
# log() away from its asymptote at 0.
mutual_info = lambda a,b: gkde_xy([a,b])[0] * \
           math.log((gkde_xy([a,b])[0] / (gkde_x(a)[0] * gkde_y(b)[0])) + MIN_DOUBLE)

# Compute MI(X,Y)
(minfo_xy, err_xy) = \
    dblquad(mutual_info, -INF, INF, lambda a: 0, lambda a: INF)

print('minfo_xy =', minfo_xy)

Note that overcounting exactly one point is done deliberately to prevent a singularity in SciPy's gaussian_kde class. As the sizes of x and y approach infinity, this effect becomes negligible.

My current snag is in trying to get multiple integration working against a Gaussian kernel density estimate in SciPy. I've been trying to use SciPy's dblquad to perform the integration, but in the latter case, I receive an astounding spew of the following messages.

When I set numpy.seterr(all='ignore'):

Warning: The occurrence of roundoff error is detected, which prevents
the requested tolerance from being achieved. The error may be
underestimated.

And when I set it to 'call' using an error handler:

Floating point error (underflow), with flag 4

Floating point error (invalid value), with flag 8

Pretty easy to figure out what's going on, right? Well, almost: IEEE 754-2008 and SciPy only tell me what's going on here, not why or how to work around it.

The upshot: minfo_xy generally resolves to nan; its sampling is insufficient to prevent information from becoming lost or invalid when performing Float64 math.
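
One workaround worth trying (an assumption here, not something the post establishes) is to evaluate the integrand in log space via gaussian_kde.logpdf, so the density ratio becomes a difference of logarithms and underflow is deferred to the final exp(), where a result of 0.0 is the correct limit anyway:

# Hypothetical log-space variant of the integrand; reuses the gkde_*
# objects and the argument convention from the code above.
def mutual_info_log(a, b):
    log_pxy = gkde_xy.logpdf([a, b])[0]
    log_px = gkde_x.logpdf(a)[0]
    log_py = gkde_y.logpdf(b)[0]
    # exp() underflows to 0.0 only where the integrand is negligible.
    return math.exp(log_pxy) * (log_pxy - log_px - log_py)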

Is there a general workaround for this problem when using SciPy?

Even better: if there is a robust, canned implementation of continuous mutual information for Python with an interface that takes two collections of floating point values or a merged collection of pairs, it would resolve this complete problem. Please link it if you know of one that exists.
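
For what it's worth, one packaged candidate (untested against this use case) is scikit-learn's k-nearest-neighbour estimator mutual_info_regression, which sidesteps both KDE and numerical integration; a minimal invocation:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# scikit-learn expects a 2-D feature matrix; one column here.
x_arr = np.asarray(x, dtype=float).reshape(-1, 1)
mi_estimate = mutual_info_regression(x_arr, np.asarray(y, dtype=float))[0]
print('kNN MI estimate (nats):', mi_estimate)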

Thank you in advance.


Edit: this resolves the nan propagation issue in the example above:

mutual_info = lambda a,b: gkde_xy([a,b])[0] * \
    math.log((gkde_xy([a,b])[0] / ((gkde_x(a)[0] * gkde_y(b)[0]) + MIN_DOUBLE)) \
        + MIN_DOUBLE)

However, the question of roundoff correction remains, as does the request for a more robust implementation. Any help in either domain would be greatly appreciated.
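
On the roundoff side, one hedged attempt is to replace the infinite limits with the KDE's effective support, derived from the sample range plus some padding (the three-standard-deviation pad below is an arbitrary choice for illustration):

import numpy as np

pad_a, pad_b = 3 * np.std(x), 3 * np.std(y)
lo_a, hi_a = min(x) - pad_a, max(x) + pad_a
lo_b, hi_b = min(y) - pad_b, max(y) + pad_b

# Same argument convention as above: the lambda bounds describe the
# inner variable a; the outer integral runs over b.
minfo_xy, err_xy = dblquad(mutual_info, lo_b, hi_b,
                           lambda b: lo_a, lambda b: hi_a)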


Comments (1)

离鸿 2024-12-26 10:16:26


Before trying more radical solutions like reframing the problem or using different integration tools, see if this helps. Replace INF=float('INF') with INF=1E12 or some other large number -- that may eliminate NaN results created by simple arithmetic operations on the input variables.

No promises on this one, but it is sometimes helpful to try a quick fix before engaging in a significant algorithmic rewrite or substitution of alternate tools.
