What is "entropy and information gain"?

Posted 2024-08-14 05:18:10

I am reading this book (NLTK) and it is confusing. Entropy is defined as:

Entropy is the sum of the probability of each label
times the log probability of that same label

How can I apply entropy and maximum entropy in text mining? Can someone give me an easy, simple example (visual)?

随心而道 2024-08-21 05:18:10

I assume entropy was mentioned in the context of building decision trees.

To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model that fits the data and can be used to predict the gender of a new, unseen first name.

name       gender
-----------------        Now we want to predict 
Ashley        f              the gender of "Amro" (my name)
Brian         m
Caroline      f
David         m

The first step is deciding which features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. After feature extraction, our data looks like:

# name    ends-vowel  num-vowels   length   gender
# ------------------------------------------------
Ashley        1         3           6        f
Brian         0         2           5        m
Caroline      1         4           8        f
David         0         2           5        m
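
A minimal Python sketch of this feature-extraction step (the function name gender_features and the vowel set, which counts 'y' as a vowel so the numbers match the table above, are my own choices):

VOWELS = set("aeiouy")   # 'y' is counted as a vowel to reproduce the table above

def gender_features(name):
    """Extract the example features described above from a first name."""
    lower = name.lower()
    return {
        "ends-vowel": int(lower[-1] in VOWELS),
        "num-vowels": sum(ch in VOWELS for ch in lower),
        "length": len(lower),
    }

data = [("Ashley", "f"), ("Brian", "m"), ("Caroline", "f"), ("David", "m")]
for name, gender in data:
    print(name, gender_features(name), gender)
# Ashley {'ends-vowel': 1, 'num-vowels': 3, 'length': 6} f
# ...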

The goal is to build a decision tree. An example of a tree would be:

length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   length=5: male

Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (m or f).

So if we run the name Amro down this tree, we start by testing "is the length<7?" and the answer is yes, so we go down that branch. Following the branch, the next test "is the number of vowels<3?" again evaluates to true. This leads to a leaf node labeled m, and thus the prediction is male (which I happen to be, so the tree predicted the outcome correctly).

The decision tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e., nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of purity is called the information. It represents the expected amount of information that would be needed to specify whether a new instance (first name) should be classified as male or female, given the examples that reached the node. We calculate it based on the number of male and female instances at the node.

Entropy on the other hand is a measure of impurity (the opposite). It is defined for a binary class with values a/b as:

Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

This binary entropy function is depicted in the figure below (the random variable can take one of two values). It reaches its maximum when the probability is p=1/2, meaning that p(X=a)=0.5 or, equivalently, p(X=b)=0.5: a 50%/50% chance of being either a or b (uncertainty is at its maximum). The entropy function reaches its minimum of zero when the probability is p=1 or p=0 with complete certainty (p(X=a)=1 or p(X=a)=0 respectively; the latter implies p(X=b)=1).

(Binary entropy plot: https://en.wikipedia.org/wiki/File:Binary_entropy_plot.svg)
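
A small Python sketch of this binary entropy function (my own helper, using the convention 0*log(0) = 0):

import math

def binary_entropy(p):
    """Entropy = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0          # complete certainty carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0   (maximum uncertainty)
print(binary_entropy(1.0))   # 0.0   (complete certainty)
print(binary_entropy(0.9))   # ~0.469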

Of course, the definition of entropy can be generalized to a discrete random variable X with N outcomes (not just two):

Entropy(X) = - Σ p(x_i) * log(p(x_i))        (summing over the N outcomes x_1, ..., x_N)

(the log in the formula is usually taken as the logarithm to base 2)


Back to our name-classification task, let's look at an example. Imagine that at some point during the process of constructing the tree, we were considering the following split:

     ends-vowel
      [9m,5f]          <--- the [..,..] notation represents the class
    /          \            distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]

As you can see, before the split we had 9 males and 5 females, i.e. P(m)=9/14 and P(f)=5/14. According to the definition of entropy:

Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403

Next, we compare it with the entropy computed after the split by looking at the two child branches. In the left branch, where ends-vowel=1, we have:

Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852

and in the right branch, where ends-vowel=0, we have:

Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917

We combine the left/right entropies using the number of instances down each branch as a weighting factor (7 instances went left and 7 went right), and get the final entropy after the split:

Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885

Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:

Information_Gain = Entropy_before - Entropy_after = 0.1518

You can interpret the above calculation as follows: by splitting on the ends-vowel feature, we were able to reduce the uncertainty in the sub-tree prediction outcome by 0.1518 (measured in bits as units of information).
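
The numbers above can be reproduced with a short Python sketch (the helper name entropy is mine):

import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count distribution, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])                          # 0.9403  (parent node: 9m, 5f)
left, right = entropy([3, 4]), entropy([6, 1])    # 0.9852 and 0.5917
after = 7 / 14 * left + 7 / 14 * right            # 0.7885  (weighted by branch sizes)
gain = before - after                             # 0.1518
print(round(before, 4), round(after, 4), round(gain, 4))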

At each node of the tree, this calculation is performed for every feature, and the feature with the largest information gain is chosen for the split in a greedy manner (thus favoring features that produce pure splits with low uncertainty/entropy). This process is applied recursively from the root-node down, and stops when a leaf node contains instances all having the same class (no need to split it further).

Note that I skipped over some details which are beyond the scope of this post, including how to handle numeric features, missing values, overfitting, pruning trees, etc.
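
As a rough illustration of the greedy step described above, here is a small Python sketch that scores every feature by information gain and picks the best one; the dataset mirrors the toy table earlier, and the helper names are my own (categorical splits only, none of the skipped details handled):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy of the children after splitting on feature."""
    before = entropy(labels)
    after = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

rows = [
    {"ends-vowel": 1, "num-vowels": 3, "length": 6},   # Ashley
    {"ends-vowel": 0, "num-vowels": 2, "length": 5},   # Brian
    {"ends-vowel": 1, "num-vowels": 4, "length": 8},   # Caroline
    {"ends-vowel": 0, "num-vowels": 2, "length": 5},   # David
]
labels = ["f", "m", "f", "m"]

best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
print(best)   # 'ends-vowel' (several features tie on this tiny set; ties go to the first)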

沉默的熊 2024-08-21 05:18:10

To begin with, it would be best to understand the measure of information.

How do we measure the information?

When something unlikely happens, we say it's big news. Also, when something predictable happens, it's not really interesting. So to quantify this interestingness, the function should satisfy:

  • if the probability of the event is 1 (predictable), then the function gives 0
  • if the probability of the event is close to 0, then the function should give a high number
  • if an event with probability 0.5 happens, it gives one bit of information.

One natural measure that satisfies these constraints is

I(X) = -log_2(p)

where p is the probability of the event X. The unit is the bit, the same bit a computer uses: 0 or 1.

Example 1

Fair coin flip:

How much information can we get from one coin flip?

Answer : -log(p) = -log(1/2) = 1 (bit)

Example 2

If a meteor strikes the Earth tomorrow with p = 2^{-22}, then we get 22 bits of information.

If the Sun rises tomorrow with p ≈ 1, then it is 0 bits of information.

Entropy

So if we take the expectation of the interestingness of an event Y, then it is the entropy.
i.e. entropy is the expected value of the interestingness of an event.

H(Y) = E[ I(Y)]

More formally, the entropy is the expected number of bits of information in an event.

Example

Y = 1 : an event X occurs with probability p

Y = 0 : an event X does not occur with probability 1-p

H(Y) = E[I(Y)] = p I(Y==1) + (1-p) I(Y==0) 
     = - p log p - (1-p) log (1-p)

All logarithms are base 2.
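
A quick Python check of these quantities (base-2 logs, as stated above):

import math

def surprisal(p):
    """I(X) = -log2(p): bits of information carried by an event of probability p."""
    return -math.log2(p)

print(surprisal(0.5))        # 1.0  -> one bit from a fair coin flip
print(surprisal(2 ** -22))   # 22.0 -> 22 bits from the (hypothetical) meteor strike

def bernoulli_entropy(p):
    """H(Y) = E[I(Y)] = -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(bernoulli_entropy(0.5))   # 1.0 bit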

白鸥掠海 2024-08-21 05:18:10

I can't give you graphics, but maybe I can give a clear explanation.

Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two adds one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.

Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge), so the lamp must always flash green. The channel is now useless: we know what the next flash will be, so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream over this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). Conversely, if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free) but green flashes are more common than red: the more skewed the probability, the fewer bits you need to describe the sequence, and the less information it contains, all the way down to the all-green, bulb-burnt-out limit.

It turns out there's a way to measure the amount of information in a signal, based on the probabilities of the different symbols. If the probability of receiving symbol x_i is p_i, then consider the quantity

-log p_i

The smaller p_i is, the larger this value. If x_i becomes half as likely (twice as unlikely), this value increases by a fixed amount (log(2)). This should remind you of adding one bit to a message.

If we don't know what the symbol will be (but we know the probabilities), then we can calculate the average of this value, i.e. how much information we expect to get, by summing over the different possibilities:

I = -Σ p_i log(p_i)

This is the information content in one flash.

Red bulb burnt out:            p_red = 0,   p_green = 1,   I = -(0 + 0) = 0
Red and green equiprobable:    p_red = 1/2, p_green = 1/2, I = -(2 * 1/2 * log(1/2)) = log(2)
Three colors, equiprobable:    p_i = 1/3,                  I = -(3 * 1/3 * log(1/3)) = log(3)
Green twice as likely as red:  p_red = 1/3, p_green = 2/3, I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)

This is the information content, or entropy, of the message. It is maximal when the different symbols are equiprobable. If you're a physicist you use the natural log, if you're a computer scientist you use log2 and get bits.
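
A quick check of the four cases above in Python (log base 2, so the answers come out in bits):

import math

def entropy(probs):
    """I = -sum(p * log2(p)) over the symbol probabilities (zero-probability terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0, 1]))             # 0.0                 red bulb burnt out
print(entropy([1/2, 1/2]))         # 1.0   = log2(2)     red and green equiprobable
print(entropy([1/3, 1/3, 1/3]))    # 1.585 = log2(3)     three equiprobable colors
print(entropy([1/3, 2/3]))         # 0.918               green twice as likely as red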

一杆小烟枪 2024-08-21 05:18:10

I really recommend you read about information theory, Bayesian methods and MaxEnt. The place to start is this (freely available online) book by David MacKay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are far more general than just text mining, and I can't really see how one would learn to apply them to NLP without first learning some of the general basics contained in this book or in other introductory books on machine learning and Bayesian MaxEnt methods.

The connection between entropy and probability theory and information processing and storage is really, really deep. To give a taste of it, there is a theorem due to Shannon (the noisy-channel coding theorem) stating that the maximum rate at which you can pass information without error through a noisy communication channel is limited by the channel's capacity, which depends on the entropy of the noise process. There is also a theorem (the source coding theorem) connecting how far you can compress a piece of data, so that it occupies the minimum possible memory in your computer, to the entropy of the process that generated the data.

I don't think it's really necessary for you to go and learn all those communication-theory theorems, but you can't learn this material without learning the basics: what entropy is, how it's calculated, what its relationship is with information and inference, and so on.

夜夜流光相皎洁 2024-08-21 05:18:10

Informally

Entropy is about the availability of information or knowledge. A lack of information leads to difficulty in predicting the future, which is high entropy (e.g., next-word prediction in text mining), while the availability of information/knowledge helps us make a more realistic prediction of the future (low entropy).

Relevant information of any type reduces entropy and helps us predict a more realistic future; that information can be, for example, that the word "meat" is present in a sentence, or that it is not. This is called Information Gain.


Formally

entropy is a lack of order or predictability.
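
As a toy illustration of that idea (all counts below are made up purely for illustration): suppose we want to predict whether a sentence is about cooking, and the feature we test is whether the word "meat" occurs in it.

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Hypothetical corpus: 10 sentences, 5 about cooking and 5 not.
before = entropy([5, 5])                                 # 1.0 bit of uncertainty

# Splitting on "contains 'meat'":
#   4 sentences contain it (3 cooking, 1 not); 6 do not (2 cooking, 4 not).
after = 4/10 * entropy([3, 1]) + 6/10 * entropy([2, 4])
gain = before - after
print(round(gain, 3))   # ~0.125 bits gained by knowing whether "meat" is present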

不及他 2024-08-21 05:18:10

When I was implementing an algorithm to calculate the entropy of an image, I found these links; see here and here.

This is the pseudo-code I used; you'll need to adapt it to work with text rather than images, but the principle should be the same.

//Loop over image array elements and count occurrences of each possible
//pixel to pixel difference value. Store these values in prob_array
for j = 0, ysize-1 do $
    for i = 0, xsize-2 do begin
       diff = array(i+1,j) - array(i,j)
       if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
       endif
     endfor

//Convert values in prob_array to probabilities and compute entropy
n = total(prob_array)

entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n

    //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
    //here and divide final sum by Ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor

entrop = entrop/alog(2)

I got this code from somewhere, but I can't dig out the link.
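
For text, the same idea boils down to computing the entropy of a symbol-frequency distribution. A rough Python equivalent of the pseudo-code (my own adaptation, using characters as the symbols; words would work the same way):

import math
from collections import Counter

def text_entropy(text):
    """Shannon entropy, in bits per symbol, of the character distribution in text."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(text_entropy("aaaa"))   # 0.0 -> one repeated symbol carries no information
print(text_entropy("abab"))   # 1.0 -> two equiprobable symbols: 1 bit per symbol
print(round(text_entropy("entropy and information gain"), 3))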

紫轩蝶泪 2024-08-21 05:18:10

As you are reading a book about NLTK, it would be interesting for you to read about the MaxEnt classifier module: http://www.nltk.org/api/nltk.classify.html#module-nltk.classify.maxent

For text-mining classification, the steps could be: pre-processing (tokenization, stemming, feature selection with information gain, ...), transformation to numeric features (frequency or TF-IDF; I think this is the key step to understand when using text as input to an algorithm that only accepts numbers), and then classification with MaxEnt. Of course, this is just one example.
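
A minimal sketch of such a pipeline with NLTK's MaxentClassifier (the tiny hand-made training set and the bag-of-words features are mine, purely to outline the steps; treat it as a sketch, not a tuned setup):

from nltk.classify import MaxentClassifier

def bag_of_words(text):
    """Very rough pre-processing: lowercase, whitespace tokenization, boolean word-presence features."""
    return {word: True for word in text.lower().split()}

# Tiny made-up training set, just to show the shape of the data.
train = [
    (bag_of_words("the steak and the meat were great"), "food"),
    (bag_of_words("grilled meat with fresh vegetables"), "food"),
    (bag_of_words("the train was late again"), "transport"),
    (bag_of_words("the bus and the train schedules changed"), "transport"),
]

classifier = MaxentClassifier.train(train, algorithm="IIS", max_iter=10)
print(classifier.classify(bag_of_words("meat and vegetables")))   # expected: "food"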
