How do I compute the approximate entropy of a bit string?


Is there a standard way to do this?

Googling -- "approximate entropy" bits -- uncovers multiple academic papers but I'd like to just find a chunk of pseudocode defining the approximate entropy for a given bit string of arbitrary length.

(In case this is easier said than done and it depends on the application, my application involves 16,320 bits of encrypted data (ciphertext). But it is encrypted as a puzzle, not meant to be impossible to crack. I thought I'd first check the entropy, but couldn't easily find a good definition of it. So it seemed like a question that ought to be on StackOverflow! Ideas for where to begin with deciphering 16k random-seeming bits are also welcome...)

See also this related question:
What is the computer science definition of entropy?


天涯离梦残月幽梦 2024-09-11 06:59:16

Entropy is not a property of the string you got, but of the strings you could have obtained instead. In other words, it qualifies the process by which the string was generated.

In the simplest case, you get one string from a set of N possible strings, where each string has the same probability of being chosen as every other, i.e. 1/N. In that situation, the string is said to have an entropy of N. The entropy is often expressed in bits, which is a logarithmic scale: an entropy of "n bits" corresponds to N = 2^n equally likely possibilities.

For instance: I like to generate my passwords as two lowercase letters, then two digits, then two lowercase letters, and finally two digits (e.g. va85mw24). Letters and digits are chosen randomly, uniformly, and independently of each other. This process may produce 26*26*10*10*26*26*10*10 = 4569760000 distinct passwords, and all these passwords have equal chances to be selected. The entropy of such a password is then 4569760000, which means about 32.1 bits.
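
A one-liner in Python makes that last calculation concrete (a minimal sketch, using only the numbers given above):

import math

# Number of equally likely passwords for the pattern: letter letter digit digit letter letter digit digit
n_possible = 26 * 26 * 10 * 10 * 26 * 26 * 10 * 10   # 4,569,760,000

# With N equally likely outcomes, the entropy in bits is log2(N)
print(math.log2(n_possible))   # ≈ 32.09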

2024-09-11 06:59:16

Shannon's entropy equation is the standard method of calculation. Here is a simple implementation in Python, shamelessly copied from the Revelation codebase, and thus GPL licensed:

import math


def entropy(string):
    "Calculates the Shannon entropy of a string"

    # get probability of chars in string
    prob = [ float(string.count(c)) / len(string) for c in dict.fromkeys(list(string)) ]

    # calculate the entropy
    entropy = - sum([ p * math.log(p) / math.log(2.0) for p in prob ])

    return entropy


def entropy_ideal(length):
    "Calculates the ideal Shannon entropy of a string with given length"

    prob = 1.0 / length

    return -1.0 * length * prob * math.log(prob) / math.log(2.0)

Note that this implementation assumes that your input bit-stream is best represented as bytes. This may or may not be the case for your problem domain. What you really want is your bitstream converted into a string of numbers. Just how you decide on what those numbers are is domain specific. If your numbers really are just one and zeros, then convert your bitstream into an array of ones and zeros. The conversion method you choose will affect the results you get, however.
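
For example, here is one way (an assumption about how you might want to slice the data, not the only way) to feed raw bytes into the entropy() function above, either byte by byte or bit by bit; the byte values are placeholders, not the actual ciphertext:

# Hypothetical helper: expand raw bytes into a string of '0'/'1' characters
def to_bit_string(data):
    return ''.join(format(byte, '08b') for byte in data)

ciphertext = bytes([0x3a, 0x7f, 0x00, 0xc5])      # stand-in for the real 16,320-bit data
print(entropy(ciphertext.decode('latin-1')))      # per-byte entropy, at most 8 bits per symbol
print(entropy(to_bit_string(ciphertext)))         # per-bit entropy, at most 1 bit per symbol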

篱下浅笙歌 2024-09-11 06:59:16

I believe the answer is the Kolmogorov Complexity of the string.
Not only is this not answerable with a chunk of pseudocode, but Kolmogorov complexity is not even a computable function!

One thing you can do in practice is compress the bit string with the best available data compression algorithm.
The more it compresses, the lower the entropy.
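
A rough, practical version of that idea (a sketch using zlib as a stand-in for the "best available" compressor; the ratio is only an upper bound and is inflated for short inputs by header overhead):

import zlib

def compressed_bits_per_bit(bit_string):
    "Upper-bound estimate of entropy per input bit, with zlib as the model"
    n = len(bit_string)
    # Pack the '0'/'1' characters into actual bytes so zlib sees bits, not ASCII
    packed = int(bit_string, 2).to_bytes((n + 7) // 8, 'big')
    return len(zlib.compress(packed, 9)) * 8 / n

print(compressed_bits_per_bit("01" * 8160))   # a very regular 16,320-bit string: far below 1 bit per bit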

为人所爱 2024-09-11 06:59:16

The NIST Random Number Generator evaluation toolkit has a way of calculating "Approximate Entropy." Here's the short description:

Approximate Entropy Test Description: The focus of this test is the frequency of each and every overlapping m-bit pattern. The purpose of the test is to compare the frequency of overlapping blocks of two consecutive/adjacent lengths (m and m+1) against the expected result for a random sequence.

And a more thorough explanation is available from the PDF on this page:

http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html
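
Based on that description (my own sketch from the SP 800-22 text, not the toolkit's code), a minimal version for a string of '0' and '1' characters looks like this; it is near 0 for very regular strings and approaches ln(2) ≈ 0.693 for ideally random bits:

import math
from collections import Counter

def approximate_entropy(bits, m):
    "Approximate entropy over overlapping m- and (m+1)-bit patterns of a '0'/'1' string"
    n = len(bits)

    def phi(block_len):
        # Wrap the sequence around so every position starts an overlapping block
        extended = bits + bits[:block_len - 1]
        counts = Counter(extended[i:i + block_len] for i in range(n))
        return sum((c / n) * math.log(c / n) for c in counts.values())

    return phi(m) - phi(m + 1)

print(approximate_entropy("01" * 50, 2))                                     # ≈ 0.0 for a perfectly regular string
print(approximate_entropy("0110100111010010110010111000110101001110", 2))    # higher for irregular bits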

-残月青衣踏尘吟 2024-09-11 06:59:16

There is no single answer. Entropy is always relative to some model. When someone talks about a password having limited entropy, they mean "relative to the ability of an intelligent attacker to predict", and it's always an upper bound.

Your problem is, you're trying to measure entropy in order to help you find a model, and that's impossible; what an entropy measurement can tell you is how good a model is.

Having said that, there are some fairly generic models that you can try; they're called compression algorithms. If gzip can compress your data well, you have found at least one model that can predict it well. And gzip is, for example, mostly insensitive to simple substitution: it can handle "wkh" appearing frequently in the text as easily as it can handle "the".
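
A toy demonstration of that last point (the sample text is made up; gzip keys on repeated structure, not on which particular letters appear):

import gzip

text = ("the quick brown fox jumps over the lazy dog. " * 200).encode()
substituted = text.replace(b"the", b"wkh")   # crude stand-in for a simple substitution

# Both versions share the same repetitive structure, so they compress to nearly the same size
print(len(gzip.compress(text)), len(gzip.compress(substituted)))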

半窗疏影 2024-09-11 06:59:16

Use the Shannon entropy of the string, H = -Σᵢ pᵢ log₂(pᵢ), as shown in this formula image: https://i.sstatic.net/GBBJG.jpg

Here's an O(n) algorithm that calculates it:

import math
from collections import Counter


def entropy(s):
    # Sum -p * log2(p) over the relative frequency p of each distinct symbol in s
    l = float(len(s))
    return -sum(map(lambda a: (a/l)*math.log2(a/l), Counter(s).values()))
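
A quick sanity check on short bit strings (values worked out by hand):

print(entropy("0011" * 2))   # 1.0, an even mix of 0s and 1s
print(entropy("0010" * 2))   # about 0.811, since 0s outnumber 1s three to one
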
阪姬 2024-09-11 06:59:16

Here's an implementation in Python (I also added it to the Wiki page):

import numpy as np

def ApEn(U, m, r):
    "Approximate entropy of the sequence U with template length m and tolerance r"

    def _maxdist(x_i, x_j):
        # Chebyshev distance between two templates
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        # All overlapping length-m templates, and for each one the fraction of
        # templates that stay within tolerance r of it
        x = [[U[j] for j in range(i, i + m - 1 + 1)] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)

    return _phi(m) - _phi(m + 1)

Example:

>>> U = np.array([85, 80, 89] * 17)
>>> ApEn(U, 2, 3)
1.0996541105257052e-05

The above example is consistent with the example given on Wikipedia.
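
For the original question's 0/1 data, the bits can be passed in directly; with binary values, any tolerance r below 1 makes the distance check an exact pattern match (the short bit string here is only a placeholder):

bits = np.array([int(b) for b in "0110100111010010"])   # stand-in for the real 16,320 ciphertext bits
print(ApEn(bits, 2, 0.5))   # r < 1: templates must match exactly, bit for bit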
