Optimizing space for an array of tribools

Posted on 2024-10-07 10:44:25

Let me start with some background:

By "tribool" I understand a variable which can hold one of the following values: true, false or null.

In the question "Copying array of ints vs pointers to bools", the OP wanted to have an array of tribools (more or less) which would be as small as possible.

With "a bit of" the most basic bit-fu I came up with a solution which used 2 bits per tribool and allowed storing the OP's array of 64 tribools in 16 bytes, which is OK.

The tribool mechanics I used were simple, like:

  • boolean A means "null or not null",
  • boolean B means "true or false if not null".

But then I thought... An algorithmic definition of a "bit" is:

A bit is the amount of information which specifies which of two equally probable events shall occur.

Clearly a true/false value is 1 bit big. Two true/false values as a whole are 2 bits big.

And what about our conceptual tribool?

My point is: In terms of the size of contained information, a tribool is bigger than 1 bit but smaller than 2 bits.

  • Justification 1: Assume we implement our tribool as described above. If boolean A is "null", the value of boolean B is redundant and doesn't carry any relevant information.
  • Justification 2: It's impossible to store information from 2 independent boolean values in one tribool, so it has

(None of the above is a formal proof, but I believe we can agree that the "size" of a tribool is strictly bigger than 1 bit and strictly smaller than 2 bits.)


My question is:

How can one programmatically take advantage of the fact that a tribool carries less than 2 bits of information, and implement in software (C, C++?) an array of N tribools whose memory footprint is smaller than N/4 bytes for some N?

Yes, I do understand that such an implementation isn't really hardware-friendly and would perform slower than any common solution with redundancy (such as those presented in the OP's question). Let's just optimize for space, not for efficiency.

Clearly this implementation needs a different representation of a tribool than a pair of bools (which is by itself redundant, as described before). The theory says it's possible to achieve that goal, and I would like to see an actual implementation. Any ideas?

Comments (5)

不顾 2024-10-14 10:44:26

Your intuition is correct, this is certainly possible. This is basically a form of arithmetic coding, or at least a simple instance of it.

The easiest way to think of it is to imagine encoding your array of "tribools" as a number in base 3 - e.g. 0=FALSE, 1=TRUE, 2=NULL. Then the following array:

{TRUE, FALSE, NULL, NULL, FALSE, FALSE, TRUE}

encodes to the number

1022001

which you can then convert to decimal in the normal way:

(1*3^0)+(0*3^1)+(0*3^2)+(2*3^3)+(2*3^4)+(0*3^5)+(1*3^6) = 946

Each tribool takes up ln(3)/ln(2) = log2(3) bits (about 1.58), and since 3^20 < 2^32 you can store 20 tribools in 32 bits with this method. An N=20 array therefore fits in 4 bytes, less than the N/4 = 5 bytes needed at 2 bits per tribool.
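The base-3 idea above can be sketched in C as follows. This is a minimal illustration, not a definitive implementation: the names `tribool_pack` and `tribool_unpack` and the `enum tribool` type are hypothetical, and the sketch assumes the first array element is the most significant base-3 digit, which matches the worked example (the array shown encodes to 946).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Digit assignment taken from the answer: 0=FALSE, 1=TRUE, 2=NULL. */
enum tribool { TB_FALSE = 0, TB_TRUE = 1, TB_NULL = 2 };

/* Pack up to 20 tribools into one 32-bit word (3^20 < 2^32).
   v[0] becomes the most significant base-3 digit. */
uint32_t tribool_pack(const enum tribool *v, size_t n)
{
    assert(n <= 20);                 /* more would overflow 32 bits */
    uint32_t packed = 0;
    for (size_t i = 0; i < n; i++)
        packed = packed * 3u + (uint32_t)v[i];
    return packed;
}

/* Inverse of tribool_pack; n must be the count used when packing. */
void tribool_unpack(uint32_t packed, enum tribool *out, size_t n)
{
    assert(n <= 20);
    for (size_t i = n; i-- > 0; ) {  /* peel digits off, least significant first */
        out[i] = (enum tribool)(packed % 3u);
        packed /= 3u;
    }
}
```

Reading or changing a single element means dividing by a power of three, so access is slower than with 2-bit fields; that is the trade-off the question accepts.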

溺孤伤于心 2024-10-14 10:44:26

You can theoretically pack X N-state variables into

ln(N^X) / ln M

M-state variables (or log_M (N^X) in LaTeX-like notation), rounded up to a whole number. For storing X tri-state variables in binary digits the formula above becomes:

ln(3^X) / ln 2

In an 8-bit byte, for example, you could fit 5 tri-state variables.

Unpacking/Modifying those values would be a lot harder and slower as you pack variables more densely. In the example above you would have to recalculate the whole byte in order to change a single tri-state variable.

It should be noted that a byte for 5 tri-state variables is quite space-efficient. The density stays at 5 values per byte until you have a pack of 22 bytes, which can fit 111 tri-state values instead of 110. Handling that kind of packing would be a mess, though.
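The capacity figures above can be checked with a small helper. This is only a sketch: `tristate_capacity` is a hypothetical name, and log2(3) is hard-coded as a constant so that the example does not depend on `<math.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* log2(3), hard-coded so the sketch needs no math library. */
static const double LOG2_3 = 1.5849625007211562;

/* Largest number of tri-state values whose 3^count combinations fit into
   `bytes` 8-bit bytes, i.e. floor(8 * bytes / log2(3)).
   Floating point is adequate for small buffers; for very large ones an
   exact integer comparison of 3^count against 2^(8*bytes) would be safer. */
size_t tristate_capacity(size_t bytes)
{
    assert(bytes <= SIZE_MAX / 8);   /* guard the multiplication below */
    return (size_t)((double)(8 * bytes) / LOG2_3);
}
```

With this helper, 1 byte gives 5 values, 21 bytes give 105, and 22 bytes give 111, which agrees with the numbers in the answer.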

Is any of this worth the extra work in comparison to directly storing 4 tri-state values in a byte?

嗳卜坏 2024-10-14 10:44:26

This solution requires you to know up front how many "non-null" values you're going to have (i.e. during compile time, or if you could start counting how many non-nulls there are before making the space available).

You could then encode it the following way:

0 for null
1 for non-null, followed by 1 or 0 for true or false.

This would result in a max of 2 bits per tribool, and just 1 bit if they're all null.
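A rough sketch of this variable-length scheme in C is shown below. The function names and the MSB-first bit order are assumptions made for illustration, and the buffer is sized for the worst case of 2 bits per value rather than from a known count of non-null values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum tribool { TB_FALSE = 0, TB_TRUE = 1, TB_NULL = 2 };

/* Append one bit at position *bitpos (MSB-first within each byte). */
static void put_bit(uint8_t *buf, size_t *bitpos, int bit)
{
    if (bit)
        buf[*bitpos / 8] |= (uint8_t)(0x80u >> (*bitpos % 8));
    (*bitpos)++;
}

/* Read one bit at position *bitpos and advance. */
static int get_bit(const uint8_t *buf, size_t *bitpos)
{
    int bit = (buf[*bitpos / 8] >> (7 - *bitpos % 8)) & 1;
    (*bitpos)++;
    return bit;
}

/* Encode n tribools as "0" (null), "10" (false) or "11" (true).
   buf must be zero-initialised and hold at least (2*n + 7) / 8 bytes.
   Returns the number of bits written. */
size_t tribool_encode_prefix(const enum tribool *v, size_t n, uint8_t *buf)
{
    assert(v != NULL && buf != NULL);
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == TB_NULL) {
            put_bit(buf, &bitpos, 0);
        } else {
            put_bit(buf, &bitpos, 1);
            put_bit(buf, &bitpos, v[i] == TB_TRUE);
        }
    }
    return bitpos;
}

/* Decode n tribools from buf; the inverse of tribool_encode_prefix. */
void tribool_decode_prefix(const uint8_t *buf, size_t n, enum tribool *out)
{
    assert(buf != NULL && out != NULL);
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        if (!get_bit(buf, &bitpos))
            out[i] = TB_NULL;
        else
            out[i] = get_bit(buf, &bitpos) ? TB_TRUE : TB_FALSE;
    }
}
```

Because the codes have different lengths, element i cannot be located without decoding everything before it, so this layout suits sequential reading better than random access.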

温馨耳语 2024-10-14 10:44:26

@psmears is right, for the case where all 3 values are equally likely.
However, if they are not equally likely, or not independent, and you have a long enough string of them, you could just use your 2-bit encoding (or any other) and run gzip on it. That should compress it down to about the theoretical limit.
For example, in the limit where all the values are 0, the output should be not much more than the log of the length of the string.

BTW: We're talking about entropy here. A simple definition in this case is -P(0)log P(0) - P(1)log P(1) - P(null)log P(null), with logarithms taken to base 2. So, for example, if P(0) = P(1) = 1/2, and P(null) = 0, then the entropy is 1 bit. If P(0) = 1/2, P(1) = 1/4, P(null) = 1/4, then the entropy is 1/2 * 1 + 1/4 * 2 + 1/4 * 2 = 1.5 bits. If the probabilities are 1022/1024, 1/1024, 1/1024, then the entropy is (almost 1)*(almost 0) + 10/1024 + 10/1024, which is about equal to 20/1024, or about 2 hundredths of a bit! The more certain something is, the less it tells you when it occurs, so the less storage it needs.
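Written out as formulas, the entropy definition and two of the cases discussed in this thread look like this (a worked restatement, with logarithms assumed to be base 2 so that the result is in bits):

```latex
H = -\sum_{x \in \{0,\,1,\,\text{null}\}} P(x)\,\log_2 P(x)

% Uniform case, P(0) = P(1) = P(null) = 1/3:
H = 3 \cdot \tfrac{1}{3}\,\log_2 3 = \log_2 3 \approx 1.585 \text{ bits}

% Skewed case, probabilities 1022/1024, 1/1024, 1/1024:
H = \tfrac{1022}{1024}\,\log_2\tfrac{1024}{1022}
    + 2 \cdot \tfrac{1}{1024}\,\log_2 1024
  \approx 0.0028 + 0.0195 \approx 0.022 \text{ bits}
```

The uniform case reproduces the 1.58 bits per tribool of the base-3 packing, which is why that packing cannot be improved when all three values are equally likely and independent.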

烂柯人 2024-10-14 10:44:26

I like the solution proposed by @psmears, but its drawback is that it's slower than the direct approach. You can use a slightly modified version that should also be fast:

3**5 == 243, which is almost 256. This means that you can easily squeeze 5 tribool values into a byte. It has the same compression ratio, but because each byte is independent, it can be implemented using LUTs:

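static const unsigned char POW3_LUT[5] = {1, 3, 9, 27, 81};   // 3^num
static unsigned char LUT[5][243];   // assumed layout: LUT[num][pk] = base-3 digit num of pk
void init_tribool_luts(void)
{ // fill LUT once before the first lookup
    for (int num = 0; num < 5; num++)
        for (int pk = 0; pk < 243; pk++)
            LUT[num][pk] = (pk / POW3_LUT[num]) % 3;
}

unsigned char get_packed_tribool(unsigned char pk, int num)
{ // num = (0..4), pk = (0..242)
    return LUT[num][pk];    // 5*243 bytes of LUTs
}

unsigned char update_packed_tribool(unsigned char old_pk, int num, int new_val)
{ // new_val = 0..2; replace digit num by adjusting its weighted contribution
    return old_pk + (new_val - LUT[num][old_pk])*POW3_LUT[num];
}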
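To use the per-byte packing for a whole array of N tribools, element i can be mapped to byte i / 5 and digit position i % 5. The sketch below is one possible way to do that; the `tribool_array_*` names are hypothetical, and it computes the digits with division and modulo instead of lookup tables so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Weights of the five base-3 digit positions inside one byte. */
static const uint8_t POW3[5] = {1, 3, 9, 27, 81};

/* Bytes needed for n tribools at 5 per byte. */
size_t tribool_array_bytes(size_t n)
{
    return (n + 4) / 5;
}

/* Read element i (0=FALSE, 1=TRUE, 2=NULL). */
unsigned tribool_array_get(const uint8_t *a, size_t i)
{
    return (unsigned)(a[i / 5] / POW3[i % 5]) % 3u;
}

/* Overwrite element i with val (0..2); only the containing byte changes. */
void tribool_array_set(uint8_t *a, size_t i, unsigned val)
{
    assert(val <= 2);
    int delta = (int)val - (int)tribool_array_get(a, i);
    a[i / 5] = (uint8_t)(a[i / 5] + delta * POW3[i % 5]);
}
```

A zero-filled buffer represents an array of all FALSE values under this digit assignment. For N = 64, as in the original question, this needs 13 bytes instead of 16.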