
Optimizing a tribool array for space

Posted on 2024-10-07 10:44:25

Let me start with some background:

By "tribool" I mean a variable that can hold one of the following values: true, false or null.

In the question "Copying array of ints vs pointers to bools", the OP wanted an array of tribools (more or less) that would be as small as possible.

With "a bit of" the most basic bit-fu I came up with a solution that uses 2 bits per tribool and stores the OP's array of 64 tribools in 16 bytes, which is OK.

The tribool mechanics I used were simple, like:

boolean A means "null or not null",
boolean B means "true or false if not null".
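
The two-boolean scheme above can be sketched as follows. This is only an illustration under assumed names (`Tribool`, `get2`, `set2` are hypothetical, as is the choice of which bit plays the role of A and which of B); it packs 4 tribools per byte, so 64 tribools take 16 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed bit layout: bit 1 is boolean A (1 = not null),
// bit 0 is boolean B (1 = true, only meaningful when A is set).
enum class Tribool : std::uint8_t { Null = 0, False = 2, True = 3 };

// Read tribool i from a buffer that holds 4 tribools per byte.
Tribool get2(const std::uint8_t* buf, std::size_t i) {
    unsigned bits = (buf[i / 4] >> (2 * (i % 4))) & 0x3u;
    if ((bits & 0x2u) == 0) return Tribool::Null;  // B is ignored when A says "null"
    return (bits & 0x1u) ? Tribool::True : Tribool::False;
}

// Overwrite tribool i, leaving the other three in the same byte untouched.
void set2(std::uint8_t* buf, std::size_t i, Tribool v) {
    unsigned shift = 2 * (i % 4);
    buf[i / 4] = static_cast<std::uint8_t>(
        (buf[i / 4] & ~(0x3u << shift)) | (static_cast<unsigned>(v) << shift));
}
```

Note that the bit pattern 01 is never written by `set2` and reads back as null: that unused fourth state is the redundancy the rest of the question is about.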

But then I thought... An algorithmic definition of a "bit" is:

A bit is the amount of information which specifies which of two equally probable events shall occur.

Clearly a true/false value is 1 bit in size. Two true/false values taken together are 2 bits in size.

And what about our conceptual tribool?

My point is: In terms of the size of contained information, a tribool is bigger than 1 bit but smaller than 2 bits.

Justification 1: Assume we implement our tribool as described above. If boolean A is "null", the value of boolean B is redundant and doesn't carry any relevant information.
Justification 2: It's impossible to store the information from 2 independent boolean values in one tribool, so it holds less than 2 bits.

(None of the above is a formal proof, but I believe we can agree that the "size" of a tribool is strictly bigger than 1 bit and strictly smaller than 2.)

My question is:

How can I programmatically take advantage of the fact that a tribool holds less than 2 bits of information, and implement in software (C, C++?) an array of N tribools whose memory footprint is smaller than N/4 bytes for some N?

Yes, I do understand that such an implementation isn't really hardware-friendly and would perform slower than any common solution with redundancy (such as those presented in the OP's question). Let's just optimize for space, not for efficiency.

Clearly this implementation needs a different representation of a tribool than a pair of bools (which is by itself redundant, as described before). The theory says it's possible to achieve that goal, and I'd like to see an actual implementation. Any ideas?

不顾 2024-10-14 10:44:26

Your intuition is correct, this is certainly possible. This is basically a form of arithmetic coding, or at least a simple instance of it.

The easiest way to think of it is to imagine encoding your array of "tribools" as a number in base 3 - e.g. 0=FALSE, 1=TRUE, 2=NULL. Then the following array:

{TRUE, FALSE, NULL, NULL, FALSE, FALSE, TRUE}

encodes to the base-3 number

1022001

which you can then convert to decimal in the normal way:

(1*3^0)+(0*3^1)+(0*3^2)+(2*3^3)+(2*3^4)+(0*3^5)+(1*3^6) = 946

Each tribool takes up ln(3)/ln(2) bits (about 1.58), and 3^20 = 3486784401 is less than 2^32 = 4294967296, so this method stores 20 tribools in 32 bits: an N=20 array fits in 4 bytes (where N/4 is 5).
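
A minimal sketch of this base-3 packing, assuming a single 32-bit word and at most 20 tribools. The names `Trit`, `pack_trits` and `unpack_trits` are made up for illustration; the digit values (0=FALSE, 1=TRUE, 2=NULL) and the digit order (first array element is the most significant digit) follow the worked example above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Digit values as in the answer: 0 = FALSE, 1 = TRUE, 2 = NULL.
enum Trit : std::uint8_t { T_FALSE = 0, T_TRUE = 1, T_NULL = 2 };

constexpr std::size_t kMaxTrits = 20;  // 3^20 = 3486784401 < 2^32

// Pack n trits (n <= 20) into one 32-bit word, first element most significant.
std::uint32_t pack_trits(const Trit* t, std::size_t n) {
    assert(n <= kMaxTrits);
    std::uint32_t word = 0;
    for (std::size_t i = 0; i < n; ++i) word = word * 3 + t[i];
    return word;
}

// Inverse of pack_trits: peel off base-3 digits from the least significant end.
void unpack_trits(std::uint32_t word, Trit* out, std::size_t n) {
    assert(n <= kMaxTrits);
    for (std::size_t i = n; i-- > 0;) {
        out[i] = static_cast<Trit>(word % 3);
        word /= 3;
    }
}
```

Packing the seven-element example array with `pack_trits` gives 946, matching the hand calculation. Reading or changing a single element costs a chain of divisions by 3, which is the speed penalty the question already accepts.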

溺孤伤于心 2024-10-14 10:44:26

In theory, you can pack X N-state variables into

ceil(ln(N^X) / ln M)

M-state variables (that is, ceil(log_M (N^X)) in LaTeX-like notation). For storing X tri-state variables in binary digits (N = 3, M = 2), the formula above becomes:

ceil(ln(3^X) / ln 2)

In an 8-bit byte, for example, you could fit 5 tri-state variables, since 3^5 = 243 <= 256.

Unpacking or modifying those values becomes a lot harder and slower as you pack the variables more densely. In the example above you would have to recalculate the whole byte in order to change a single tri-state variable.

It should be noted that one byte for 5 tri-state variables is already quite space-efficient. The density stays at 5 per byte until you have a pack of 22 bytes, which can fit 111 tri-state values instead of 110. Handling that kind of packing would be a mess, though.
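
The capacity figures above can be checked with a small sketch. The function names are hypothetical, and the computation uses floating-point logarithms, so it is an approximation that should not be trusted for very large inputs where the result lands close to an integer boundary.

```cpp
#include <cmath>
#include <cstddef>

// Largest X with 3^X <= 2^bits, i.e. floor(bits / log2(3)).
std::size_t trits_that_fit(std::size_t bits) {
    return static_cast<std::size_t>(std::floor(bits / std::log2(3.0)));
}

// Smallest number of bits that can hold x tri-state values: ceil(x * log2(3)).
std::size_t bits_needed(std::size_t x) {
    return static_cast<std::size_t>(std::ceil(x * std::log2(3.0)));
}
```

For example, `trits_that_fit(8)` is 5, `trits_that_fit(21 * 8)` is 105 (still 5 per byte), and `trits_that_fit(22 * 8)` is 111, the first size at which a larger block gains an extra value.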

Is any of this worth the extra work in comparison to directly storing 4 tri-state values in a byte?

嗳卜坏 2024-10-14 10:44:26

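This solution requires you to know in advance how many "non-null" values there will be (i.e. at compile time, or because you can count the non-null values before allocating the space).

You can then encode the array as follows:

0 means null
1 means non-null, followed by a 1 or 0 for true or false.

This takes at most 2 bits per tribool, and only 1 bit each if they are all null.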
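
One way this variable-length code might look in practice is sketched below. `Tri`, `BitWriter`, `encode` and `decode` are hypothetical names, and the bit order within each byte (least significant bit first) is an arbitrary choice, not something the answer specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Tri { Null, False, True };

// Append-only bit buffer (illustrative helper, LSB-first within each byte).
struct BitWriter {
    std::vector<std::uint8_t> bytes;
    std::size_t nbits = 0;
    void put(bool b) {
        if (nbits % 8 == 0) bytes.push_back(0);
        if (b) bytes.back() |= static_cast<std::uint8_t>(1u << (nbits % 8));
        ++nbits;
    }
};

// Prefix code from the answer: "0" = null, "10" = false, "11" = true.
BitWriter encode(const std::vector<Tri>& in) {
    BitWriter w;
    for (Tri t : in) {
        w.put(t != Tri::Null);
        if (t != Tri::Null) w.put(t == Tri::True);
    }
    return w;
}

// Decode `count` tribools; the element count has to be known separately.
std::vector<Tri> decode(const BitWriter& w, std::size_t count) {
    std::vector<Tri> out;
    std::size_t pos = 0;
    auto get = [&] { bool b = (w.bytes[pos / 8] >> (pos % 8)) & 1u; ++pos; return b; };
    for (std::size_t i = 0; i < count; ++i) {
        if (!get()) out.push_back(Tri::Null);
        else out.push_back(get() ? Tri::True : Tri::False);
    }
    assert(pos <= w.nbits);
    return out;
}
```

With this code the total size depends on the data: N bits if everything is null, 2N bits if nothing is, and on average 5/3 bits per element when the three values are equally likely. Random access is lost, because the position of element i depends on all the elements before it.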

温馨耳语 2024-10-14 10:44:26

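@psmears is right for the case where all 3 values are equally likely.
However, if they are not equally likely, or not independent, and you have a long enough string, you could simply use the 2-bit encoding (or any other) and run gzip on it. That should compress it to roughly the theoretical limit.
In the limiting case where all the values are 0, for example, the result should not be much larger than the logarithm of the string length.

By the way, what we are talking about here is entropy. The simple definition in this case is -P(0)logP(0) - P(1)logP(1) - P(null)logP(null), with logarithms taken base 2. So, for example, if P(0) = P(1) = 1/2 and P(null) = 0, the entropy is 1 bit. If P(0) = 1/2, P(1) = 1/4, P(null) = 1/4, the entropy is 1/2 * 1 + 1/4 * 2 + 1/4 * 2 = 1.5 bits. If the probabilities are 1022/1024, 1/1024, 1/1024, the entropy is (almost 1)*(almost 0) + 10/1024 + 10/1024, which is roughly 20/1024, or about two hundredths of a bit! The more certain something is, the less it tells you when it happens, and so the less storage it needs.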
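
The entropy figures above can be reproduced with a short sketch. `entropy3` is a hypothetical helper name; it treats a zero-probability term as contributing 0, which is the usual convention.

```cpp
#include <cassert>
#include <cmath>

// Shannon entropy, in bits, of a three-outcome distribution.
// A term with probability 0 contributes 0 (the limit of p*log2(p) as p -> 0).
double entropy3(double p0, double p1, double pnull) {
    const double p[3] = {p0, p1, pnull};
    double h = 0.0;
    for (double x : p) {
        assert(x >= 0.0 && x <= 1.0);
        if (x > 0.0) h -= x * std::log2(x);
    }
    return h;
}
```

For three equally likely values this gives log2(3), about 1.585 bits, which is the per-element figure the base-3 packing approaches.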

烂柯人 2024-10-14 10:44:26

I like the solution proposed by @psmears, but its drawback is that it's slower than the direct approach. You can use a slightly modified version that should also be fast:

3**5 == 243, which is just under 256. This means that you can easily squeeze 5 tribool values into a byte. It has the same compression ratio (1.6 bits per tribool), but because each byte is independent, it can be implemented using LUTs:

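// Assumed table contents: LUT[num][pk] is base-3 digit `num` of pk.
// LUT must be filled once before use.
static unsigned char LUT[5][243];    // 5*243 bytes of LUTs
static const unsigned char POW3_LUT[5] = {1, 3, 9, 27, 81};

// Read tribool `num` (0..4) from the packed byte `pk` (0..242).
unsigned char get_packed_tribool(unsigned char pk, int num)
{
    return LUT[num][pk];
}

// Return the packed byte with tribool `num` replaced by `new_val` (0..2).
unsigned char update_packed_tribool(unsigned char old_pk, int num, int new_val)
{
    return old_pk + (new_val - LUT[num][old_pk]) * POW3_LUT[num];
}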
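
The answer does not show how the tables are filled or how a whole array is addressed. The following is one possible sketch: the table names mirror the snippet above, while `init_tribool_luts`, `get_trit`, `set_trit`, `array_get` and `array_set` are hypothetical, and the table contents are an assumption consistent with how the answer uses them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static const std::uint8_t POW3_LUT[5] = {1, 3, 9, 27, 81};
static std::uint8_t LUT[5][243];  // assumed: LUT[num][pk] = base-3 digit `num` of pk

// Fill the digit table once, before any lookups.
void init_tribool_luts() {
    for (int num = 0; num < 5; ++num)
        for (int pk = 0; pk < 243; ++pk)
            LUT[num][pk] = static_cast<std::uint8_t>((pk / POW3_LUT[num]) % 3);
}

std::uint8_t get_trit(std::uint8_t pk, int num) {
    assert(pk < 243 && num >= 0 && num < 5);
    return LUT[num][pk];
}

std::uint8_t set_trit(std::uint8_t pk, int num, int new_val) {
    assert(new_val >= 0 && new_val <= 2);
    return static_cast<std::uint8_t>(pk + (new_val - get_trit(pk, num)) * POW3_LUT[num]);
}

// Array-level access: tribool i lives in byte i / 5, digit i % 5.
int array_get(const std::uint8_t* buf, std::size_t i) {
    return get_trit(buf[i / 5], static_cast<int>(i % 5));
}

void array_set(std::uint8_t* buf, std::size_t i, int v) {
    buf[i / 5] = set_trit(buf[i / 5], static_cast<int>(i % 5), v);
}
```

An array of N tribools then needs ceil(N / 5) bytes, and each access touches exactly one byte, unlike a single large base-3 number where an update may ripple through the whole value.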
