
Shortening an integer array

Posted 2024-12-21 22:12:13

Just to avoid inventing hot-water, I am asking here...

I have an application with lots of arrays, and it is running out of memory.

So the thought is to compress the List<int> into something else that would have the same interface (IList<T>, for example), but instead of int I could use shorter integers.

For example, if my value range is 0 - 1,000,000, I need only log2(1,000,000) ≈ 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
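
The arithmetic above can be checked with a small helper. This is only an illustrative sketch: the class name `BitWidth` and both method names are made up here and are not part of the question.

```csharp
using System;

// Hypothetical helper, not from the original post.
public static class BitWidth
{
    // Smallest number of bits that can represent every value in 0..maxValue.
    public static int BitsNeeded(uint maxValue)
    {
        int bits = 1;                              // even the value 0 occupies one bit
        while ((maxValue >>= 1) != 0) bits++;      // one more bit per remaining binary digit
        return bits;
    }

    // Fraction of memory saved compared with a 32-bit int.
    public static double SavingVersusInt32(int bits) => (32 - bits) / 32.0;
}
```

For a maximum value of 1,000,000 this gives 20 bits, and `SavingVersusInt32(20)` gives 0.375, matching the 12/32 figure.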

Do you know of an implementation of such an array? C++ and Java would also be OK, since I could easily convert them to C#.

Additional requirements (since everyone is starting to talk me OUT of the idea):

integers in the list ARE unique
they have no special property, so they aren't compressible in any other way than by reducing the bit count
if the value range is one million, for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
the new data container should behave like a resizable array (regarding method O()-ness)

EDIT:

Please don't tell me NOT to do it. The requirement for this is well thought-over, and it is the ONLY option that is left.

Also, the 1M value range and the 20 bits for it are ONLY AN EXAMPLE. I have cases with all different ranges and integer sizes.

Also, I could have even shorter integers, for example 7 bit integers, then packing would be

for first 4 elements, packed into 5 bytes.

Almost done coding it - will be posted soon...


许一世地老天荒 2024-12-28 22:12:13

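Since you can only allocate memory in whole bytes, you are essentially asking whether and how you can fit your integers into 3 bytes instead of 4 (but see #3 below). This is not a good idea.

1. Since there is no 3-byte integer type, you would need to use something else (for example, an opaque 3-byte buffer) in its place. That requires wrapping all access to the list's contents in code that performs the conversion, so that you can still put "ints" in and pull "ints" out.
2. Depending on the architecture and the memory allocator, requesting 3-byte chunks might not affect your program's memory footprint at all (it might simply litter the heap with unusable 1-byte "holes").
3. Reimplementing the list from scratch to use an opaque byte array as its backing store would avoid the two previous problems (and it would also let you squeeze out every last bit of memory instead of just whole bytes), but it is a tall order and quite error-prone.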
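
To make the third point concrete, here is a minimal sketch of a list reimplemented over an opaque backing array. Everything in it is an assumption made for illustration: the name `PackedIntList`, the choice of `ulong[]` words rather than bytes, the limit of 32 bits per value, and the restriction to unsigned values. It does not implement `IList<T>`; it only shows the packing, constant-time indexing, and amortized constant-time `Add`.

```csharp
using System;

// Hypothetical sketch (not from the thread or any library):
// a growable list that stores each value in a fixed number of bits.
public sealed class PackedIntList
{
    private readonly int _bits;            // bits per element, 1..32
    private readonly ulong _mask;          // the low _bits bits set
    private ulong[] _words = new ulong[4]; // opaque backing store
    private int _count;

    public PackedIntList(int bitsPerValue)
    {
        if (bitsPerValue < 1 || bitsPerValue > 32)
            throw new ArgumentOutOfRangeException(nameof(bitsPerValue));
        _bits = bitsPerValue;
        _mask = (1UL << bitsPerValue) - 1;
    }

    public int Count => _count;

    // Amortized O(1): the backing array at least doubles when it runs out of room.
    public void Add(uint value)
    {
        CheckValue(value);
        long wordsNeeded = ((long)(_count + 1) * _bits + 63) / 64;
        if (wordsNeeded > _words.Length)
            Array.Resize(ref _words, (int)Math.Max(wordsNeeded, _words.Length * 2L));
        Write(_count, value);
        _count++;
    }

    // O(1) random access, as with a plain array.
    public uint this[int index]
    {
        get { CheckIndex(index); return Read(index); }
        set { CheckIndex(index); CheckValue(value); Write(index, value); }
    }

    private uint Read(int index)
    {
        long bitPos = (long)index * _bits;
        int word = (int)(bitPos >> 6);       // which 64-bit word
        int offset = (int)(bitPos & 63);     // bit position inside that word
        ulong v = _words[word] >> offset;
        if (offset + _bits > 64)             // element straddles two words
            v |= _words[word + 1] << (64 - offset);
        return (uint)(v & _mask);
    }

    private void Write(int index, uint value)
    {
        long bitPos = (long)index * _bits;
        int word = (int)(bitPos >> 6);
        int offset = (int)(bitPos & 63);
        // Clear the slot, then store the part that fits in the first word.
        _words[word] = (_words[word] & ~(_mask << offset)) | ((ulong)value << offset);
        int spill = offset + _bits - 64;     // bits that overflow into the next word
        if (spill > 0)
        {
            ulong highMask = (1UL << spill) - 1;
            _words[word + 1] = (_words[word + 1] & ~highMask)
                             | ((ulong)value >> (_bits - spill));
        }
    }

    private void CheckIndex(int index)
    {
        if (index < 0 || index >= _count)
            throw new ArgumentOutOfRangeException(nameof(index));
    }

    private void CheckValue(uint value)
    {
        if ((value & ~_mask) != 0)
            throw new ArgumentOutOfRangeException(nameof(value), "Value does not fit in the bit width.");
    }
}
```

With `new PackedIntList(20)`, every fourth element or so straddles a word boundary, which is exactly the error-prone part the answer warns about; the two-word read and write paths above are where such bugs would hide.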

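You might want to try something like the following instead:

Not keeping all this data in memory at the same time. At 4 bytes per int, you would need to reach hundreds of millions of integers before memory runs out. Why do you need all of them at the same time?
Compressing the dataset by not storing duplicates, if possible. With hundreds of millions of integers, there are bound to be some.
Changing your data structure so that it stores the differences (deltas) between successive values, if possible. This might not be very hard to achieve, but you can realistically expect only around a 50% improvement (which may not be enough), and it will completely destroy your ability to index into the list in constant time.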
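
The delta idea in the last suggestion could look roughly like the sketch below. It rests on assumptions the thread does not confirm: that the values may be stored in ascending order, and that a 7-bits-per-byte variable-length encoding is an acceptable way to store the differences. `DeltaCodec` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of delta storage; not code from the thread.
public static class DeltaCodec
{
    // Encodes an ascending sequence as variable-length differences
    // (7 data bits per byte, high bit set means "more bytes follow").
    public static byte[] Encode(IReadOnlyList<uint> sorted)
    {
        var bytes = new List<byte>();
        uint previous = 0;
        foreach (uint value in sorted)
        {
            if (value < previous)
                throw new ArgumentException("Input must be sorted ascending.", nameof(sorted));
            uint delta = value - previous;
            while (delta >= 0x80)
            {
                bytes.Add((byte)((delta & 0x7F) | 0x80));
                delta >>= 7;
            }
            bytes.Add((byte)delta);
            previous = value;
        }
        return bytes.ToArray();
    }

    // Decoding must walk the buffer from the start: element i cannot be
    // located without decoding elements 0..i-1 first.
    public static List<uint> Decode(byte[] data)
    {
        var result = new List<uint>();
        uint previous = 0, delta = 0;
        int shift = 0;
        foreach (byte b in data)
        {
            delta |= (uint)(b & 0x7F) << shift;
            if ((b & 0x80) != 0) { shift += 7; continue; }
            previous += delta;          // running sum restores the original value
            result.Add(previous);
            delta = 0;
            shift = 0;
        }
        return result;
    }
}
```

The sequential loop in `Decode` is the loss of constant-time indexing that the answer mentions; how much space is actually saved depends entirely on how close together the values are.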


最初的梦 2024-12-28 22:12:13

One option that will get you from 32 bits to 24 bits is to create a custom struct that stores an integer inside of 3 bytes:

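public struct Entry {
    byte b1; // low
    byte b2; // middle
    byte b3; // high

    // Stores the low 24 bits of x; anything above bit 23 (including the sign) is discarded.
    public void Set(int x) {
        b1 = (byte)x;
        b2 = (byte)(x >> 8);
        b3 = (byte)(x >> 16);
    }

    // Reassembles the three bytes into a non-negative int in the range 0..16,777,215.
    public int Get() {
        return (b3 << 16) | (b2 << 8) | b1;
    }
}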

You can then just create a List<Entry>.

var list = new List<Entry>();
var e = new Entry();
e.Set(12312);
list.Add(e);
Console.WriteLine(list[0].Get()); // outputs 12312
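
One possible variation, sketched here under my own assumptions rather than taken from the answer: an immutable struct that rejects values outside the 24-bit range instead of silently truncating them, with conversion operators so that call sites read more like a `List<int>`. The name `UInt24` is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant of the Entry struct above: immutable and range-checked.
public readonly struct UInt24
{
    private readonly byte _b1, _b2, _b3;   // low, middle, high

    public const int MaxValue = 0xFFFFFF;  // 16,777,215

    public UInt24(int value)
    {
        if (value < 0 || value > MaxValue)
            throw new ArgumentOutOfRangeException(nameof(value));
        _b1 = (byte)(value & 0xFF);
        _b2 = (byte)((value >> 8) & 0xFF);
        _b3 = (byte)((value >> 16) & 0xFF);
    }

    // Explicit on the way in, because the conversion can throw.
    public static explicit operator UInt24(int value) => new UInt24(value);

    // Implicit on the way out, because widening to int never loses data.
    public static implicit operator int(UInt24 v) => (v._b3 << 16) | (v._b2 << 8) | v._b1;
}

public static class UInt24Demo
{
    public static int RoundTrip(int value)
    {
        var list = new List<UInt24> { (UInt24)value };
        return list[0];                    // implicit conversion back to int
    }
}
```

The range check costs a comparison per write, so whether it is worth having depends on how much the caller trusts its inputs.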


山人契 2024-12-28 22:12:13

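This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8-bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to take a bunch of 10-bit or 20-bit integers and pack them into a buffer of 8-bit bytes.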

Source code is available for many libraries that handle a packed array of single bits (a, b, c, d, e).

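Perhaps you could
(a) download that source code and modify it (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10-bit or 20-bit integers rather than single bits.
It may take less programming and testing time to
(b) write a library that, from the outside, appears to act just like (a), but internally breaks each 20-bit integer up into 20 separate bits and then stores them using an (unmodified) BitArray class.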
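
Option (b) might be sketched as follows, using the .NET `System.Collections.BitArray` class as the unmodified bit store. The wrapper name `BitArrayIntStore`, the doubling growth policy, and the restriction to unsigned values of at most 32 bits are assumptions made for this illustration.

```csharp
using System;
using System.Collections;

// Hypothetical sketch of option (b): each value is split into individual
// bits, which are kept in an unmodified BitArray.
public sealed class BitArrayIntStore
{
    private readonly int _bits;                        // bits per element, 1..32
    private readonly BitArray _storage = new BitArray(0);
    private int _count;

    public BitArrayIntStore(int bitsPerValue)
    {
        if (bitsPerValue < 1 || bitsPerValue > 32)
            throw new ArgumentOutOfRangeException(nameof(bitsPerValue));
        _bits = bitsPerValue;
    }

    public int Count => _count;

    public void Add(uint value)
    {
        if (_bits < 32 && (value >> _bits) != 0)
            throw new ArgumentOutOfRangeException(nameof(value));

        // Grow the BitArray by doubling so repeated Adds stay amortized O(1).
        int needed = (_count + 1) * _bits;
        if (needed > _storage.Length)
            _storage.Length = Math.Max(needed, Math.Max(64, _storage.Length * 2));

        int start = _count * _bits;
        for (int i = 0; i < _bits; i++)
            _storage[start + i] = ((value >> i) & 1) != 0;   // least significant bit first
        _count++;
    }

    public uint this[int index]
    {
        get
        {
            if (index < 0 || index >= _count)
                throw new ArgumentOutOfRangeException(nameof(index));
            int start = index * _bits;
            uint value = 0;
            for (int i = 0; i < _bits; i++)
                if (_storage[start + i]) value |= 1u << i;
            return value;
        }
    }
}
```

Each read or write touches the BitArray once per bit, so this trades speed for simplicity: there is no word-boundary arithmetic to get wrong, but access costs a loop of `bitsPerValue` steps per element.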