缩短整数数组
为了避免发明热水,我在这里问...
我有一个包含大量数组的应用程序,但它的内存不足。
因此,我们的想法是将 List
压缩为其他具有相同接口的东西(例如 IList
),但不是 int
我可以使用更短的整数。
例如,如果我的值范围是 0 - 100.000.000,我只需要 ln2(1000000) = 20 位。因此,我可以删除多余的部分并将内存需求减少 12/32 = 37.5%,而不是存储 32 位。
你知道这样的数组的实现吗? c++ 和 java 也可以,因为我可以轻松地将它们转换为 c#。
附加要求(因为每个人都开始让我摆脱这个想法):
- 列表中的整数是唯一的,
- 减少位数
- 它们没有特殊的属性,因此它们不能以任何其他方式压缩,然后如果值范围是一百万,则 例如,列表的大小为 2 到 1000 个元素,但它们的数量会很多,因此 BitSets
- 新数据容器的行为不应像可调整大小的数组(关于方法 O()-ness)
编辑:
请不要告诉我不要这样做。对此的要求是经过深思熟虑的,它是剩下的唯一选项。
此外,1M 的值范围和 20 位只是一个示例。我的情况具有所有不同的范围和整数大小。
此外,我还可以有更短的整数,例如 7 位整数,然后将
00000001
11111122
22222333
33334444
444.....
前 4 个元素打包为 5 个字节。
几乎完成了编码 - 很快就会发布......
Just to avoid inventing hot-water, I am asking here...
I have an application with lots of arrays, and it is running out of memory.
So the thought is to compress the List<int>
to something else, that would have same interface (IList<T>
for example), but instead of int
I could use shorter integers.
For example, if my value range is 0 - 100.000.000 I need only ln2(1000000) = 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
Do you know of an implementation of such array. c++ and java would be also OK, since I could easily convert them to c#.
Additional requirements (since everyone is starting to getting me OUT of the idea):
- integers in the list ARE unique
- they have no special property so they aren't compressible in any other way then reducing the bit count
- if the value range is one million for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
- new data container should behave like re-sizable array (regarding method O()-ness)
EDIT:
Please don't tell me NOT to do it. The requirement for this is well thought-over, and it is the ONLY option that is left.
Also, 1M of value range and 20 bit for it is ONLY AN EXAMPLE. I have cases with all different ranges and integer sizes.
Also, I could have even shorter integers, for example 7 bit integers, then packing would be
00000001
11111122
22222333
33334444
444.....
for first 4 elements, packed into 5 bytes.
Almost done coding it - will be posted soon...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
由于您只能以字节量分配内存,因此您实际上是在问是否/如何将整数放入 3 个字节而不是 4 个字节(但请参见下面的#3)。这不是一个好主意。
您可能想尝试以下操作:
Since you can only allocate memory in byte quantums, you are essentially asking if/how you can fit the integers in 3 bytes instead of 4 (but see #3 below). This is not a good idea.
You might want instead to try something like:
将 32 位转换为 24 位的一种选择是创建一个自定义结构,用于存储 3 个字节内的整数:
然后您可以创建一个
List
。One option that will get your from 32 bits to 24bits is to create a custom struct that stores an integer inside of 3 bytes:
You can then just create a
List<Entry>
.这让我想起了 base64 和类似的二进制到文本编码。
它们采用 8 位字节,并进行一系列位调整,将它们打包成 4 位、5 位或 6 位可打印字符。
这也让我想起了Zork信息交换标准码(ZSCII),它把3个字母打包成2个字节,每个字母占5位。
听起来您想要获取一堆 10 位或 20 位整数并将它们打包到 8 位字节的缓冲区中。
源代码可用于许多处理单个位的打包数组的库
(a
b
c
d
e)。
也许你可以
(a) 下载该源代码并修改源代码(从某些 BitArray 或其他打包编码开始),重新编译以创建一个新的库,用于处理打包和解包 10 位或 20 位整数而不是单个位。
可能需要更少的编程和测试时间
(b) 编写一个库,从外部看,其行为与 (a) 类似,但在内部它将 20 位整数分解为 20 个单独的位,然后使用(未修改的)BitArray 类存储它们。
This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8 bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to taking a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.
The source code is available for many libraries that handle a packed array of single bits
(a
b
c
d
e).
Perhaps you could
(a) download that source code and modify the source (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10- or 20-bit integers rather than single bits.
It may take less programming and testing time to
(b) write a library that, from the outside, appears to act just like (a), but internally it breaks up 20-bit integers into 20 separate bits, then stores them using an (unmodified) BitArray class.
编辑:鉴于您的整数是唯一的,您可以执行以下操作:存储唯一的整数,直到您存储的整数数量是最大数量的一半。然后切换到存储您没有的整数。这将减少 50% 的存储空间。
在尝试使用 20 位整数之前,可能值得探索其他简化技术。
如何处理重复的整数?如果有大量重复的整数,您可以通过将整数存储在
Dictionary
中来减少存储大小,其中键是唯一的整数,并且值是相应的计数。请注意,这假设您不关心整数的顺序。您的整数都是唯一的吗?也许您存储了许多 0 到 1 亿范围内的唯一整数。在这种情况下,您可以尝试存储您没有的整数。然后,当确定您是否有一个整数
i
时,只需询问它是否不在您的集合中。Edit: Given that your integers are unique you could do the following: store unique integers until the number of integers you're storing is half the maximum number. Then switch to storing the integers you don't have. This will reduce the storage space by 50%.
Might be worth exploring other simplification techniques before trying to use 20-bit ints.
How do you treat duplicate integers? If have lots of duplicates you could reduce the storage size by storing the integers in a
Dictionary<int, int>
where keys are unique integers and values are corresponding counts. Note this assumes you don't care about the order of your integers.Are your integers all unique? Perhaps you're storing lots of unique integers in the range 0 to 100 mil. In this case you could try storing the integers you don't have. Then when determining if you have an integer
i
just ask if it's not in your collection.