Compressing sorted integers

Published 2024-07-13 11:49:49

I'm building an index which is just several sets of ordered 32-bit integers stored contiguously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compression scheme, but that's a bit outside my expertise. So I'm wondering: which compression algorithm would work best in this case? Also, decompression has to be fast, because this index will be used for lookups.

Comments (10)

旧人哭 2024-07-20 11:49:49

If you are storing integers which are close together (e.g. 1, 3, 4, 5, 9, 10, ...) rather than some random 32-bit integers (982346..., 3487623412..., etc.), you can do one thing:

Find the differences between adjacent numbers, which in our example would be 2, 1, 1, 4, 1, ..., and then Huffman-encode those differences.

I don't think Huffman encoding will work if you apply it directly to the original list of numbers.

But if you have a sorted list of nearby numbers, the odds are good that you will get a very good compression ratio by Huffman-encoding the differences, maybe a better ratio than with the LZW algorithm used in the Zip libraries.

Anyway, thanks for posting this interesting question.
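A minimal sketch of this delta + Huffman idea in Python, assuming the whole list fits in memory; huffman_code and compress_sorted are illustrative names, and a real implementation would pack the bit string into bytes and store the code table alongside the data:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # Degenerate case: one distinct symbol gets a 1-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 / 1.
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def compress_sorted(values):
    """Delta-encode a sorted list, then Huffman-encode the deltas."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    code = huffman_code(deltas)
    bits = "".join(code[d] for d in deltas)
    return code, bits

values = [1, 3, 4, 5, 9, 10, 14, 15, 16, 20]
code, bits = compress_sorted(values)
print(len(bits), "bits instead of", 32 * len(values))
```

Because the deltas of a sorted list of nearby numbers are few and repetitive, the Huffman code assigns them very short bitstrings, which is where the compression comes from.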

尬尬 2024-07-20 11:49:49

Are the integers grouped in a dense way or a sparse way?

By dense I'm referring to:

[1, 2, 3, 4, 42, 43, 78, 79, 80, 81]

By sparse I'm referring to:

[1, 4, 7, 9, 19, 42, 53, 55, 78, 80]

If the integers are grouped in a dense way, you could compress the first vector into three ranges:

[(1, 4), (42, 43), (78, 81)]

which is a 40% compression. Of course, this algorithm does not work well on sparse data, as the compressed form would take up 100% more space than the original data.
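A quick sketch of this range-collapsing idea in Python (to_ranges is a made-up name for illustration):

```python
def to_ranges(sorted_values):
    """Collapse a sorted list into (start, end) runs of consecutive integers."""
    ranges = []
    start = prev = sorted_values[0]
    for v in sorted_values[1:]:
        if v == prev + 1:
            prev = v  # still inside the current run
        else:
            ranges.append((start, prev))  # close the run, start a new one
            start = prev = v
    ranges.append((start, prev))
    return ranges

print(to_ranges([1, 2, 3, 4, 42, 43, 78, 79, 80, 81]))
# [(1, 4), (42, 43), (78, 81)]
```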

演多会厌 2024-07-20 11:49:49

As you've discovered, a sorted sequence of N 32-bit integers doesn't carry 32*N bits of information. This is no surprise: assuming no duplicates, for every sorted sequence there are N! unsorted sequences containing the same integers.

Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on using shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers and compressing those differences. They assume the result will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.

However, take the Fibonacci sequence. That's definitely a sorted list of integers, yet the difference between F(n+1) and F(n) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!

So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1]? We know that P(N[x+1] < N[x]) = 0, since the sequence is sorted. The differential/Huffman-based solutions work because they assume that P(N[x+1] - N[x] = d) is quite high for small positive d and independent of x, so they can use a few bits for the small differences. If you can provide another model, you can optimize for that.
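The Fibonacci point is easy to verify: the deltas of the sequence are the sequence itself shifted by one, so delta filtering gains nothing. A quick check in Python:

```python
# Generate all Fibonacci numbers that fit in 32 bits, then take their deltas.
fib = [1, 2]
while fib[-1] + fib[-2] < 2**32:
    fib.append(fib[-1] + fib[-2])
deltas = [b - a for a, b in zip(fib, fib[1:])]
print(deltas[:8])  # [1, 1, 2, 3, 5, 8, 13, 21] - the Fibonacci sequence again
```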

歌入人心 2024-07-20 11:49:49

If you need fast random-access lookup, then Huffman-encoding the differences (as suggested by Niyaz) is only half the story. You will probably also need some sort of paging/indexing scheme so that it is easy to extract the nth number.

If you don't, then extracting the nth number is an O(n) operation, as you have to read and Huffman-decode half the file before you can find the number you are after. You have to choose the page size carefully to balance the overhead of storing page offsets against the speed of lookup.
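Here is a toy sketch of such a paging scheme in Python, using raw deltas rather than Huffman bits to keep it readable; PAGE, build_paged and nth are hypothetical names. Each page keeps one absolute starting value, so a lookup decodes at most one page instead of half the file:

```python
PAGE = 128  # entries per page; tune to trade index overhead against decode cost

def build_paged(values):
    """Split the sorted list into pages of (absolute start, in-page deltas)."""
    pages = []
    for i in range(0, len(values), PAGE):
        chunk = values[i:i + PAGE]
        deltas = [b - a for a, b in zip(chunk, chunk[1:])]
        pages.append((chunk[0], deltas))
    return pages

def nth(pages, n):
    """Random access: decode only the page containing index n."""
    start, deltas = pages[n // PAGE]
    value = start
    for d in deltas[:n % PAGE]:
        value += d
    return value

values = list(range(0, 10000, 3))
pages = build_paged(values)
assert nth(pages, 777) == values[777]
```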

江挽川 2024-07-20 11:49:49

MSalters' answer is interesting, but it might distract you if you don't analyze properly: there are only 47 Fibonacci numbers that fit in 32 bits.

But he is spot on about how to solve the problem properly: by analyzing the series of increments to find patterns there to compress.

Things that matter: a) Are there repeated values? If so, how often? (If they matter, make them part of the compression; if not, treat them as an exception.) b) Does the data look quasi-random? That can also be fine, since a suitable average increment can likely be found.

倒数 2024-07-20 11:49:49

The conditions on the lists of integers are slightly different, but the question Compression for a unique stream of data suggests several approaches which could help you.

I'd suggest prefiltering the data into a start value and a series of offsets. If you know the offsets will reliably be small, you could even encode them as 1- or 2-byte quantities instead of 4 bytes. If you don't know this, each offset could still be 4 bytes, but since they will be small differences, you'll get many more repeats than you would by storing the original integers.

After prefiltering, run your output through the compression scheme of your choice - something that works at the byte level, like gzip or zlib, would probably do a really nice job.
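As a rough sketch of that pipeline in Python, using the standard zlib module and fixed 4-byte deltas (a real format might use narrower deltas); prefilter_and_compress is a made-up name:

```python
import struct
import zlib

def prefilter_and_compress(values):
    """Delta-encode sorted values, pack as little-endian uint32s, deflate."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    raw = struct.pack("<%dI" % len(deltas), *deltas)
    return zlib.compress(raw, 9)

values = list(range(0, 40000, 4))  # 10,000 sorted integers
packed = struct.pack("<%dI" % len(values), *values)
print(len(packed), "bytes ->", len(prefilter_and_compress(values)), "bytes")
```

The deltas repeat far more than the original integers do, which is exactly what a byte-level compressor like zlib exploits.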

雪花飘飘的天空 2024-07-20 11:49:49

I would imagine Huffman coding would be quite appropriate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).

EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However, if the list is not ordered, or the spacing of the numbers is very irregular, I think plain Huffman encoding would be no less effective. In fact, LZW or something similar would likely be best in that case, though possibly still not very good.)

风吹雨成花 2024-07-20 11:49:49

I'd use something bog-standard off the shelf before investing in your own scheme.

In Java, for example, you can apply gzip compression with GZIPOutputStream.

不美如何 2024-07-20 11:49:49

Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.
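A sketch of that layout in Python, with the obvious caveat that nothing guarantees every gap fits in 16 bits, so a real format would need an escape code for large gaps (pack_deltas_16 is a made-up name):

```python
import struct

def pack_deltas_16(values):
    """First value as a 32-bit int, then each gap as an unsigned 16-bit int."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if any(d > 0xFFFF for d in deltas):
        raise ValueError("gap too large for 16 bits; an escape code would be needed")
    return struct.pack("<I", values[0]) + struct.pack("<%dH" % len(deltas), *deltas)
```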

回忆凄美了谁 2024-07-20 11:49:49

A reliable and effective solution is to apply pcodec (https://github.com/mwlon/pcodec). pcodec automatically takes deltas when appropriate, then gets close to the Shannon entropy of a smooth distribution of those deltas. Its decompression speed is usually over 2 GB/s on a single core on good hardware.
