Best compression algorithm for a sequence of integers
I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive.
What would be the best algorithm to compress this?
I tried the deflate algorithm but that gives me only 50% compression.
Note that the algorithm cannot be lossy.
All numbers are unique and progressively increasing.
Also, if you can point me to a Java implementation of such an algorithm, that would be great.
We have written recent research papers that survey the best schemes for this problem. Please see:
Daniel Lemire and Leonid Boytsov, Decoding billions of integers per second through vectorization, Software: Practice & Experience 45 (1), 2015.
http://arxiv.org/abs/1209.2137
Daniel Lemire, Nathan Kurz, Leonid Boytsov, SIMD Compression and the Intersection of Sorted Integers, Software: Practice & Experience (to appear). http://arxiv.org/abs/1401.6399
They include an extensive experimental evaluation.
You can find a complete implementation of all techniques in C++11 online:
https://github.com/lemire/FastPFor and https://github.com/lemire/SIMDCompressionAndIntersection
There are also C libraries: https://github.com/lemire/simdcomp and https://github.com/lemire/MaskedVByte
If you prefer Java, please see https://github.com/lemire/JavaFastPFOR
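For instance, compressing a sorted array with JavaFastPFOR is a few lines; here is a minimal sketch based on my reading of the project's README (the `IntegratedIntCompressor` class applies delta coding before bit packing - verify the class and package names against the current release):

```java
import me.lemire.integercompression.differential.IntegratedIntCompressor;

import java.util.Arrays;

public class FastPforDemo {
    public static void main(String[] args) {
        // A sorted array with mostly-consecutive runs, like the question's data.
        int[] data = new int[151];
        int k = 0;
        for (int i = 1; i <= 100; i++) data[k++] = i;    // 1-100
        for (int i = 110; i <= 160; i++) data[k++] = i;  // 110-160

        IntegratedIntCompressor iic = new IntegratedIntCompressor();
        int[] compressed = iic.compress(data);    // delta-codes, then bit-packs
        int[] restored = iic.uncompress(compressed);

        System.out.println("ints in: " + data.length + ", ints out: " + compressed.length);
        System.out.println("lossless round trip: " + Arrays.equals(data, restored));
    }
}
```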
First, preprocess your list of values by taking the difference between each value and the previous one (for the first value, assume the previous one was zero). In your case this should give mostly a sequence of ones, which most compression algorithms can compress far more easily.
This is what the PNG format does to improve its compression (it applies one of several difference methods, then the same compression algorithm used by gzip).
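A minimal Java sketch of that preprocessing step (helper names are mine); the output of `delta` is what you would then feed to deflate or any other compressor:

```java
// Delta-encode: each element becomes the difference from its predecessor
// (the first element is a difference from an implicit zero).
static int[] delta(int[] sorted) {
    int[] out = new int[sorted.length];
    int prev = 0;
    for (int i = 0; i < sorted.length; i++) {
        out[i] = sorted[i] - prev;
        prev = sorted[i];
    }
    return out;
}

// Inverse transform: a running sum restores the original values exactly,
// so the scheme is lossless.
static int[] undelta(int[] deltas) {
    int[] out = new int[deltas.length];
    int prev = 0;
    for (int i = 0; i < deltas.length; i++) {
        prev += deltas[i];
        out[i] = prev;
    }
    return out;
}
```

For the question's example, the deltas are a long run of 1s with a single 10 at the gap, which deflate handles far better than the raw values.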
While you could design a custom algorithm specific to your stream of data, it's probably easier to use an off-the-shelf encoding algorithm. I ran a few tests of compression algorithms available in Java and found the following compression rates for a sequence of one million consecutive integers:
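Such a test can be reproduced with the JDK's built-in deflate; a rough sketch measuring compressed size against the raw 4-bytes-per-int representation:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class DeflateRatio {
    public static void main(String[] args) {
        // One million consecutive integers, as in the test described above.
        int n = 1_000_000;
        ByteBuffer raw = ByteBuffer.allocate(n * 4);
        for (int i = 1; i <= n; i++) raw.putInt(i);

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw.array());
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();

        System.out.printf("raw: %d bytes, deflated: %d bytes, ratio: %.3f%n",
                raw.capacity(), out.size(), (double) out.size() / raw.capacity());
    }
}
```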
Well, I'm voting for the smarter way. All you have to store is [int: start number][int/byte/whatever: number of iterations]; in this case, you'll turn your example array into four int values. After that you can compress as you want :)
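A sketch of that conversion in Java (helper name is mine): scan the sorted array once and emit (start, length) pairs, so 1-100 and 110-160 become just four ints:

```java
import java.util.ArrayList;
import java.util.List;

// Collapse a sorted array of unique ints into (start, runLength) pairs.
static List<int[]> toRanges(int[] sorted) {
    List<int[]> ranges = new ArrayList<>();
    int i = 0;
    while (i < sorted.length) {
        int start = sorted[i];
        int len = 1;
        // Extend the run while the next value is exactly one greater.
        while (i + len < sorted.length && sorted[i + len] == start + len) len++;
        ranges.add(new int[] { start, len });
        i += len;
    }
    return ranges;  // e.g. {1,100} and {110,51} for the question's example
}
```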
What size are the numbers? In addition to the other answers, you could consider base-128 variable-length encoding, which lets you store smaller numbers in single bytes while still allowing larger numbers. The MSB means "there is another byte" - this is described here.
Combine this with the other techniques so you are storing "skip size", "take size", "skip size", "take size" - but note that neither "skip" nor "take" will ever be zero, so we'll subtract one from each (which lets you save an extra byte for a handful of values).
So: the example 1-100, 110-160 is "skip 1" (assume a start at zero, as it makes things easier), "take 100", "skip 9", "take 51"; subtract 1 from each, giving (as decimals): 0, 99, 8, 60
which encodes as (hex): 00 63 08 3C
If we wanted to skip/take a larger number - 300, for example - we subtract 1, giving 299; but that goes over 7 bits, so starting with the little end, we encode blocks of 7 bits with an MSB to indicate continuation: 299 is 10 0101011 in binary,
so starting with the little end: 1010 1011 (low 7 bits, continuation bit set), then 0000 0010 (high bits, continuation bit clear),
giving: AB 02
So we can encode large numbers easily, but small numbers (which sound typical for skip/take) take less space.
You could try running this through "deflate", but it might not help much more...
If you don't want to deal with all that messy encoding cruft yourself... if you can create the integer array of the values (0,99,8,60) - you could use protocol buffers with a packed repeated uint32/uint64 - and it'll do all the work for you ;-p
I don't "do" Java, but here's a full C# implementation (borrowing some of the encoding bits from my protobuf-net project):
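For a Java flavour of the same idea, here's a minimal sketch of a base-128 varint writer applied to the skip/take values (class and helper names are mine):

```java
import java.io.ByteArrayOutputStream;

public class SkipTakeVarint {
    // Write one value in base-128: 7 payload bits per byte, little end first,
    // MSB set on every byte except the last ("there is another byte").
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        // Skip/take values for 1-100, 110-160, each minus one: 0, 99, 8, 60.
        int[] skipTake = { 0, 99, 8, 60 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : skipTake) writeVarint(out, v);
        for (byte b : out.toByteArray()) System.out.printf("%02X ", b);
        // prints: 00 63 08 3C  (four bytes for the whole array)
    }
}
```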
TurboPFor: Fastest Integer Compression
Compress the string "1-100, 110-160", or store the string in some binary representation and parse it to restore the array.
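A quick Java sketch of the parse-back step (assuming the string format shown above, with positive integers only):

```java
// Expand a string like "1-100, 110-160" back into the full sorted array.
static int[] parseRanges(String s) {
    java.util.List<Integer> out = new java.util.ArrayList<>();
    for (String part : s.split(",")) {
        String[] bounds = part.trim().split("-");
        int lo = Integer.parseInt(bounds[0]);
        int hi = Integer.parseInt(bounds[1]);
        for (int v = lo; v <= hi; v++) out.add(v);
    }
    return out.stream().mapToInt(Integer::intValue).toArray();
}
```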
I'd combine the answers given by CesarB and Fernando Miguélez.
First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.
Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.
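A combined sketch in Java (helper name is mine): delta-encode, then collapse the deltas into (value, count) runs:

```java
import java.util.ArrayList;
import java.util.List;

// Delta-encode, then run-length encode: the question's example
// (1-100, 110-160) collapses to the runs (1,100), (10,1), (1,50).
static List<int[]> deltaRle(int[] sorted) {
    List<int[]> runs = new ArrayList<>();
    int prev = 0;
    int runValue = 0, runCount = 0;
    for (int v : sorted) {
        int d = v - prev;   // delta from the previous value
        prev = v;
        if (runCount > 0 && d == runValue) {
            runCount++;     // extend the current run
        } else {
            if (runCount > 0) runs.add(new int[] { runValue, runCount });
            runValue = d;   // start a new run
            runCount = 1;
        }
    }
    if (runCount > 0) runs.add(new int[] { runValue, runCount });
    return runs;
}
```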
In addition to the other solutions:
You could find "dense" areas and use a bitmap to store them.
So for example:
If you have 1000 numbers in 400 ranges between 1000-3000, you could use a single bit to denote the existence of a number and two ints to denote the range. Total storage for this range is 2000 bits + 2 ints, so you can store that info in 254 bytes, which is pretty awesome since even short integers will take up two bytes each, so for this example you get 7X savings.
The denser the areas the better this algorithm will do, but at some point just storing start and finish will be cheaper.
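A sketch of the bitmap idea for a single dense region, using `java.util.BitSet` (the region bounds would be stored alongside as the two ints; helper names are mine):

```java
import java.util.BitSet;

// Encode a dense region [lo, hi] as two ints plus one presence bit per value.
static BitSet denseRegion(int[] sorted, int lo, int hi) {
    BitSet bits = new BitSet(hi - lo + 1);
    for (int v : sorted) {
        if (v >= lo && v <= hi) bits.set(v - lo);
    }
    return bits;  // persist with bits.toByteArray(), prefixed by lo and hi
}

// Decode: every set bit at index i corresponds to the value lo + i.
static int[] expand(BitSet bits, int lo) {
    return bits.stream().map(i -> i + lo).toArray();
}
```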
I know this is an old message thread, but I am including my personal PHP test of the SKIP/TAKE idea I found here. I'm calling mine STEP(+)/SPAN(-). Perhaps someone might find it helpful.
NOTE: I implemented the ability to allow duplicate integers as well as negative integers even though the original question involved positive, non-duplicated integers. Feel free to tweak it if you want to try and shave a byte or two.
CODE:
OUTPUT:
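Since PHP isn't everyone's thing, here's a rough Java sketch of one plausible STEP(+)/SPAN(-) encoding; the sign convention is my assumption, not necessarily the PHP version's, and it skips the duplicate/negative handling mentioned above:

```java
import java.util.ArrayList;
import java.util.List;

// STEP(+)/SPAN(-) sketch: a positive entry is a step from the previous value
// to the start of the next run; a negative entry -n extends the current run
// by n consecutive values. 1-100, 110-160 encodes as [1, -99, 10, -50].
static List<Integer> encode(int[] sorted) {
    List<Integer> out = new ArrayList<>();
    int prev = 0;
    int i = 0;
    while (i < sorted.length) {
        out.add(sorted[i] - prev);          // STEP to the run's first value
        int span = 0;
        while (i + 1 < sorted.length && sorted[i + 1] == sorted[i] + 1) {
            span++;
            i++;
        }
        if (span > 0) out.add(-span);       // SPAN of consecutive followers
        prev = sorted[i];
        i++;
    }
    return out;
}
```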
I'd suggest taking a look at Huffman Coding, a special case of Arithmetic Coding. In both cases you analyse your starting sequence to determine the relative frequencies of different values. More-frequently-occurring values are encoded with fewer bits than the less-frequently-occurring ones.
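A compact Java sketch of the frequency analysis and tree construction, most usefully applied to the delta values (a toy version; a real coder would also serialize the code table):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {
    static final class Node {
        final int freq; final Integer symbol; final Node left, right;
        Node(int f, Integer s, Node l, Node r) { freq = f; symbol = s; left = l; right = r; }
    }

    // Count symbol frequencies, then repeatedly merge the two rarest nodes.
    static Node buildTree(int[] symbols) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int s : symbols) freq.merge(s, 1, Integer::sum);
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        freq.forEach((sym, f) -> pq.add(new Node(f, sym, null, null)));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));
        }
        return pq.poll();
    }

    // Walk the tree: frequent symbols (like delta == 1) get short codes.
    static void printCodes(Node n, String prefix) {
        if (n.symbol != null) { System.out.println(n.symbol + " -> " + prefix); return; }
        printCodes(n.left, prefix + "0");
        printCodes(n.right, prefix + "1");
    }
}
```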
I couldn't get my compression to be much better than about .11 for this. I generated my test data via python interpreter and it's a newline delimited list of integers from 1-100, and 110-160. I use the actual program as a compressed representation of the data. My compressed file is as follows:
It's just a Haskell script that generates the file, which you can run via:
The size of the g.hs file is 54 bytes, and the python generated data is 496 bytes. This gives 0.10887096774193548 as the compression ratio. I think with more time one could shrink the program, or you could compress the compressed file (i.e. the haskell file).
One other approach might be to save 4 bytes of data: the min and max of each sequence, then give those to a generating function. However, loading from files would add more characters to the decompressor, adding complexity and more bytes. Again, I represented this very specific sequence with a program; it doesn't generalize - it's a compression scheme specific to the data. Furthermore, adding generality makes the decompressor larger.
Another concern is that one must have the Haskell interpreter to run this. When I compiled the program, it became much larger. I don't really know why. There is the same problem with python, so maybe the best approach is to give the ranges, so that some program could decompress the file.
The basic idea you should probably use is, for each range of consecutive integers (I will call these ranges), to store the starting number and the size of the range. For example, if you have a list of 1000 integers, but there are only 10 separate ranges, you can store a mere 20 integers (1 start number and 1 size for each range) to represent this data which would be a compression rate of 98%. Fortunately, there are some more optimizations you can make which will help with cases where the number of ranges is larger.
Store the offset of the starting number relative to the previous starting number, rather than the starting number itself. The advantage here is that the numbers you store will generally require fewer bits (this may come in handy in later optimization suggestions). Additionally, if you only stored the starting numbers, these numbers would all be unique, while storing the offsets gives a chance that the numbers are close or even repeat, which may allow even further compression with another method applied afterwards.
Use the minimum number of bits possible for both types of integers. You can iterate over the numbers to obtain the largest offset of a starting integer as well as the size of the largest range. You can then use a datatype that most efficiently stores these integers and simply specify the datatype or number of bits at the start of the compressed data. For example, if the largest offset of a starting integer is only 12,000, and the largest range is 9,000 long, then you can use a 2 byte unsigned integer for all of these. You could then cram the pair 2,2 at the start of the compressed data to show that 2 bytes is used for both integers. Of course you can fit this information into a single byte using a little bit of bit manipulation. If you are comfortable with doing a lot of heavy bit manipulation you could store each number as the minimum possible amount of bits rather than conforming to 1, 2, 4, or 8 byte representations.
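A small Java sketch of the width-selection step described above (helper names are mine): scan the offsets and sizes once, then derive the minimum byte width for each field:

```java
// Pick the smallest byte width (1, 2, or 4) that can hold every value
// of a field; the two chosen widths go in the header of the compressed data.
static int byteWidth(int maxValue) {
    if (maxValue <= 0xFF) return 1;
    if (maxValue <= 0xFFFF) return 2;
    return 4;
}

static int[] headerFor(int[] offsets, int[] sizes) {
    int maxOffset = 0, maxSize = 0;
    for (int o : offsets) maxOffset = Math.max(maxOffset, o);
    for (int s : sizes) maxSize = Math.max(maxSize, s);
    // e.g. max offset 12,000 and max range size 9,000 both fit in 2 bytes -> {2, 2}
    return new int[] { byteWidth(maxOffset), byteWidth(maxSize) };
}
```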
With those two optimizations, let's look at a couple of examples (each is 4,000 bytes):
WITHOUT OPTIMIZATIONS
WITH OPTIMIZATIONS
Your case is very similar to the compression of indices in search engines. Popular compression algorithms used there include PForDelta and Simple16. You can use the kamikaze library for your compression needs.
If you have series of repeated values, RLE is the easiest to implement and could give you a good result. Nonetheless, other more advanced algorithms that take entropy into account, such as LZW (which is now patent-free), can usually achieve much better compression.
You can take a look at these and other lossless algorithms here.