Better compression algorithm for vector data?
I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.
The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.
Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.
Is there another algorithm that would be able to compress this kind of data better? I'm using C++.
Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms, I'll just accept that as an answer.
6 Answers
You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.
General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with shorter dictionary indices.
But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.
As I understand it, you have a large number of data points of the form XXxxYYyy, where the upper-case letters are very uniform. So factor them out.
Rewrite the list as something similar to this:
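For instance, a rough sketch of that regrouping in C++ (the names and layout are mine, just to illustrate writing each shared high-byte combination once):

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Illustrative only: group points by the nearly constant top two bytes of
    // X and Y, so each XXYY combination is written once, followed by the
    // varying low halves of every point that shares it.
    struct Group {
        uint16_t xHigh, yHigh;                                  // shared top bytes
        std::vector<std::pair<uint16_t, uint16_t>> lowWords;    // per-point low halves
    };

    std::vector<Group> factorOut(const std::vector<std::pair<uint32_t, uint32_t>>& points) {
        std::map<std::pair<uint16_t, uint16_t>, Group> byHigh;
        for (const auto& p : points) {
            const auto key = std::make_pair(uint16_t(p.first >> 16), uint16_t(p.second >> 16));
            Group& g = byHigh[key];
            g.xHigh = key.first;
            g.yHigh = key.second;
            g.lowWords.emplace_back(uint16_t(p.first & 0xFFFF), uint16_t(p.second & 0xFFFF));
        }
        std::vector<Group> result;
        for (auto& kv : byHigh) result.push_back(std::move(kv.second));
        return result;                      // serialize this, then run it through zlib
    }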
Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry it occurs in. That adds up to a significant space saving.
Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.
Another approach might be, instead of storing these coordinates as absolute numbers, to write them as deltas: relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.
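A minimal sketch of that delta idea (here the reference point is simply the first point of the block; a centroid or other "central" location would follow the same pattern):

    #include <cstdint>
    #include <vector>

    struct Point { int32_t x, y; };

    // Store each point as a signed offset from a reference point.  For
    // spatially correlated data the offsets are small, so the high bytes turn
    // into long runs of 0x00/0xFF that a byte-oriented compressor handles well.
    std::vector<int32_t> toDeltas(const std::vector<Point>& pts) {
        std::vector<int32_t> deltas;
        if (pts.empty()) return deltas;
        const Point ref = pts.front();            // reference location for the block
        for (const Point& p : pts) {
            deltas.push_back(p.x - ref.x);
            deltas.push_back(p.y - ref.y);
        }
        return deltas;                             // serialize and compress this instead
    }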
Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.
http://www.7-zip.org/
Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:
0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte
Each position record may be stored in one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.
000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements
A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each, a 73% reduction in space).
A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).
Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow much of the data to pack better. The only real way to know, though, is to see what happens with your real data.
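As a concrete illustration, the group-header byte described above might be packed like this (the helper name and masks are mine; the bit positions follow the 0-1 / 2-4 / 6-7 layout given, with bit 5 unused):

    #include <cstdint>

    // Pack the 'mode' byte for a group of 1-4 records.
    //   bits 0-1: number of ID bytes borrowed from the previous record (0-3)
    //   bits 2-4: position record format (000-111 from the table above)
    //   bits 6-7: number of following records reusing this mode byte (0-3)
    inline uint8_t makeModeByte(unsigned idBytesBorrowed,
                                unsigned positionFormat,
                                unsigned followingRecords) {
        return static_cast<uint8_t>((idBytesBorrowed & 0x3u)
                                  | ((positionFormat & 0x7u) << 2)
                                  | ((followingRecords & 0x3u) << 6));
    }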
It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine BWT with algorithms other than zlib, though.
Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.
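For reference, a naive forward transform looks roughly like this (rotation sort, O(N log N) comparisons; a production version would use a suffix-array construction instead):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    // Naive Burrows-Wheeler transform: sort all rotations of the input and
    // emit the last column, plus the index of the row holding the original
    // string (needed for the inverse transform).
    std::pair<std::string, std::size_t> bwtForward(const std::string& input) {
        const std::size_t n = input.size();
        std::vector<std::size_t> rot(n);
        std::iota(rot.begin(), rot.end(), 0);
        std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
            for (std::size_t i = 0; i < n; ++i) {        // compare rotations lazily
                const char ca = input[(a + i) % n];
                const char cb = input[(b + i) % n];
                if (ca != cb) return ca < cb;
            }
            return false;
        });
        std::string lastColumn(n, '\0');
        std::size_t primaryIndex = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (rot[i] == 0) primaryIndex = i;           // row containing the original
            lastColumn[i] = input[(rot[i] + n - 1) % n];
        }
        return {lastColumn, primaryIndex};
    }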
Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.
You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.
As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode
abs(x)<<1 + (x < 0 ? 1 : 0)
).您可能想要将两个列表写入压缩文件:一个
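A small sketch of that, using the zigzag mapping above plus Elias gamma codes as one member of the universal-code family (the bit-writer and names are mine; a real implementation would pick the code to match the measured delta distribution):

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    struct BitWriter {
        std::vector<uint8_t> bytes;
        int used = 0;                                     // bits used in the last byte
        void putBit(unsigned b) {
            if (used == 0) bytes.push_back(0);
            if (b) bytes.back() |= uint8_t(1u << (7 - used));
            used = (used + 1) & 7;
        }
    };

    // Map a signed delta onto the non-negative integers, as suggested above.
    inline uint32_t zigzag(int32_t x) {
        return (uint32_t(std::abs(x)) << 1) | (x < 0 ? 1u : 0u);
    }

    // Elias gamma code: a unary length prefix followed by the value's binary
    // digits.  It is defined for v >= 1, so callers encode zigzag(x) + 1.
    void putGamma(BitWriter& w, uint32_t v) {
        int len = 0;
        for (uint32_t t = v; t > 1; t >>= 1) ++len;       // floor(log2(v))
        for (int i = 0; i < len; ++i) w.putBit(0);
        for (int i = len; i >= 0; --i) w.putBit((v >> i) & 1u);
    }

Small deltas then cost only a few bits each, which is exactly where the skewed distribution helps.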
You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that. This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.
If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).
Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.
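Going back to the two-list idea, the records might look something like this (names are placeholders, not a prescribed format):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Node {
        uint32_t id;
        int32_t  x, y;               // optionally relative to a per-file false origin
    };

    struct Link {
        uint32_t fromNode, toNode;                          // endpoints by node ID
        std::vector<std::pair<int32_t, int32_t>> shape;     // intermediate xy values
    };

    struct RoadBlock {
        int32_t originX, originY;    // false origin from the header record
        std::vector<Node> nodes;     // NodeList
        std::vector<Link> links;     // LinkList
    };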