当前位置：文江博客话题详情

更好的矢量数据压缩算法？

发布于 2024-09-06 20:50:43 字数 404 浏览 1 评论 0原文

我需要压缩一些空间相关的数据记录。目前，我使用 zlib 获得了 1.2x-1.5x 的压缩，但我认为应该可以达到 2x 左右。数据记录有各种字段，但例如，zlib 似乎在压缩点列表时遇到问题。

这些点代表道路网络。它们是 XXXXYYYY 形式的定点 4 字节整数对。通常，如果单个数据块有 100 个点，则 X 和 Y 的前两个字节（空间相关性）只会有几种组合。但底部字节总是在变化，并且对于 zlib 来说必须看起来像随机数据。

类似地，记录具有 4 字节 ID，往往具有恒定的高字节和可变的低字节。

是否有另一种算法能够更好地压缩此类数据？我正在使用 C++。

编辑：请不要再提出更改数据本身的建议。我的问题是关于自动压缩算法。如果有人有所有流行压缩算法概述的链接，我会接受它作为答案。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

蓦然回首 2024-09-13 20:50:44

如果您尝试根据您对数据结构的了解自行压缩数据，您可能会获得更好的结果。

通用压缩算法只是将您的数据视为比特流。他们寻找常用的位序列，并用较短的字典索引替换它们。

但重复的数据并没有消失。重复的序列变得更短，但它仍然像以前一样频繁地重复。

据我了解，您有大量

XXxxYYyy 形式的数据点，其中大写字母非常统一。所以把它们排除掉。

将列表重写为与此类似的内容：

XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte)
xxyy
xxyy
...

现在，很少变化的字节的每个组合仅列出一次，而不是为它们出现的每个条目重复列出。这总共可以节省大量空间。

基本上，在通过 zlib 运行之前尝试自己删除重复数据。您可以做得更好，因为您对数据有了更多了解。

另一种方法可能是，不将这些坐标存储为绝对数字，而是将它们写为增量，即与选择尽可能接近所有条目的某个位置的相对偏差。您的增量将是较小的数字，可以使用更少的位来存储。

You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.

General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with a shorter dictionary indices.

But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.

As I understand it, you have a large number of data points of the form

XXxxYYyy, where the upper-case letters are very uniform. So factor them out.

Rewrite the list as something similar to this:

XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte)
xxyy
xxyy
...

Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.

Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.

Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.

回复收藏 0 原文

抹茶夏天i‖ 2024-09-13 20:50:44

不特定于您的数据，但如果可以的话，我建议您检查 7zip 而不是 zlib。我已经看到使用它的压缩比非常好。

http://www.7-zip.org/

回复收藏 0 原文

未蓝澄海的烟 2024-09-13 20:50:44

在没有看到数据及其确切分布的情况下，我无法确定最好的方法是什么，但我建议您以一个字节开始每组 1-4 条记录，该字节的 8 位表示以下内容：

0-1 数字应从先前记录借用的 ID 字节数
2-4 持仓记录格式
6-7 使用相同“模式”字节的后续记录的数量

每个位置记录可以以八种方式之一存储；除 000 之外的所有类型都使用有符号位移。位代码后面的数字是位置记录的大小。

000 - 8 - 两个完整的四字节位置
001 - 3 - X 和 Y 的十二位
010 - 2 - 十位 X 和六位 Y
011 - 2 - 六位X和十位Y
100 - 4 - 两个十六位有符号位移
101 - 3 - 16 位 X 和 8 位 Y 有符号位移
110 - 3 - X 的八位有符号位移； Y 为 16 位
111 - 2 - 两个八位有符号位移

模式字节为零将存储适用于某个点的所有信息，而不参考任何先前的点，总共使用 13 个字节来存储 12 个字节的有用信息。其他模式字节将允许根据与先前记录的相似性来压缩记录。如果四个连续记录仅 ID 的最后一位不同，并且 X 和 Y 均在前一条记录的 +/- 127 范围内，或者 X 在 +/- 31 范围内且 Y 在 +/- 511 范围内，或者 X 在+/- 511 且 Y 在 +/- 31 之内，则所有四个记录可以存储在 13 个字节中（平均每个记录 3.25 个字节（空间减少 73%）。

可以使用“贪婪”算法进行压缩：检查一条记录以查看在输出中必须使用什么大小的 ID 和 XY，然后再获取最多三个记录，直到发现一条记录无法与使用所选大小的先前记录“匹配”，或者可以写得更小（请注意，如果第一个记录的 X 和 Y 位移都等于 12，则 XY 将用两个字节写入，但直到读取后续记录为止，我们不知道要使用三种两字节格式中的哪一种）。

在确定格式之前，我建议通过它来运行数据。小的调整（例如使用 7+9 或 5+11 位格式而不是 6+10）可能会允许打包许多数据。更好的。然而，唯一真正了解的方法是看看你的真实数据会发生什么。

Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:

0-1 Number of bytes of ID that should be borrowed from previous record
2-4 Format of position record
6-7 Number of succeeding records that use the same 'mode' byte

Each position record may be stored one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.

000 - 8 - Two full four-byte positions
001 - 3 - Twelve bits for X and Y
010 - 2 - Ten-bit X and six-bit Y
011 - 2 - Six-bit X and ten-bit Y
100 - 4 - Two sixteen-bit signed displacements
101 - 3 - Sixteen-bit X and 8-bit Y signed displacement
110 - 3 - Eight-bit signed displacement for X; 16-bit for Y
111 - 2 - Two eight-bit signed displacements

A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each (a 73% reduction in space).

A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).

Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.

回复收藏 0 原文