Compression of ASCII strings in C

I have some C code that stores ASCII strings in memory as a four byte length followed by the string. The string lengths are in the range 10-250 bytes.

To reduce occupancy I'd like to compress each string individually on the fly, still storing the length (of the compressed string) followed by the compressed string.

I don't want to compress at a larger scope than individual strings because any string can be read/written at any time.

What libraries/algorithms are available for doing this?

Thanks for your help.
NickB
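
A minimal sketch of the layout described above, for reference; the helper name and the native byte order are illustrative assumptions, not the questioner's code:

#include <stdint.h>
#include <string.h>

/* Each record is a 4-byte length followed immediately by the string bytes
 * (no terminating '\0'). With compression, the length would instead cover
 * the compressed bytes. */
size_t write_record(unsigned char *dst, const char *s)
{
    uint32_t len = (uint32_t)strlen(s);
    memcpy(dst, &len, sizeof len);      /* 4-byte length prefix */
    memcpy(dst + sizeof len, s, len);   /* string bytes follow */
    return sizeof len + len;
}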

つ低調成傷 2024-08-02 21:01:56

zlib is always at your service - it has very little overhead for the cases when the string contains incompressible data, it's relatively fast, free, and can be easily integrated into C and C++ programs.
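
A minimal sketch of compressing one string with zlib's one-shot helpers; the 4-byte little-endian length prefix and the name pack_string are illustrative assumptions, not part of the question's code:

#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compress one string and return a malloc'd buffer holding a 4-byte
 * compressed length followed by the compressed bytes. */
unsigned char *pack_string(const char *s, uLong *out_total)
{
    uLong src_len = (uLong)strlen(s);
    uLongf dst_len = compressBound(src_len);      /* worst-case output size */
    unsigned char *buf = malloc(4 + dst_len);
    if (!buf) return NULL;

    if (compress2(buf + 4, &dst_len, (const Bytef *)s, src_len,
                  Z_BEST_COMPRESSION) != Z_OK) {
        free(buf);
        return NULL;
    }
    /* store the compressed length in the first 4 bytes (little-endian here) */
    buf[0] = (unsigned char)(dst_len & 0xff);
    buf[1] = (unsigned char)((dst_len >> 8) & 0xff);
    buf[2] = (unsigned char)((dst_len >> 16) & 0xff);
    buf[3] = (unsigned char)((dst_len >> 24) & 0xff);
    *out_total = 4 + dst_len;
    return buf;
}

Note that zlib's default framing adds a 2-byte header and a 4-byte Adler-32 trailer per string, which matters at these sizes, and that decompression with uncompress() needs the original uncompressed length stored or bounded somewhere.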

莫多说 2024-08-02 21:01:56

Most compression algorithms don't work very well with short strings. Here are a few compression algorithms that are designed to compress short English text strings. While they can handle any arbitrary byte in the plaintext string, such bytes often make the "compressed" data longer than the plaintext. So it's a good idea for the compressor to store "uncompressible" data unchanged and set a "literal" flag on such data (as Steve Jessop suggested).

  • "base 40 encoding": maximum compression 3:2
  • "Zork Standard Code for Information Interchange" (ZSCII): maximum compression 3:2
  • byte pair compression: maximum compression 2:1
  • a static Huffman table shared among all the strings (as suggested by cygil).
    • ideally, formed from the exact character frequencies of all of your actual data.
  • Varicode: maximum compression 2:1
  • PalmDoc compression (byte pair compression + a simple variant of LZ77).
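
As a concrete illustration of the "base 40 encoding" item, a minimal packing sketch; the 40-symbol alphabet, the case-folding to lowercase, and the big-endian word layout are all assumptions, and a real encoder would also need the matching decoder plus a literal fallback for strings containing other bytes:

#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 40-symbol alphabet: space, a-z, 0-9, and three punctuation marks. */
static const char ALPHABET[] = " abcdefghijklmnopqrstuvwxyz0123456789.,-";

static int sym_index(char c)
{
    const char *p = memchr(ALPHABET, tolower((unsigned char)c), 40);
    return p ? (int)(p - ALPHABET) : -1;      /* -1: not representable */
}

/* Pack 3 symbols into one 16-bit word: 40^3 = 64000 fits in 16 bits,
 * giving the 3:2 ratio. Returns bytes written, or -1 if the text contains
 * a symbol outside the alphabet. */
int base40_pack(const char *src, size_t len, uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i += 3) {
        unsigned word = 0;
        for (size_t j = 0; j < 3; j++) {
            int v = (i + j < len) ? sym_index(src[i + j]) : 0;  /* pad with spaces */
            if (v < 0) return -1;
            word = word * 40 + (unsigned)v;
        }
        dst[out++] = (uint8_t)(word >> 8);    /* big-endian 16-bit word */
        dst[out++] = (uint8_t)(word & 0xff);
    }
    return (int)out;
}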

何时共饮酒 2024-08-02 21:01:56

Why use a 4-byte length when strings are 10-250 bytes long? Using a 1-byte length will save you 3 bytes per string by itself.

Is the data text only, i.e. 0-9, A-Z, a-z, or some subset? If so, re-encode it to use that subset and save a few bits per character (a minimal sketch of this follows below).
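
A sketch of that re-encoding idea, assuming the data fits a 64-symbol subset so each character needs only 6 bits (4 characters pack into 3 bytes); the particular character-to-code mapping here is an assumption:

#include <stdint.h>
#include <string.h>

/* Map a character to its 0-63 index in an assumed 64-symbol subset. */
static int to_code(char c)
{
    if (c >= '0' && c <= '9') return c - '0';          /*  0-9  */
    if (c >= 'A' && c <= 'Z') return 10 + (c - 'A');   /* 10-35 */
    if (c >= 'a' && c <= 'z') return 36 + (c - 'a');   /* 36-61 */
    if (c == ' ') return 62;
    if (c == '.') return 63;
    return -1;                                         /* not in subset */
}

/* Pack len characters into dst using 6 bits each; returns bytes written
 * or -1 if a character falls outside the subset. dst needs (len*6+7)/8 bytes. */
long pack6(const char *src, size_t len, uint8_t *dst)
{
    uint32_t acc = 0;   /* bit accumulator */
    int bits = 0;       /* number of pending bits in acc */
    long out = 0;

    for (size_t i = 0; i < len; i++) {
        int code = to_code(src[i]);
        if (code < 0) return -1;
        acc = (acc << 6) | (uint32_t)code;
        bits += 6;
        while (bits >= 8) {                /* emit full bytes as they fill up */
            bits -= 8;
            dst[out++] = (uint8_t)(acc >> bits);
        }
    }
    if (bits > 0)                          /* flush remaining bits, zero-padded */
        dst[out++] = (uint8_t)(acc << (8 - bits));
    return out;
}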

Now have a look at http://gnosis.cx/publish/programming/compression_primer.html, particularly the Huffman encoding and Lempel-Ziv sections.

That should get you started.

优雅的叶子 2024-08-02 21:01:56

I am not sure that the zlib or LZW compression approaches will work well in the case of individually compressing short strings of less than 250 bytes. Both typically require creating a fairly sizable dictionary before significant compression gains are seen.

Perhaps simple Huffman coding with a fixed encoding tree, or one shared between all instances of the strings? Also, have you seen the ZSCII encoding used to compress short strings on memory-constrained microcomputers in the 80s?

南汐寒笙箫 2024-08-02 21:01:56

Zlib is definitely your friend here, but be sure to perform a few tests to detect the average string length at which compression starts to be beneficial, because of the small overhead of compression headers.

For example, you might discover that under 20 characters the compressed string is actually bigger, and therefore decide to compress only the longer strings.
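
A minimal sketch of that measurement-driven fallback, assuming zlib and an illustrative one-byte flag in front of each record to mark raw versus compressed storage:

#include <string.h>
#include <zlib.h>

/* Store s into dst: one flag byte (0 = raw, 1 = deflated) followed by the data.
 * dst must have room for 1 + compressBound(len) bytes. Falls back to raw
 * storage whenever compression does not actually shrink the string.
 * Returns the number of payload bytes written after the flag byte. */
uLong store_maybe_compressed(const unsigned char *s, uLong len, unsigned char *dst)
{
    uLongf clen = compressBound(len);
    if (compress2(dst + 1, &clen, s, len, Z_DEFAULT_COMPRESSION) == Z_OK
            && clen < len) {
        dst[0] = 1;               /* compressed record */
        return clen;
    }
    dst[0] = 0;                   /* raw record: copy the bytes unchanged */
    memcpy(dst + 1, s, len);
    return len;
}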

墨小沫ゞ 2024-08-02 21:01:56

When using multiple strings like this, it is possible to avoid the per-string pointer overhead (4 or 8 bytes each) by concatenating them together with \0s (1 byte each) and using a lookup function.

#include <stdio.h>

static const char strings[]="hello\0world\0test";

/* return a pointer to the nth string (0-based) in the '\0'-separated pool */
const char *nthstring(const char *s, unsigned n){
    while(n--)
        while(*s++)
            ;               /* skip forward past the next '\0' */
    return s;
}
int main(void) {
    printf("%s\n",nthstring(strings,1));
    return 0;
}

However, if the string length is less than UCHAR_MAX, you can optimize the lookup by using the zero-byte placeholders to store lengths (plus 1 extra length byte at the beginning). This costs only 1 additional data byte but saves a lot of conditional jumps and increments in the lookup function.

#include <stdio.h>
/* each "string" is prefixed with its octal length */
static const char lenstrings[]="\05hello\05world\04test";

/* return a pointer to the length byte of the nth length-prefixed string */
const char *ithstring(const char *s, unsigned n){
    while(n--){
        s += *s + 1;        /* skip the length byte plus that many characters */
    }
    return s;
}
int main(void) {
    const char *s = ithstring(lenstrings,1);
    /* use the length because we don't have terminating \0 */
    printf ("%.*s",(unsigned char)*s,s+1);
    //write(1,s+1,(unsigned char)*s); //POSIX variation via <unistd.h>
    return 0;
}

For both variations it is better to keep the most often needed strings first; however, the second method will allow you to use compressed data (pick whichever works best for your data - David Cary's answer has a list of workable solutions) as long as you adjust the length separators to the compressed length.

Note: To get the maximum compression out of standard compressors, you will likely want to modify the length field of their headers to be unsigned char (or unsigned short if string lengths exceed 256 but not 65536 bytes), as most of them try to support compression of large files (this could save 3-7 bytes per string).
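
Assuming zlib is the compressor, one concrete way to shave fixed per-string overhead without patching header formats is to request raw deflate output (negative windowBits), which drops the 2-byte zlib header and the 4-byte Adler-32 trailer and leaves framing entirely to your own length prefix. A minimal sketch:

#include <string.h>
#include <zlib.h>

/* Deflate src into dst as a raw stream (no zlib header, no Adler-32 trailer),
 * saving 6 bytes per string compared with the default zlib wrapping.
 * dst_cap should be at least deflateBound()-sized; returns bytes written or -1. */
long raw_deflate(const unsigned char *src, unsigned len,
                 unsigned char *dst, unsigned dst_cap)
{
    z_stream zs;
    memset(&zs, 0, sizeof zs);
    /* windowBits = -15 selects raw deflate with a 32 KB window */
    if (deflateInit2(&zs, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;

    zs.next_in   = (Bytef *)src;
    zs.avail_in  = len;
    zs.next_out  = dst;
    zs.avail_out = dst_cap;

    int rc = deflate(&zs, Z_FINISH);       /* one-shot: compress everything */
    long out = (rc == Z_STREAM_END) ? (long)zs.total_out : -1;
    deflateEnd(&zs);
    return out;
}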
