Compressing short English strings in Python?

Posted 2024-12-14 08:57:29


I would like to fit 80M strings of length < 20 characters in memory and use as little memory as possible.

I would like a compression library that I can drive from Python, that will allow me to compress short (<20 char) English strings. I have about 80M of them, and I would like them to fit in as little memory as possible.

I would like maximum lossless compression. CPU time is not the bottleneck.

I don't want the dictionary stored with each string, because that would be high overhead.

I want to compress to <20% of the original size. This is plausible, given that an upper bound on the entropy of English is 1.75 bits per character (Brown et al., 1992, http://acl.ldc.upenn.edu/J/J92/J92-1002.pdf), i.e. roughly 22% of the original 8 bits per character (1.75/8).

Edit:

I can't use zlib because the header is too large. (If a string is only 20 bytes to begin with, there can be essentially no per-string header if the compression is to be worthwhile. The zlib header is 200 bytes according to Roland Illing; I haven't double-checked, but I know it's bigger than 20.)

Huffman coding sounds nice, except that it is based on individual tokens and can't exploit n-grams (multiple characters).

smaz has a crappy dictionary, and compresses to only 50%.

I strongly prefer to use existing code, rather than implement a compression algorithm.


Comments (5)

伊面 2024-12-21 08:57:29


I don't want the dictionary stored with each string, because that would be high overhead.

So build a single string with all of the desired contents, and compress it all at once with whichever solution. This solves the "header is too large" problem as well.

You can do this in a variety of ways. Probably the simplest is to create the repr() of a list of the strings; or you can use the pickle, shelve or json modules to create some other sort of serialized form.
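For instance, a minimal sketch of that idea using only the standard library (the sample strings here are made up for illustration):

```python
import pickle
import zlib

# Illustrative data; in practice this would be the ~80M short strings.
strings = ["the quick brown fox", "jumps over", "the lazy dog"]

# Serialize the whole list once, then compress it as a single stream,
# so the compression overhead is paid only once rather than per string.
blob = zlib.compress(pickle.dumps(strings), 9)

# To read anything back, decompress and unpickle the whole blob.
restored = pickle.loads(zlib.decompress(blob))
assert restored == strings
```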

家住魔仙堡 2024-12-21 08:57:29


Make a dictionary of all words. Then, convert all words to numbers corresponding to the offset in the dictionary. If needed, you can use the first bit to indicate that the word is capitalized.
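A rough sketch of that word-indexing idea; the tiny corpus, the choice of the lowest bit as the capitalization flag, and the helper names are assumptions for illustration:

```python
# Build a vocabulary of all (lower-cased) words, then store each word as an
# integer index. The lowest bit of each code signals capitalization.
strings = ["The cat sat", "the mat", "Cat naps"]

words = sorted({w.lower() for s in strings for w in s.split()})
index = {w: i for i, w in enumerate(words)}

def encode(sentence):
    # Each word becomes (index << 1) | capitalized-flag.
    return [(index[w.lower()] << 1) | w[0].isupper() for w in sentence.split()]

def decode(codes):
    out = []
    for c in codes:
        w = words[c >> 1]
        out.append(w.capitalize() if c & 1 else w)
    return " ".join(out)

assert decode(encode("The cat sat")) == "The cat sat"
```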

隐诗 2024-12-21 08:57:29


How about using zipfile from the standard library?
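A small sketch of what that could look like; the in-memory archive and the member name are made up, and all strings go into a single member so the per-entry ZIP overhead is paid only once:

```python
import io
import zipfile

strings = ["the quick brown fox", "jumps over", "the lazy dog"]

buf = io.BytesIO()
# ZIP_DEFLATED uses zlib under the hood.
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("strings.txt", "\n".join(strings))

print(len(buf.getvalue()), "bytes in the archive")
```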

听风念你 2024-12-21 08:57:29


There are no more than 128 different characters in English strings. Hence you can describe each character with a 7-bit code. See Compressing UTF-8 (or other 8-bit encoding) to 7 or fewer bits.
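A minimal sketch of packing 7-bit ASCII codes into bytes (the function names are made up for illustration); 8 characters fit into 7 bytes, i.e. a fixed 12.5% saving:

```python
def pack7(text):
    # Concatenate the 7-bit codes of ASCII characters into one big integer,
    # then dump it as bytes.
    bits = 0
    for ch in text:
        code = ord(ch)
        assert code < 128, "7-bit packing only works for plain ASCII"
        bits = (bits << 7) | code
    return len(text), bits.to_bytes((7 * len(text) + 7) // 8, "big")

def unpack7(length, packed):
    bits = int.from_bytes(packed, "big")
    return "".join(
        chr((bits >> (7 * (length - 1 - i))) & 0x7F) for i in range(length)
    )

n, packed = pack7("hello world")
assert unpack7(n, packed) == "hello world"
print(n, "chars ->", len(packed), "bytes")  # 11 chars -> 10 bytes
```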

月亮邮递员 2024-12-21 08:57:29


First, if you compress each 20-byte string individually, your compression ratio will be miserable. You need to compress a lot of strings together to see any tangible benefit.

Second, 80M strings is a lot, and if you have to decompress them all to extract a single one of them, you'll be displeased by performance. Chunk your input into smaller but still large enough blocks. A typical value would be 64KB, translating into 3200 strings.

Then, you can compress each 64KB block independently. When you need to access a single string in a block, you have to decode the entire block.

So there is a trade-off between compression ratio (which prefers larger blocks) and random-access speed (which prefers smaller blocks). You'll have to judge which balance suits your workload best.

Quick note: random access on an in-memory structure usually favors fast compression algorithms rather than strong ones. If you compress only once but random-access many times, prefer a highly asymmetric algorithm, such as LZ4-HC:
http://code.google.com/p/lz4hc/

According to benchmarks, compression speed is only about 15 MB/s, but decoding speed is about 1 GB/s. That translates to roughly 16K blocks of 64 KB decoded per second...
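A sketch of that block scheme, using zlib from the standard library as a stand-in for LZ4 (the block size, helper names, and sample data are assumptions; the idea carries over to any compressor with a compress/decompress pair):

```python
import zlib

BLOCK_SIZE = 64 * 1024  # ~64 KB of raw text per block, roughly 3200 short strings

def build_blocks(strings):
    """Group strings into ~64 KB blocks and compress each block independently."""
    blocks, index = [], []          # index[i] = (block_no, offset_in_block)
    current, current_size = [], 0
    for s in strings:
        if current and current_size + len(s) > BLOCK_SIZE:
            blocks.append(zlib.compress("\n".join(current).encode(), 9))
            current, current_size = [], 0
        index.append((len(blocks), len(current)))
        current.append(s)
        current_size += len(s) + 1
    if current:
        blocks.append(zlib.compress("\n".join(current).encode(), 9))
    return blocks, index

def lookup(blocks, index, i):
    """Random access: decompress only the block containing string i."""
    block_no, offset = index[i]
    return zlib.decompress(blocks[block_no]).decode().split("\n")[offset]

strings = ["string number %d" % i for i in range(10000)]
blocks, index = build_blocks(strings)
assert lookup(blocks, index, 4242) == "string number 4242"
```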
