Compressing strings with common parts
I have an application that manages a large number of strings. The strings are in a path-like format and have many common parts, but without a clear rule. They are not paths on the file system, but they can be thought of as such.
I clearly need to optimize memory consumption, but without a big performance sacrifice.
I am considering 2 options:
- implement a compressed_string class that stores the data zipped, but I need a fixed dictionary and I can't find a library for that right now. I don't want Huffman on bytes, I want it on words (a rough sketch of what I mean follows below).
- implement some kind of flyweight pattern on the string parts.
The problem looks like a common one, and I'm wondering what the best solution to it is, or whether someone knows of a library that targets this issue.
thanks
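To show what I mean by the first option, here is a rough sketch (plain dictionary indices instead of real Huffman codes; compressed_string is just the name from above, everything else is made up and untested):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical shared dictionary of path components ("words").
// Assumes fewer than 65536 distinct components.
struct WordDictionary {
    std::unordered_map<std::string, uint16_t> to_id;
    std::vector<std::string> to_word;

    uint16_t intern(const std::string& w) {
        auto it = to_id.find(w);
        if (it != to_id.end()) return it->second;
        uint16_t id = static_cast<uint16_t>(to_word.size());
        to_id.emplace(w, id);
        to_word.push_back(w);
        return id;
    }
};

class compressed_string {
public:
    // Split the path on 'sep' and store one 16-bit code per component.
    compressed_string(const std::string& path, WordDictionary& dict, char sep = '/') {
        size_t start = 0;
        while (start <= path.size()) {
            size_t end = path.find(sep, start);
            if (end == std::string::npos) end = path.size();
            ids_.push_back(dict.intern(path.substr(start, end - start)));
            start = end + 1;
        }
    }

    // Rebuild the original string on demand.
    std::string str(const WordDictionary& dict, char sep = '/') const {
        std::string out;
        for (size_t i = 0; i < ids_.size(); ++i) {
            if (i) out += sep;
            out += dict.to_word[ids_[i]];
        }
        return out;
    }

private:
    std::vector<uint16_t> ids_;  // component codes instead of raw characters
};
```

In this sketch each component costs two bytes per use (plus the per-string vector overhead) and one stored copy of its text in the dictionary.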
2 Answers
Although it might be tempting to tune a specific algorithm for your problem, that is likely to require an unreasonable amount of time and effort, while standard compression techniques will immediately give you a great boost toward solving your memory-consumption problem.
The "standard" way to handle this issue is to chunk the source data into small blocks (such as 256KB) and compress them individually. When accessing data in a block, you need to decode the block first. Therefore, the optimal block size really depends on your application: the more the application streams, the larger the blocks; on the other hand, the more random the access pattern, the smaller the blocks.
If you are worried about compression/decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 will give you about 1 GB/s of decoding performance per core, so that gives you an idea of how many blocks per second you can decode.
If only decompression speed matters, you may use the high-compression variant LZ4-HC, which improves the compression ratio by a further ~30% while also improving decompression speed, at the cost of slower compression.
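For illustration, here is a minimal sketch of the block approach against the liblz4 C API (<lz4.h>); the struct and helper names are mine, and error handling is kept to a bare minimum:

```cpp
#include <lz4.h>

#include <stdexcept>
#include <string>
#include <vector>

struct CompressedBlock {
    std::vector<char> data;  // LZ4-compressed bytes
    int originalSize = 0;    // uncompressed size, needed for decoding
};

// Compress one chunk (e.g. ~256KB of concatenated strings).
CompressedBlock compressBlock(const std::string& chunk) {
    CompressedBlock block;
    block.originalSize = static_cast<int>(chunk.size());
    block.data.resize(LZ4_compressBound(block.originalSize));
    int written = LZ4_compress_default(chunk.data(), block.data.data(),
                                       block.originalSize,
                                       static_cast<int>(block.data.size()));
    if (written <= 0) throw std::runtime_error("LZ4 compression failed");
    block.data.resize(written);
    return block;
}

// Decode a whole block before accessing any string stored inside it.
std::string decompressBlock(const CompressedBlock& block) {
    std::string chunk(block.originalSize, '\0');
    int read = LZ4_decompress_safe(block.data.data(), &chunk[0],
                                   static_cast<int>(block.data.size()),
                                   block.originalSize);
    if (read < 0) throw std::runtime_error("LZ4 decompression failed");
    return chunk;
}
```

To look up an individual string you would additionally record, per string, which block it lives in and its offset/length inside the decompressed chunk. For the higher-ratio variant, replace LZ4_compress_default with LZ4_compress_HC from <lz4hc.h> (it takes one extra compression-level argument); the decoding path stays the same.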
In the sense that they are locators in a hierarchy of the form name, (separator, name)*? If so, you can use interning: store the name parts as char const * elements that point into a pool of strings. That way, you effectively compress a name that is used n times down to just over n * sizeof(char const *) + strlen(name) bytes. The full path would become a sequence of interned names, e.g. an std::vector<char const *>.
It might seem that sizeof(char const *) is big on 64-bit hardware, but you also save some of the allocation overhead. Or, if you know for some reason that you'll never need more than, say, 65536 strings, you might store them as indices into NAME_TABLE, where NAME_TABLE is a static std::unordered_map<uint16_t, char const *>.
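A minimal sketch of the interning idea, assuming the paths split on a single separator character (the names here are illustrative):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Interning pool: each distinct name is stored exactly once,
// and every path becomes a vector of pointers into the pool.
class NamePool {
public:
    char const* intern(const std::string& name) {
        // Pointers/references to elements of an unordered_set stay valid
        // across rehashing, so the returned pointer remains stable.
        auto it = pool_.insert(name).first;
        return it->c_str();
    }

private:
    std::unordered_set<std::string> pool_;
};

// Split a path-like string on 'sep' and intern every component.
std::vector<char const*> internPath(const std::string& path, NamePool& pool,
                                    char sep = '/') {
    std::vector<char const*> parts;
    size_t start = 0;
    while (start <= path.size()) {
        size_t end = path.find(sep, start);
        if (end == std::string::npos) end = path.size();
        parts.push_back(pool.intern(path.substr(start, end - start)));
        start = end + 1;
    }
    return parts;
}
```

Rebuilding the full string is just re-joining the components with the separator; a name used n times then costs n pointers plus a single copy of its characters, as described above.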