用公共部分压缩字符串

发布于 2024-12-09 14:21:54 字数 319 浏览 0 评论 0原文

我有一个管理大量字符串的应用程序。字符串采用类似路径的格式,有许多共同部分,但没有明确的规则。它们不是文件系统上的路径,但可以这样考虑。 我显然需要优化内存消耗,但又不会牺牲很大的性能。

我正在考虑两种选择:
- 实现一个存储压缩数据的compressed_string类,但我需要一个固定的字典,但现在找不到一个库。我不想要字节上的霍夫曼,我想要文字上的霍夫曼。
- 在字符串部分实现某种flyweight模式。

这个问题看起来很常见,我想知道最好的解决方案是什么,或者是否有人知道针对此问题的库。

谢谢

I have an application that manages a large number of strings. Strings are in a path-like format and have many common parts, but without a clear rule. They are not paths on the file-system but can be considered like so.
I clearly need to optimize memory consumption but without a big performance sacrifice.

I am considering 2 options:
- implement a compressed_string class that stores data zipped, but i need a fixed dictionary and i cant find a library for this right now. I don't want a Huffman on bytes, I want it on words.
- implement some kind of flyweight pattern on string parts.

The problem looks like a common one and I'm wonder what is the best solution to it or if someone knows a library that targets this issue.

thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

蓝海似她心 2024-12-16 14:21:54

尽管针对您的问题调整特定算法可能很诱人,但这可能需要不合理的时间和精力,而标准压缩技术将立即为您解决内存消耗问题提供巨大的推动力。

处理这个问题的“标准”方法是将源数据分成小块(例如256KB),然后单独压缩它们。当访问块中的数据时,需要首先对其进行解码。因此,最佳的块大小实际上取决于您的应用程序,即应用程序流越多,块就越大;另一方面,随机访问模式越多,块大小越小。

如果您担心压缩/解压缩速度,请使用高速算法。如果解压速度是最重要的指标(对于访问时间),LZ4 之类的东西将为您提供大约 1GB/s 的每核解码性能,因此这可以让您了解每秒可以处理多少个块解码。

如果只注重解压速度,您可以使用高压缩变体 LZ4-HC,它将压缩比提高约 30%,同时还提高解压速度。

Although it might be tempting to tune a specific algorithm for your problem, it is likely to require an unreasonable amount of time and effort, while standard compression techniques will immediately provide you a great boost to solve your memory consumption problem.

The "standard" way to handle this issue is to chunk source data into small blocks (such as 256KB), and compress them individually. When accessing data into the block, you need to decode it first. Therefore, the optimal block size really depends on your application, i.e. the more the application streams, the larger the blocks; on the other hand, the more random access pattern, the smaller the block size.

If you are worried by the compression/decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 will provide you about 1GB/s decoding performance per core, so this gives you an idea of how many blocks per second you can decode.

If only decompression speed matters, you may use the high-compression variant LZ4-HC, which will boost compression ratio even more by about 30%, while also improving decompression speed.

灰色世界里的红玫瑰 2024-12-16 14:21:54

字符串采用类似路径的格式,并且有许多公共部分,但没有明确的规则。

从某种意义上说,它们是 name、(separatorname)* 形式的层次结构中的定位器?如果是这样,您可以使用interning:存储 name 部分作为指向字符串池的 char const * 元素。这样,您就可以有效地将使用 n 次的名称压缩到刚好超过 n * sizeof(char const *) + strlen(name) 字节。完整路径将成为一系列内部名称,例如 std::vector

看起来 sizeof(char const *) 在 64 位硬件上很大,但您也节省了一些分配开销。或者,如果您出于某种原因知道您永远不会需要超过 65536 个字符串,您可以将它们存储为

class interned_name
{
    uint16_t tab_idx;

  public:
    char const *c_str() const
    {
        return NAME_TABLE[tab_idx];
    }
};

NAME_TABLEstatic std::unordered_map

Strings are in a path-like format and have many common parts, but without a clear rule.

In the sense that they are locators in a hierarchy of the form name, (separator, name)*? If so, you can use interning: store the name parts as char const * elements that point into a pool of strings. That way, you effectively compress a name that is used n times to just over n * sizeof(char const *) + strlen(name) bytes. The full path would become a sequence of interned names, e.g. an std::vector.

It might seem that sizeof(char const *) is big on 64-bit hardware, but you also save some of the allocation overhead. Or, if you know for some reason that you'll never need more than, say, 65536 strings, you might store them as

class interned_name
{
    uint16_t tab_idx;

  public:
    char const *c_str() const
    {
        return NAME_TABLE[tab_idx];
    }
};

where NAME_TABLE is an static std::unordered_map<uint16_t, char const *>.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文