Delphi TStringList 包装器实现即时压缩

发布于 2024-09-10 04:26:21 字数 338 浏览 9 评论 0原文

我有一个用于在 TStringList 中存储许多字符串的应用程序。这些字符串在很大程度上彼此相似,并且我想到可以即时压缩它们 - 即以唯一文本片段加上对先前存储的片段的引用的混合形式存储给定的字符串。字符串列表(例如完全限定的路径和文件名列表)应该能够被大大压缩。

有谁知道实现此功能的 TStringlist 后代 - 即提供对未压缩字符串的读写访问,但将它们存储在内部压缩,以便 TStringList.SaveToFile 生成压缩文件?

虽然您可以通过在每次访问之前解压缩整个字符串列表并在之后重新压缩来实现此目的,但它会不必要地变慢。我正在寻找对增量操作和随机“搜索”和读取有效的东西。

TIA 罗斯

I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.

Does anyone know of a TStringlist descendant that implement this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?

While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.

TIA
Ross

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

夜深人未静 2024-09-17 04:26:21

我认为没有任何免费可用的实现(无论如何我都不知道,尽管我已经在商业代码中编写了至少 3 个类似的构造),所以你必须推出自己的实现。

马塞洛关于按顺序添加项目的评论非常相关,因为我想您可能希望在添加时压缩数据 - 快速访问已经与添加的条目相似的条目,可以提供比必须要更好的性能在整个集合中查找“最适合的条目”(相似性压缩所需)。

您可能想阅读的另一件事是“绳索”——概念上与字符串不同的类型,它我不久前已经向 Marco Cantu 提出了建议。以每个“twine”的下一个指针为代价(由于缺乏更好的词),您可以连接字符串的各个部分,而不保留任何重复的数据。主要问题是如何检索可以组合成新“绳子”的部分,代表原始字符串。一旦问题得到解决,您就可以随时将数据重建为字符串,同时仍然具有紧凑的存储空间。

如果您不想走“绳索”路线,您还可以尝试称为“前缀减少”的方法,这是一种简单的压缩形式 - 只需使用前一个字符串的索引和字符数来开始每个字符串应将其视为新字符串的前缀。请注意,您不应该递归得太远,否则访问速度会受到很大影响。在一个简单的实现中,我对索引执行了 mod 16,以建立开始前缀缩减的条目,这平均节省了大约 40% 的内存(这个数字完全取决于数据)当然)。

I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.

The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added, gives a much better performance than having to look up a 'best fit entry' (needed for similarity-compression) over the entire set.

Another thing you might want to read up about, are 'ropes' - a conceptually different type than strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word) you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope', representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.

If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access-speed will suffer greatly. In one simple implementation, I did a mod 16 on the index, to establish the entry at which prefix-reduction started, which gave me on average about 40% memory savings (this number is completely data-dependant of course).

抚你发端 2024-09-17 04:26:21

您可以尝试将 Delphi 或 COM API 封装在 Judy 数组 周围。 JudySL 类型就可以解决这个问题,并且有一个相当简单的界面。

编辑:我假设您正在存储唯一的字符串,并且希望(或乐意)按字典顺序存储它们。如果这些约束不可接受,那么 Judy 数组不适合您。请注意,如果不对字符串进行排序,任何压缩系统都会受到影响。

You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.

EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.

轻拂→两袖风尘 2024-09-17 04:26:21

我想您期望列表具有一般灵活性(包括删除操作),在这种情况下,我不知道任何开箱即用的解决方案,但我建议采用两种方法之一:

  • 您将字符串拆分为词和
    保持分开增长的字典
    引用单词并在内部保存索引列表

  • 您实现了与
    zlib 流在 Delphi 中可用,但由以下块操作
    例如可以包含 10-100
    字符串。在这种情况下你仍然有
    重新压缩/压缩完整的
    块,但您支付的“价格”较低。

I suppose you expect general flexibility from the list (including delete operation), in this case I don't know about any out of the box solution, but I'd suggest one of the two approaches:

  • You split your string into words and
    keep separated growning dictionary
    to reference the words and save list of indexes internally

  • You implement something related to
    zlib stream available in Delphi, but operating by the block that
    for example can contains 10-100
    strings. In this case you still have
    to recompress/compress the complete
    block, but the "price" you pay is lower.

时间你老了 2024-09-17 04:26:21

我不认为你真的想压缩内存中的 TStrings 项目,因为它效率非常低。我建议您查看 Zlib 单元中的 TStream 实现。只需在加载时将常规流包装到 TDecompressionStream 中,在保存时将常规流包装到 TCompressionStream 中(您甚至可以在那里发出 gzip 标头)。
提示:您需要重写 LoadFromStream/SaveToStream 而不是 LoadFromFile/SaveToFile

I dont think you really want to compress TStrings items in memory, because it terribly ineffecient. I suggest you to look at TStream implementation in Zlib unit. Just wrap regular stream into TDecompressionStream on load and TCompressionStream on save (you can even emit gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文