在滑动窗口中查找字符串匹配的算法

发布于 2024-07-14 19:45:00 字数 460 浏览 9 评论 0原文

像 ZIP 这样的文件压缩的​​核心步骤之一就是使用之前解码的文本作为参考源。 例如,编码流可能会说“接下来的 219 个输出字符与 5161 字节前解码流中的字符相同”。 这让您只需 3 个字节左右即可表示 219 个字符。 (ZIP 的功能远不止于此,例如霍夫曼压缩,但我只是谈论参考匹配。)

我的问题是字符串匹配算法的策略是什么。 即使查看 zlib 等源代码似乎也无法很好地描述压缩匹配算法。

该问题可以表述为:给定一个文本块(例如 30K)和一个输入字符串,在 30K 文本中找到与输入字符串前面完全匹配的最长引用。”该算法在迭代时必须高效,即,将通过从前面删除一些字节并在后面添加新字节以及执行新的匹配来更新 30K 文本块。

我对算法的讨论更感兴趣。 (s) 要做到这一点,不是源代码或库(zlib 有非常好的源代码!)我怀疑可能有几种具有不同权衡的方法。

One of the core steps in file compression like ZIP is to use the previous decoded text as a reference source. For example, the encoded stream might say "the next 219 output characters are the same as the characters from the decoded stream 5161 bytes ago." This lets you represent 219 characters with just 3 bytes or so. (There's more to ZIP than that, like Huffman compression, but I'm just talking about the reference matching.)

My question is what the strategy(ies) for the string matching algorithm is. Even looking at source code from zlib and such don't seem to give a good description of the compression matching algorithm.

The problem might be stated as: Given a block of text, say 30K of it, and an input string, find the longest reference in the 30K of text which exactly matches the front of the input string." The algorithm must be efficient when iterated, ie, the 30K block of text will be updated by deleting some bytes from the front and adding new ones to the rear and a new match performed.

I'm a lot more interested in discussions of the algorithm(s) to do this, not source code or libraries. (zlib has very good source!) I suspect there may be several approaches with different tradeoffs.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

怼怹恏 2024-07-21 19:45:00

好吧,我注意到您详细介绍了该问题,但没有提及 RFC 1951(DEFLATE 压缩数据格式的规范,即 ZIP 中使用的格式),这让我相信您可能错过了此资源。

他们的基本方法是使用三字节序列作为键的链式哈希表。 只要链不为空,就会扫描其上的所有条目,以 a) 消除错误冲突,b) 消除太旧的匹配,以及 c) 从剩余的匹配中选择最长的匹配。

(请注意,他们的建议是由专利因素决定的;他们可能知道一种更有效的技术,但无法确定它是否未被某人的专利涵盖。就我个人而言,我一直想知道为什么人们不能通过检查从传入数据的第二个字节、第三个字节等开始的三字节序列的匹配来找到最长的匹配,并剔除不匹配的匹配,即,如果您的传入数据是“。 ABCDEFG...”并且您在偏移量 100、302 和 416 处获得了“ABC”的哈希匹配,但“BCD”的唯一哈希匹配位于偏移量 301 处,您知道,除非您有两个完全重合的重叠哈希匹配 - - 不太可能 - 那么 302 是最长的匹配。)

还要注意他们对可选“惰性匹配”的建议(讽刺的是,它做了更多工作):压缩器不是自动采用从传入数据的第一个字节开始的最长匹配,而是从下一个字节开始检查更长的匹配。 如果您的传入数据是“ABCDE ...”,并且滑动窗口中唯一的匹配项是“ABC”和“BCDE”,则最好将“A”编码为文字字节,将“BCDE”编码为一场比赛。

Well, I notice that you go into some detail about the problem but don't mention the information provided in section 4 of RFC 1951 (the specification for the DEFLATE Compressed Data Format, i.e. the format used in ZIP) which leads me to believe you might have missed this resource.

Their basic approach is a chained hash table using three-byte sequences as keys. As long as the chain is not empty, all the entries along it are scanned to a) eliminate false collisions, b) eliminate matches that are too old, and c) pick the longest match out of those remaining.

(Note that their recommendation is shaped by the factor of patents; it may be that they knew of a more effective technique but could not be sure that it was not covered by someone's patent. Personally, I've always wondered why one couldn't find the longest matches by examining the matches for the three-byte sequences that start at the second byte of the incoming data, the third byte, etc. and weeding out matches that don't match up. i.e., if your incoming data is "ABCDEFG..." and you've got hash matches for "ABC" at offsets 100, 302 and 416 but your only hash match for "BCD" is at offset 301, you know that unless you have two entirely coincidental overlapping hash matches -- unlikely -- then 302 is your longest match.)

Also note their recommendation of optional "lazy matching" (which ironically does more work): instead of automatically taking the longest match that starts at the first byte of the incoming data, the compressor checks for an even longer match starting at the next byte. If your incoming data is "ABCDE..." and your only matches in the sliding window are for "ABC" and for "BCDE", you're better off encoding the "A" as a literal byte and the "BCDE" as a match.

安静被遗忘 2024-07-21 19:45:00

您可以查看 LZMA 算法 的详细信息 7-zip。 7-zip 作者声称对 zlib 等人使用的算法进行了改进。

You could look at the details of the LZMA Algorithm used by 7-zip. The 7-zip author claims to have improved on the algorithm used by zlib et al.

仅此而已 2024-07-21 19:45:00

我认为您正在描述最长公共子串问题的修改版本。

I think you're describing a modified version of the Longest Common Substring Problem.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文