找到提供最佳压缩的前缀子字符串

发布于 2024-07-06 04:57:34 字数 646 浏览 10 评论 0原文

问题：

给定一个字符串列表，找到一个子字符串，如果从它匹配的所有字符串的开头减去该子字符串并用转义字节替换，则给出最短的总长度。

示例：

"foo", "fool", "bar"

结果是：“foo”作为基本字符串，包含字符串 "\0"、"\0l"、"bar"，总长度为 9 个字节。 "\0" 是转义字节。原始字符串的长度之和是10，所以在这种情况下我们只保存了一个字节。

一个朴素的算法看起来像：

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

这将为我们提供答案，但它类似于 O((n*m)^2)，这太昂贵了。

原文

Problem:

Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.

Example:

"foo", "fool", "bar"

The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.

A naive algorithm would look like:

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

That will give us the answer, but it's something like O((n*m)^2), which is too expensive.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

春花秋月 2024-07-13 04:57:34

使用前缀树（trie）森林...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

然后，我们可以找到最好的结果，并通过最大化 （深度 * 频率） 来保证它，它将被您的转义字符替换。您可以通过对最大值进行分支定界深度优先搜索来优化搜索。

关于复杂性：O(C)，正如评论中提到的，对于构建它以及寻找最优的，这取决于。如果您对第一个元素的频率进行排序（O(A)——其中 A 是语言字母表的大小），那么您将能够切出更多分支，并且很有可能获得亚线性时间。

我想这已经很清楚了，我不打算写下来——这是什么作业？ ;)

Use a forest of prefix trees (trie)...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

then, we can find the best result, and guarantee it, by maximizing (depth * frequency) which will be replaced with your escape character. You can optimize the search by doing a branch and bound depth first search for the maximum.

On the complexity: O(C), as mentioned in comment, for building it, and for finding the optimal, it depends. If you order the first elements frequency (O(A) --where A is the size of the languages alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.

I think this is clear, I am not going to write it up --what is this a homework assignment? ;)

回复收藏 0 原文