找到提供最佳压缩的前缀子字符串

发布于 2024-07-06 04:57:34 字数 646 浏览 10 评论 0原文

问题:

给定一个字符串列表,找到一个子字符串,如果从它匹配的所有字符串的开头减去该子字符串并用转义字节替换,则给出最短的总长度。

示例:

"foo", "fool", "bar"

结果是:“foo”作为基本字符串,包含字符串 "\0""\0l""bar",总长度为 9 个字节。 "\0" 是转义字节。 原始字符串的长度之和是10,所以在这种情况下我们只保存了一个字节。

一个朴素的算法看起来像:

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

这将为我们提供答案,但它类似于 O((n*m)^2),这太昂贵了。

Problem:

Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.

Example:

"foo", "fool", "bar"

The result is: "foo" as the base string with the strings "\0", "\0l", "bar" and a total length of 9 bytes. "\0" is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.

A naive algorithm would look like:

for string in list
  for i = 1, i < length of string
      calculate total length based on prefix of string[0..i]
      if better than last best, save it
return the best prefix

That will give us the answer, but it's something like O((n*m)^2), which is too expensive.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

春花秋月 2024-07-13 04:57:34

使用前缀树(trie)森林...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

然后,我们可以找到最好的结果,并通过最大化 (深度 * 频率) 来保证它,它将被您的转义字符替换。 您可以通过对最大值进行分支定界深度优先搜索来优化搜索。

关于复杂性:O(C),正如评论中提到的,对于构建它以及寻找最优的,这取决于。 如果您对第一个元素的频率进行排序(O(A)——其中 A 是语言字母表的大小),那么您将能够切出更多分支,并且很有可能获得亚线性时间。

我想这已经很清楚了,我不打算写下来——这是什么作业? ;)

Use a forest of prefix trees (trie)...

  f_2    b_1
 /       |
 o_2     a_1
 |       |
 o_2     r_1
 |
 l_1

then, we can find the best result, and guarantee it, by maximizing (depth * frequency) which will be replaced with your escape character. You can optimize the search by doing a branch and bound depth first search for the maximum.

On the complexity: O(C), as mentioned in comment, for building it, and for finding the optimal, it depends. If you order the first elements frequency (O(A) --where A is the size of the languages alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.

I think this is clear, I am not going to write it up --what is this a homework assignment? ;)

云淡月浅 2024-07-13 04:57:34

好吧,第一步是对列表进行排序。 然后遍历列表,将每个元素与前一个元素进行比较,跟踪最长的 2 字符、3 字符、4 字符等运行。 下图是 20 个 3 字符前缀优于 15 个 4 字符前缀。

Well, first step would be to sort the list. Then one pass through the list, comparing each element with the previous, keeping track of the longest 2-character, 3-character, 4-character etc runs. Then figure is the 20 3-character prefixes better than the 15 4-character prefixes.

青朷 2024-07-13 04:57:34

我会尝试从对列表进行排序开始。 然后,您只需从一个字符串到另一个字符串,将第一个字符与下一个字符串的第一个字符进行比较。 一旦找到匹配项,您就会查看下一个字符。 您需要设计一种方法来跟踪迄今为止的最佳结果。

I would try starting by sorting the list. Then you simply go from string to string comparing the first character to the next string's first char. Once you have a match you would look at the next char. You would need to devise a way to track the best result so far.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文