找到提供最佳压缩的前缀子字符串
问题:
给定一个字符串列表,找到一个子字符串,如果从它匹配的所有字符串的开头减去该子字符串并用转义字节替换,则给出最短的总长度。
示例:
"foo"
, "fool"
, "bar"
结果是:“foo”作为基本字符串,包含字符串 "\0"
、"\0l"
、"bar"
,总长度为 9 个字节。 "\0"
是转义字节。 原始字符串的长度之和是10,所以在这种情况下我们只保存了一个字节。
一个朴素的算法看起来像:
for string in list
for i = 1, i < length of string
calculate total length based on prefix of string[0..i]
if better than last best, save it
return the best prefix
这将为我们提供答案,但它类似于 O((n*m)^2),这太昂贵了。
Problem:
Given a list of strings, find the substring which, if subtracted from the beginning of all strings where it matches and replaced by an escape byte, gives the shortest total length.
Example:
"foo"
, "fool"
, "bar"
The result is: "foo" as the base string with the strings "\0"
, "\0l"
, "bar"
and a total length of 9 bytes. "\0"
is the escape byte. The sum of the length of the original strings is 10, so in this case we only saved one byte.
A naive algorithm would look like:
for string in list
for i = 1, i < length of string
calculate total length based on prefix of string[0..i]
if better than last best, save it
return the best prefix
That will give us the answer, but it's something like O((n*m)^2), which is too expensive.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
使用前缀树(trie)森林...
然后,我们可以找到最好的结果,并通过最大化
(深度 * 频率)
来保证它,它将被您的转义字符替换。 您可以通过对最大值进行分支定界深度优先搜索来优化搜索。关于复杂性:O(C),正如评论中提到的,对于构建它以及寻找最优的,这取决于。 如果您对第一个元素的频率进行排序(O(A)——其中 A 是语言字母表的大小),那么您将能够切出更多分支,并且很有可能获得亚线性时间。
我想这已经很清楚了,我不打算写下来——这是什么作业? ;)
Use a forest of prefix trees (trie)...
then, we can find the best result, and guarantee it, by maximizing
(depth * frequency)
which will be replaced with your escape character. You can optimize the search by doing a branch and bound depth first search for the maximum.On the complexity: O(C), as mentioned in comment, for building it, and for finding the optimal, it depends. If you order the first elements frequency (O(A) --where A is the size of the languages alphabet), then you'll be able to cut out more branches, and have a good chance of getting sub-linear time.
I think this is clear, I am not going to write it up --what is this a homework assignment? ;)
好吧,第一步是对列表进行排序。 然后遍历列表,将每个元素与前一个元素进行比较,跟踪最长的 2 字符、3 字符、4 字符等运行。 下图是 20 个 3 字符前缀优于 15 个 4 字符前缀。
Well, first step would be to sort the list. Then one pass through the list, comparing each element with the previous, keeping track of the longest 2-character, 3-character, 4-character etc runs. Then figure is the 20 3-character prefixes better than the 15 4-character prefixes.
我会尝试从对列表进行排序开始。 然后,您只需从一个字符串到另一个字符串,将第一个字符与下一个字符串的第一个字符进行比较。 一旦找到匹配项,您就会查看下一个字符。 您需要设计一种方法来跟踪迄今为止的最佳结果。
I would try starting by sorting the list. Then you simply go from string to string comparing the first character to the next string's first char. Once you have a match you would look at the next char. You would need to devise a way to track the best result so far.