检查字符串是否是从子字符串列表构建的算法
给你一个字符串和一个字符串数组。如何快速检查该字符串是否可以通过连接数组中的某些字符串来构建?
这是一个理论问题,出于实际原因我不需要它。但我想知道是否有一些好的算法。
编辑 阅读一些答案我注意到,这可能是 NP 完全问题。即使找到字符串的子集,它们的长度也相同,因为给定的字符串是一个经典的子集和问题。
所以我想这个问题没有简单的答案。
编辑
现在看来,这毕竟不是一个 NP 完全问题。这太酷了:-)
编辑
我想出了一个通过一些测试的解决方案:
def can_build_from_substrings(string, substrings):
prefixes = [True] + [False] * (len(string) - 1)
while True:
old = list(prefixes)
for s in substrings:
for index, is_set in enumerate(prefixes):
if is_set and string[index:].startswith(s):
if string[index:] == s:
return True
prefixes[index + len(s)] = True
if old == prefixes: # nothing has changed in this iteration
return False
我相信时间是O(n * m^3)
,其中n
是substrings
的长度,m
是string
的长度。你怎么认为?
You are given a string and an array of strings. How to quickly check, if this string can be built by concatenating some of the strings in the array?
This is a theoretical question, I don't need it for practical reasons. But I would like to know, if there is some good algorithm for this.
EDIT
Reading some answer I have noticed, that this is probably NP-Complete problem. Even finding a subset of strings, that will together have same length, as a given string is a classic subset sum problem.
So I guess there is no easy answer to this.
EDIT
Now it seems, that it is not a NP-Complete problem after all. That's way cooler :-)
EDIT
I have came up with a solution that passes some tests:
def can_build_from_substrings(string, substrings):
prefixes = [True] + [False] * (len(string) - 1)
while True:
old = list(prefixes)
for s in substrings:
for index, is_set in enumerate(prefixes):
if is_set and string[index:].startswith(s):
if string[index:] == s:
return True
prefixes[index + len(s)] = True
if old == prefixes: # nothing has changed in this iteration
return False
I believe the time is O(n * m^3)
, where n
is length of substrings
and m
is length of string
. What do you think?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(12)
注意:我假设您可以多次使用每个子字符串。您可以通过更改我们定义子问题的方式来概括解决方案以包含此限制。这将对空间和预期运行时间产生负面影响,但问题仍然是多项式。
这是一个动态规划问题。 (这是一个很好的问题!)
如果可以使用子字符串列表
W
编写字符串S
,那么我们将composable(S, W)
定义为 true代码>.S
是可组合的当且仅当:S
以W
中的子字符串w
开头。w
之后的S
的其余部分也是可组合的。让我们写一些伪代码:
该算法的运行时间为 O(m*n),假设子字符串的长度与字符串本身不是线性的,在这种情况下,运行时间将为 O(m*n^2) (其中 m 是子字符串列表的大小,n 是相关字符串的长度)。它使用 O(n) 空间进行记忆。
(注意,伪代码使用 O(n^2) 空间,但对记忆键进行散列可以缓解这种情况。)
编辑
这是一个有效的 Ruby 实现:
Note: I assume here that you can use each substring more than once. You can generalize the solution to include this restriction by changing how we define subproblems. That will have a negative impact on space as well as expected runtime, but the problem remains polynomial.
This is a dynamic programming problem. (And a great question!)
Let's define
composable(S, W)
to be true if the stringS
can be written using the list of substringsW
.S
is composable if and only if:S
starts with a substringw
inW
.S
afterw
is also composable.Let's write some pseudocode:
This algorithm has O(m*n) runtime, assuming the length of the substrings is not linear w/r/t to the string itself, in which case runtime would be O(m*n^2) (where m is the size of the substring list and n is the length of the string in question). It uses O(n) space for memoization.
(N.B. as written the pseudocode uses O(n^2) space, but hashing the memoization keys would alleviate this.)
EDIT
Here is a working Ruby implementation:
这绝对不是很快,但你有一个想法:
当剩下 0 长度的目标字符串时停止。
正如我之前所说,这绝对不是很快,但应该给你一个基线(“它不应该比这更糟糕”)。
编辑
正如评论中指出的,这是行不通的。您将必须存储部分匹配项,并在发现没有进一步的方法时依靠它们。
这样,最终你将探索整个空间解决方案。对于每个候选头,您将尝试所有可能的尾部。
It's definitely not quick but you here's an idea:
Stop when you're left with a 0 length target string.
As I said before, this is definitely not fast but should give you a baseline ("it shouldn't get much worse than this").
EDIT
As pointed out in the comments, this will not work. You will have to store the partial matches and fall back on them when you find there is no way further.
This way, eventually you'll explore the entire space of solutions. For every candidate head you'll try every possible tail.
我就是这样做的。
生成所有排列是一项处理器繁重的任务,因此如果您可以减少“n”(输入大小),您将获得相当大的效率。
This is how I would do it.
Generating all permutations is a processor heavy task, so if you can cut down on your 'n' (input size), you'll gain some considerable efficiency.
受到@cnicutars答案的启发:
Possible(array A, string s)
s
为空,则返回 true。A
中以s
为前缀的所有字符串的数组P
。P
为空,则返回 false。P
中的每个字符串p
:可能(A 删除了 p,s 删除了前缀 p)
返回 trueInspired by @cnicutars answer:
Possible(array A, string s)
s
is empty, return true.P
of all strings inA
that are a prefix ofs
.P
is empty, return false.p
inP
:Possible(A with p removed, s with prefix p removed)
return true脑海中浮现出两个选项,但它们似乎都不是很优雅。
1) 暴力破解:像密码生成器一样操作,即 word1+word1+word1 >单词1+单词1+单词2> word1+word1+word3 等等等等,
技巧是长度,所以你必须尝试 2 个或更多单词的所有组合,而且你不知道在哪里设置限制。非常耗时。
2) 获取有问题的字符串,并针对一次有 1 个单词的每个单词运行查找。也许检查长度,如果它大于 0,请再做一次。继续这样做,直到你达到零为止,它无法找到更多结果。如果你击中 0 则获胜,否则失败。我认为这种方法会比第一种方法好很多,但我想有人会有更好的建议。
two options sprint to mind but neither of them seem very elegant.
1) brute force: do it like you would a password generator i.e. word1+word1+word1 > word1+word1+word2 > word1+word1+word3 etc etc etc
the trick there is the length so youd have to try all combinations of 2 or more words and you don't know where to set the limit. Very time consuming.
2) take the string in question and run a find in on it for every word you have 1 at a time. maybe check the length and if its greater than 0 do it again. keep doing it till you hit zero it cant find any more results. if you hit 0 its a win if not its a lose. I think this method would be a lot better than the first but I imagine someone will have a better suggestion.
在我看来,一个问题可以通过简单的线性遍历数组和比较来解决。然而,可能存在多次通过。您可以设计一种策略来最大程度地减少传递次数。例如,在第一遍中构造原始字符串的所有子字符串的子数组。然后线性地尝试不同的变化。
It seems to me a problem can be solved by simple linearly traversing of array and comparison. However there could be multiple pass. You can devise a strategy to minimize the passes. For example constructing a sub array of all the substrings of the original string in first pass. Then try out different variations linearly.
这是一个应该可行的粗略想法。
一个。获取一个子字符串,如果 copy.contains(substr) copy.remove(substr)
编辑:
可能改进这一点的一种方法是首先迭代所有子字符串并丢弃主字符串中未包含的任何子字符串。然后执行上述步骤。
Here is a rough idea that should work.
a. Grab a sub string, if copy.contains(substr) copy.remove(substr)
Edit:
A way to possibly improve this would be to first iterate all of the substrings and throw out any that are not contained in the main string. Then go through the above steps.
我建议使用后缀树(使用 Ukkonen 的在线算法来构建它),它似乎适合搜索两个文本中的公共子字符串。
您可以在维基百科/特殊来源中找到更多信息。
任务是
让你看到存在非常酷的解决方案。希望这对你有用。
这实际上更适合重复扫描,而不是单次扫描。
Let me suggest using Suffix Trees (using Ukkonen's online algorithm to build it) which seems to be suitable in terms of searching common substrings in two texts.
You could find more information in wikipedia/special sources.
The task is
so you see there exists very cool solution. Hope this will work for you.
This is actually more suitable for repeating scans, rather than a single scan.
如果每个子字符串只能使用一次,但不必使用所有子字符串...
对于大小等于原始字符串的子字符串中的每个大小为 N 的排列,请检查它,如果没有,则进行 N+1 的排列项,依此类推,直到穷尽所有排列。
当然,NP 完全,慢得要命,但我认为不存在正常的解决方案。
要解释为什么从原始字符串中删除子字符串的解决方案永远不起作用:
有一个字符串“1234123”和数组“12”,“34”,“123”。如果您从开头删除“123”,则会出现漏报。一个类似的例子,从末尾删除将是:“1234123”:“23,”41“,”123“。
使用贪婪回溯:(m字符串长度7,n num元素3)
- 最长:123
- 从第一次出现中删除它 O(3)
- 尝试其他两个:no go + O((n-1)*(m-3))
- 回溯 O(1)
- 从第二个中删除:O(m-3)
- 尝试其他两个 O((n-1)*m-3)
= O(30)
1 + 2 + 3 的排列 = O(3) + O(4) + O(6) = O(13)。
因此,对于小子集长度,排列实际上比回溯更快。如果您要求查找大量子字符串(在大多数情况下但不是全部),情况将会改变。
您可以仅从数组中删除不存在的子字符串,以将每个删除的不存在子字符串的排列数从 n^n 减少到 n^(n-1)。
If each substring must be used only once but not all of them must be used...
For each permutation of size N from the substrings that is equal in size to the original string check it, if none, do a permutation of N+1 items, end so forth, until you exhaust all the permutations.
Of course NP complete, slow as hell, but i think that no normal solutions exist.
To explain why the solutions where removing substrings from the original string won't ever work:
Have a string "1234123" and array "12","34","123". If you remove "123" from the start, you have a false negative. A similar example where removing from the end would be: "1234123" : "23,"41","123".
With backtracking with greedy: (m string length 7, n num elements 3)
- take the longest: 123
- remove it from first occurence O(3)
- try other two with the rest: no go + O((n-1)*(m-3))
- backtrack O(1)
- remove from second: O(m-3)
- try other two O((n-1)*m-3)
= O(30)
Permutations of 1 + 2 + 3 = O(3) + O(4) + O(6) = O(13).
So for small subset lenght permutations are actually faster than backtracking. This will change if you ask for a lot of substrings to find (in most cases but not all).
You can remove only the nonexisting substrings from the array to lower the number of permutations from n^n to n^(n-1) for each removed nonexisting substring.
您正在寻找的是一个解析器。解析器将检查某个单词是否属于某种语言。我不确定你的问题的确切计算复杂性。上面的一些似乎是正确的(根本不需要穷举搜索)。可以肯定的一件事是,它不是 NP 完全的。
你的语言的字母表将是所有的小子串。
您要查找的单词就是您拥有的字符串。
正则表达式可以是一个简单的 Kleene 星形,也可以是一个非常简单的上下文无关语法,只不过是 Or 的语法。
算法中的主要问题是:如果某些子字符串实际上是其他子字符串的子字符串……也就是说,如果我们有子字符串:“ab”,“abc”,“abcd”,...,怎么办?在这种情况下,检查子字符串的顺序将改变复杂性。为此,我们有 LR 解析器。我想他们是解决此类问题的最佳人选。
我会尽快为您找到确切的解决方案。
What you are looking for is a parser. A parser will check whether a certain word belongs to a certain language. I am not sure of the exact computattional complexity of your problem. Some of the above seems to be correct (there is no need at all for exhaustive search). One thing for sure, it s not NP-Complete.
The alphabet of your language would be all the small substrings.
The word you are looking for is the string you have.
A regular expression can be a simple Kleene star, or a a very simply context free grammar that is nothing but Or's.
The main issue in the algorithm is: what if the some of the substrings are actually substrings to other substrings ... that is, what if we have substrings: "ab", "abc", "abcd", ... , In this case, the order of checking the substrings will change the complexity. For this, we have LR-parsers. I guess they are the best in solving such problems.
I will find you the exact solution soon.
我使用动态规划原理为此提出了一种线性方案。
函数 foo 接受一个字符串和一个子字符串列表来决定是否可以从给定列表构造给定字符串
为了可读性而分解的一个行:
I have come up with a one liner for this using dynamic programming principles.
The function foo takes a string and a list of substrings to decide if it's possible to construct the given string from the given list
The one liner broken down for readability:
这又如何呢?
它本质上只是遍历所有允许的子字符串,如果它们存在于字符串中,它将删除它们,然后检查结果字符串是否为
""
。What about this?
It esentially just goes through all the allowed substrings and if they're present in the string it removes them and then checks if the resulting string is
""
.