检查字符串是否是从子字符串列表构建的算法

发布于 2024-11-07 11:03:13 字数 1013 浏览 4 评论 0原文

给你一个字符串和一个字符串数组。如何快速检查该字符串是否可以通过连接数组中的某些字符串来构建?

这是一个理论问题,出于实际原因我不需要它。但我想知道是否有一些好的算法。

编辑 阅读一些答案我注意到,这可能是 NP 完全问题。即使找到字符串的子集,它们的长度也相同,因为给定的字符串是一个经典的子集和问题。

所以我想这个问题没有简单的答案。

编辑

现在看来,这毕竟不是一个 NP 完全问题。这太酷了:-)

编辑

我想出了一个通过一些测试的解决方案:

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

我相信时间是O(n * m^3),其中nsubstrings 的长度,mstring 的长度。你怎么认为?

You are given a string and an array of strings. How to quickly check, if this string can be built by concatenating some of the strings in the array?

This is a theoretical question, I don't need it for practical reasons. But I would like to know, if there is some good algorithm for this.

EDIT
Reading some answer I have noticed, that this is probably NP-Complete problem. Even finding a subset of strings, that will together have same length, as a given string is a classic subset sum problem.

So I guess there is no easy answer to this.

EDIT

Now it seems, that it is not a NP-Complete problem after all. That's way cooler :-)

EDIT

I have came up with a solution that passes some tests:

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

I believe the time is O(n * m^3), where n is length of substrings and m is length of string. What do you think?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(12

掀纱窥君容 2024-11-14 11:03:13

注意:我假设您可以多次使用每个子字符串。您可以通过更改我们定义子问题的方式来概括解决方案以包含此限制。这将对空间和预期运行时间产生负面影响,但问题仍然是多项式。

这是一个动态规划问题。 (这是一个很好的问题!)

如果可以使用子字符串列表 W 编写字符串 S,那么我们将 composable(S, W) 定义为 true代码>.

S 是可组合的当且仅当:

  1. SW 中的子字符串 w 开头。
  2. w 之后的 S 的其余部分也是可组合的。

让我们写一些伪代码:

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

该算法的运行时间为 O(m*n),假设子字符串的长度与字符串本身不是线性的,在这种情况下,运行时间将为 O(m*n^2) (其中 m 是子字符串列表的大小,n 是相关字符串的长度)。它使用 O(n) 空间进行记忆。

(注意,伪代码使用 O(n^2) 空间,但对记忆键进行散列可以缓解这种情况。)

编辑

这是一个有效的 Ruby 实现:

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end

Note: I assume here that you can use each substring more than once. You can generalize the solution to include this restriction by changing how we define subproblems. That will have a negative impact on space as well as expected runtime, but the problem remains polynomial.

This is a dynamic programming problem. (And a great question!)

Let's define composable(S, W) to be true if the string S can be written using the list of substrings W.

S is composable if and only if:

  1. S starts with a substring w in W.
  2. The remainder of S after w is also composable.

Let's write some pseudocode:

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

This algorithm has O(m*n) runtime, assuming the length of the substrings is not linear w/r/t to the string itself, in which case runtime would be O(m*n^2) (where m is the size of the substring list and n is the length of the string in question). It uses O(n) space for memoization.

(N.B. as written the pseudocode uses O(n^2) space, but hashing the memoization keys would alleviate this.)

EDIT

Here is a working Ruby implementation:

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end
岁吢 2024-11-14 11:03:13

这绝对不是很快,但你有一个想法:

  • 迭代所有字符串,检查目标字符串是否以其中任何字符串“开头”
  • 获取目标字符串开头的最长字符串,将其从列表中删除并从主字符串中修剪它string
  • 冲洗,重复

当剩下 0 长度的目标字符串时停止。

正如我之前所说,这绝对不是很快,但应该给你一个基线(“它不应该比这更糟糕”)。

编辑

正如评论中指出的,这是行不通的。您将必须存储部分匹配项,并在发现没有进一步的方法时依靠它们。

  • 当发现某个字符串是目标的头部时,将其推送到列表中。建立列表后,你自然会尝试目标中最大的“头”,
  • 当你发现你尝试过的头与剩下的不符时,尝试下一个最好的头,

这样,最终你将探索整个空间解决方案。对于每个候选头,您将尝试所有可能的尾部。

It's definitely not quick but you here's an idea:

  • Iterate over all the strings, checking if the target string "begins" with any of them
  • Take the longest string with which the target string begins, remove it from the list and trim it from the main string
  • Rinse, repeat

Stop when you're left with a 0 length target string.

As I said before, this is definitely not fast but should give you a baseline ("it shouldn't get much worse than this").

EDIT

As pointed out in the comments, this will not work. You will have to store the partial matches and fall back on them when you find there is no way further.

  • When you find that a string is the head of the target, push it onto a list. After building the list, you will naturally try the biggest "head" of the target
  • When you find that the head you tried doesn't fit with what's left, try the next best head

This way, eventually you'll explore the entire space of solutions. For every candidate head you'll try every possible tail.

止于盛夏 2024-11-14 11:03:13

我就是这样做的。

  1. 确定目标字符串的长度。
  2. 确定子字符串数组中每个字符串的长度
  3. 确定哪种子字符串组合会产生与目标字符串长度相同的字符串(如果有,如果没有则完成)
  4. 生成步骤 3 中确定的子字符串组合的所有排列。检查其中是否有任何与目标字符串匹配。

生成所有排列是一项处理器繁重的任务,因此如果您可以减少“n”(输入大小),您将获得相当大的效率。

This is how I would do it.

  1. Determine the length of the target string.
  2. Determine the length of each string in the substring array
  3. Determine which combination of substrings would yield a string with the same length as the target string (If any, if not you're done)
  4. Generate all permutations of the substring combinations determined in step 3. Check if any of them match the target string.

Generating all permutations is a processor heavy task, so if you can cut down on your 'n' (input size), you'll gain some considerable efficiency.

弱骨蛰伏 2024-11-14 11:03:13

受到@cnicutars答案的启发:

  • 函数Possible(array A, string s)
    • 如果 s 为空,则返回 true。
    • 计算 A 中以 s 为前缀的所有字符串的数组 P
    • 如果P为空,则返回 false。
    • 对于 P 中的每个字符串 p
      • 如果可能(A 删除了 p,s 删除了前缀 p) 返回 true
    • 返回错误

Inspired by @cnicutars answer:

  • function Possible(array A, string s)
    • If s is empty, return true.
    • compute the array P of all strings in A that are a prefix of s.
    • If P is empty, return false.
    • for each string p in P:
      • if Possible(A with p removed, s with prefix p removed) return true
    • return false
梦醒时光 2024-11-14 11:03:13

脑海中浮现出两个选项,但它们似乎都不是很优雅。

1) 暴力破解:像密码生成器一样操作,即 word1+word1+word1 >单词1+单词1+单词2> word1+word1+word3 等等等等,

技巧是长度,所以你必须尝试 2 个或更多单词的所有组合,而且你不知道在哪里设置限制。非常耗时。

2) 获取有问题的字符串,并针对一次有 1 个单词的每个单词运行查找。也许检查长度,如果它大于 0,请再做一次。继续这样做,直到你达到零为止,它无法找到更多结果。如果你击中 0 则获胜,否则失败。我认为这种方法会比第一种方法好很多,但我想有人会有更好的建议。

two options sprint to mind but neither of them seem very elegant.

1) brute force: do it like you would a password generator i.e. word1+word1+word1 > word1+word1+word2 > word1+word1+word3 etc etc etc

the trick there is the length so youd have to try all combinations of 2 or more words and you don't know where to set the limit. Very time consuming.

2) take the string in question and run a find in on it for every word you have 1 at a time. maybe check the length and if its greater than 0 do it again. keep doing it till you hit zero it cant find any more results. if you hit 0 its a win if not its a lose. I think this method would be a lot better than the first but I imagine someone will have a better suggestion.

故事和酒 2024-11-14 11:03:13

在我看来,一个问题可以通过简单的线性遍历数组和比较来解决。然而,可能存在多次通过。您可以设计一种策略来最大程度地减少传递次数。例如,在第一遍中构造原始字符串的所有子字符串的子数组。然后线性地尝试不同的变化。

It seems to me a problem can be solved by simple linearly traversing of array and comparison. However there could be multiple pass. You can devise a strategy to minimize the passes. For example constructing a sub array of all the substrings of the original string in first pass. Then try out different variations linearly.

司马昭之心 2024-11-14 11:03:13

这是一个应该可行的粗略想法。

  1. 将源字符串复制到新字符串中,
  2. 而复制字符串仍然有数据并且仍然有子字符串
    一个。获取一个子字符串,如果 copy.contains(substr) copy.remove(substr)
  3. 如果副本现在为空,那么是的,您可以构造字符串
  4. 如果副本不为空,则抛出从字符串中删除的第一个子字符串,然后重复。
  5. 如果所有子字符串都消失了并且 copy 仍然不为空,那么不,您无法构造它。

编辑:
可能改进这一点的一种方法是首先迭代所有子字符串并丢弃主字符串中未包含的任何子字符串。然后执行上述步骤。

Here is a rough idea that should work.

  1. Copy the source string into a new string
  2. While the copy string still has data and there are still substrings
    a. Grab a sub string, if copy.contains(substr) copy.remove(substr)
  3. If the copy is now empty then yes, you could construct the string
  4. If copy is not empty, throw out the first substr that was removed from the string and repeat.
  5. If all substrings are gone and copy is still not empty then no, you can't construct it.

Edit:
A way to possibly improve this would be to first iterate all of the substrings and throw out any that are not contained in the main string. Then go through the above steps.

好倦 2024-11-14 11:03:13

我建议使用后缀树(使用 Ukkonen 的在线算法来构建它),它似乎适合搜索两个文本中的公共子字符串。
您可以在维基百科/特殊来源中找到更多信息。
任务是

Find all z occurrences of the patterns P1..Pn of total length m
enter code hereas substrings in O(m + z) time.

让你看到存在非常酷的解决方案。希望这对你有用。
这实际上更适合重复扫描,而不是单次扫描。

Let me suggest using Suffix Trees (using Ukkonen's online algorithm to build it) which seems to be suitable in terms of searching common substrings in two texts.
You could find more information in wikipedia/special sources.
The task is

Find all z occurrences of the patterns P1..Pn of total length m
enter code hereas substrings in O(m + z) time.

so you see there exists very cool solution. Hope this will work for you.
This is actually more suitable for repeating scans, rather than a single scan.

尬尬 2024-11-14 11:03:13

如果每个子字符串只能使用一次,但不必使用所有子字符串...

对于大小等于原始字符串的子字符串中的每个大小为 N 的排列,请检查它,如果没有,则进行 N+1 的排列项,依此类推,直到穷尽所有排列。

当然,NP 完全,慢得要命,但我认为不存在正常的解决方案。

要解释为什么从原始字符串中删除子字符串的解决方案永远不起作用:

有一个字符串“1234123”和数组“12”,“34”,“123”。如果您从开头删除“123”,则会出现漏报。一个类似的例子,从末尾删除将是:“1234123”:“23,”41“,”123“。

使用贪婪回溯:(m字符串长度7,n num元素3)
- 最长:123
- 从第一次出现中删除它 O(3)
- 尝试其他两个:no go + O((n-1)*(m-3))
- 回溯 O(1)
- 从第二个中删除:O(m-3)
- 尝试其他两个 O((n-1)*m-3)
= O(30)

1 + 2 + 3 的排列 = O(3) + O(4) + O(6) = O(13)。
因此,对于小子集长度,排列实际上比回溯更快。如果您要求查找大量子字符串(在大多数情况下但不是全部),情况将会改变。

您可以仅从数组中删除不存在的子字符串,以将每个删除的不存在子字符串的排列数从 n^n 减少到 n^(n-1)。

If each substring must be used only once but not all of them must be used...

For each permutation of size N from the substrings that is equal in size to the original string check it, if none, do a permutation of N+1 items, end so forth, until you exhaust all the permutations.

Of course NP complete, slow as hell, but i think that no normal solutions exist.

To explain why the solutions where removing substrings from the original string won't ever work:

Have a string "1234123" and array "12","34","123". If you remove "123" from the start, you have a false negative. A similar example where removing from the end would be: "1234123" : "23,"41","123".

With backtracking with greedy: (m string length 7, n num elements 3)
- take the longest: 123
- remove it from first occurence O(3)
- try other two with the rest: no go + O((n-1)*(m-3))
- backtrack O(1)
- remove from second: O(m-3)
- try other two O((n-1)*m-3)
= O(30)

Permutations of 1 + 2 + 3 = O(3) + O(4) + O(6) = O(13).
So for small subset lenght permutations are actually faster than backtracking. This will change if you ask for a lot of substrings to find (in most cases but not all).

You can remove only the nonexisting substrings from the array to lower the number of permutations from n^n to n^(n-1) for each removed nonexisting substring.

愿得七秒忆 2024-11-14 11:03:13

您正在寻找的是一个解析器。解析器将检查某个单词是否属于某种语言。我不确定你的问题的确切计算复杂性。上面的一些似乎是正确的(根本不需要穷举搜索)。可以肯定的一件事是,它不是 NP 完全的。

你的语言的字母表将是所有的小子串。
您要查找的单词就是您拥有的字符串。
正则表达式可以是一个简单的 Kleene 星形,也可以是一个非常简单的上下文无关语法,只不过是 Or 的语法。

算法中的主要问题是:如果某些子字符串实际上是其他子字符串的子字符串……也就是说,如果我们有子字符串:“ab”,“abc”,“abcd”,...,怎么办?在这种情况下,检查子字符串的顺序将改变复杂性。为此,我们有 LR 解析器。我想他们是解决此类问题的最佳人选。

我会尽快为您找到确切的解决方案。

What you are looking for is a parser. A parser will check whether a certain word belongs to a certain language. I am not sure of the exact computattional complexity of your problem. Some of the above seems to be correct (there is no need at all for exhaustive search). One thing for sure, it s not NP-Complete.

The alphabet of your language would be all the small substrings.
The word you are looking for is the string you have.
A regular expression can be a simple Kleene star, or a a very simply context free grammar that is nothing but Or's.

The main issue in the algorithm is: what if the some of the substrings are actually substrings to other substrings ... that is, what if we have substrings: "ab", "abc", "abcd", ... , In this case, the order of checking the substrings will change the complexity. For this, we have LR-parsers. I guess they are the best in solving such problems.

I will find you the exact solution soon.

冬天旳寂寞 2024-11-14 11:03:13

我使用动态规划原理为此提出了一种线性方案。
函数 foo 接受一个字符串和一个子字符串列表来决定是否可以从给定列表构造给定字符串

foo = lambda s, arr:  not s or any(foo(s[len(n):], arr) for n in arr if s[:len(n)]==n)

为了可读性而分解的一个行:

def foo(s, arr):
if not s: return True
for sub in arr:
    if s[:len(sub)] == sub:
        if foo(s[len(sub):], arr):
            return True
return False

I have come up with a one liner for this using dynamic programming principles.
The function foo takes a string and a list of substrings to decide if it's possible to construct the given string from the given list

foo = lambda s, arr:  not s or any(foo(s[len(n):], arr) for n in arr if s[:len(n)]==n)

The one liner broken down for readability:

def foo(s, arr):
if not s: return True
for sub in arr:
    if s[:len(sub)] == sub:
        if foo(s[len(sub):], arr):
            return True
return False
九命猫 2024-11-14 11:03:13

这又如何呢?

substrings = ["a", "bb", "c"]

def checkString(string):
    for substring in substrings:
        if substring in string:
            string = string.replace(substring, "")
    if string == "":
        return True
    else:
        print(string)
        return False

它本质上只是遍历所有允许的子字符串,如果它们存在于字符串中,它将删除它们,然后检查结果字符串是否为 ""

What about this?

substrings = ["a", "bb", "c"]

def checkString(string):
    for substring in substrings:
        if substring in string:
            string = string.replace(substring, "")
    if string == "":
        return True
    else:
        print(string)
        return False

It esentially just goes through all the allowed substrings and if they're present in the string it removes them and then checks if the resulting string is "".

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文