当前位置：文江博客话题详情

检查字符串是否是从子字符串列表构建的算法

发布于 2024-11-07 11:03:13 字数 1013 浏览 4 评论 0原文

给你一个字符串和一个字符串数组。如何快速检查该字符串是否可以通过连接数组中的某些字符串来构建？

这是一个理论问题，出于实际原因我不需要它。但我想知道是否有一些好的算法。

编辑阅读一些答案我注意到，这可能是 NP 完全问题。即使找到字符串的子集，它们的长度也相同，因为给定的字符串是一个经典的子集和问题。

所以我想这个问题没有简单的答案。

编辑

现在看来，这毕竟不是一个 NP 完全问题。这太酷了:-)

编辑

我想出了一个通过一些测试的解决方案：

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

我相信时间是O(n * m^3)，其中n 是substrings 的长度，m 是string 的长度。你怎么认为？

原文

You are given a string and an array of strings. How to quickly check, if this string can be built by concatenating some of the strings in the array?

This is a theoretical question, I don't need it for practical reasons. But I would like to know, if there is some good algorithm for this.

EDIT
Reading some answer I have noticed, that this is probably NP-Complete problem. Even finding a subset of strings, that will together have same length, as a given string is a classic subset sum problem.

So I guess there is no easy answer to this.

EDIT

Now it seems, that it is not a NP-Complete problem after all. That's way cooler :-)

EDIT

I have came up with a solution that passes some tests:

def can_build_from_substrings(string, substrings):
    prefixes = [True] + [False] * (len(string) - 1)
    while True:
        old = list(prefixes)
        for s in substrings:
            for index, is_set in enumerate(prefixes):
                if is_set and string[index:].startswith(s):
                    if string[index:] == s:
                        return True
                    prefixes[index + len(s)] = True
        if old == prefixes: # nothing has changed in this iteration
            return False

I believe the time is O(n * m^3), where n is length of substrings and m is length of string. What do you think?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

掀纱窥君容 2024-11-14 11:03:13

注意：我假设您可以多次使用每个子字符串。您可以通过更改我们定义子问题的方式来概括解决方案以包含此限制。这将对空间和预期运行时间产生负面影响，但问题仍然是多项式。

这是一个动态规划问题。（这是一个很好的问题！）

如果可以使用子字符串列表 W 编写字符串 S，那么我们将 composable(S, W) 定义为 true代码>.

S 是可组合的当且仅当：

S 以 W 中的子字符串 w 开头。
w 之后的 S 的其余部分也是可组合的。

让我们写一些伪代码：

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

该算法的运行时间为 O(m*n)，假设子字符串的长度与字符串本身不是线性的，在这种情况下，运行时间将为 O(m*n^2) (其中 m 是子字符串列表的大小，n 是相关字符串的长度）。它使用 O(n) 空间进行记忆。

（注意，伪代码使用 O(n^2) 空间，但对记忆键进行散列可以缓解这种情况。）

编辑

这是一个有效的 Ruby 实现：

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end

Note: I assume here that you can use each substring more than once. You can generalize the solution to include this restriction by changing how we define subproblems. That will have a negative impact on space as well as expected runtime, but the problem remains polynomial.

This is a dynamic programming problem. (And a great question!)

Let's define composable(S, W) to be true if the string S can be written using the list of substrings W.

S is composable if and only if:

S starts with a substring w in W.
The remainder of S after w is also composable.

Let's write some pseudocode:

COMPOSABLE(S, W):
  return TRUE if S = "" # Base case
  return memo[S] if memo[S]

  memo[S] = false

  for w in W:
    length <- LENGTH(w)
    start  <- S[1..length]
    rest   <- S[length+1..-1]
    if start = w AND COMPOSABLE(rest, W) :
      memo[S] = true # Memoize

  return memo[S]

This algorithm has O(m*n) runtime, assuming the length of the substrings is not linear w/r/t to the string itself, in which case runtime would be O(m*n^2) (where m is the size of the substring list and n is the length of the string in question). It uses O(n) space for memoization.

(N.B. as written the pseudocode uses O(n^2) space, but hashing the memoization keys would alleviate this.)

EDIT

Here is a working Ruby implementation:

def composable(str, words)
  composable_aux(str, words, {})
end

def composable_aux(str, words, memo)
  return true if str == ""                # The base case
  return memo[str] unless memo[str].nil?  # Return the answer if we already know it

  memo[str] = false              # Assume the answer is `false`

  words.each do |word|           # For each word in the list:
    length = word.length
    start  = str[0..length-1]
    rest   = str[length..-1]

    # If the test string starts with this word,
    # and the remaining part of the test string
    # is also composable, the answer is true.
    if start == word and composable_aux(rest, words, memo)
      memo[str] = true           # Mark the answer as true
    end
  end

  memo[str]                      # Return the answer
end

回复收藏 0 原文

岁吢 2024-11-14 11:03:13

这绝对不是很快，但你有一个想法：

迭代所有字符串，检查目标字符串是否以其中任何字符串“开头”
获取目标字符串开头的最长字符串，将其从列表中删除并从主字符串中修剪它string
冲洗，重复

当剩下 0 长度的目标字符串时停止。

正如我之前所说，这绝对不是很快，但应该给你一个基线（“它不应该比这更糟糕”）。

编辑

正如评论中指出的，这是行不通的。您将必须存储部分匹配项，并在发现没有进一步的方法时依靠它们。

当发现某个字符串是目标的头部时，将其推送到列表中。建立列表后，你自然会尝试目标中最大的“头”，
当你发现你尝试过的头与剩下的不符时，尝试下一个最好的头，

这样，最终你将探索整个空间解决方案。对于每个候选头，您将尝试所有可能的尾部。

回复收藏 0 原文

止于盛夏 2024-11-14 11:03:13

我就是这样做的。

确定目标字符串的长度。
确定子字符串数组中每个字符串的长度
确定哪种子字符串组合会产生与目标字符串长度相同的字符串（如果有，如果没有则完成）
生成步骤 3 中确定的子字符串组合的所有排列。检查其中是否有任何与目标字符串匹配。

生成所有排列是一项处理器繁重的任务，因此如果您可以减少“n”（输入大小），您将获得相当大的效率。

回复收藏 0 原文

弱骨蛰伏 2024-11-14 11:03:13

受到@cnicutars答案的启发：

函数Possible(array A, string s)
- 如果 s 为空，则返回 true。
- 计算 A 中以 s 为前缀的所有字符串的数组 P。
- 如果P为空，则返回 false。
- 对于 P 中的每个字符串 p：
  - 如果可能（A 删除了 p，s 删除了前缀 p） 返回 true
- 返回错误

回复收藏 0 原文

梦醒时光 2024-11-14 11:03:13

脑海中浮现出两个选项，但它们似乎都不是很优雅。

1) 暴力破解：像密码生成器一样操作，即 word1+word1+word1 >单词1+单词1+单词2> word1+word1+word3 等等等等，

技巧是长度，所以你必须尝试 2 个或更多单词的所有组合，而且你不知道在哪里设置限制。非常耗时。

2) 获取有问题的字符串，并针对一次有 1 个单词的每个单词运行查找。也许检查长度，如果它大于 0，请再做一次。继续这样做，直到你达到零为止，它无法找到更多结果。如果你击中 0 则获胜，否则失败。我认为这种方法会比第一种方法好很多，但我想有人会有更好的建议。

回复收藏 0 原文

故事和酒 2024-11-14 11:03:13

在我看来，一个问题可以通过简单的线性遍历数组和比较来解决。然而，可能存在多次通过。您可以设计一种策略来最大程度地减少传递次数。例如，在第一遍中构造原始字符串的所有子字符串的子数组。然后线性地尝试不同的变化。

回复收藏 0 原文

司马昭之心 2024-11-14 11:03:13

这是一个应该可行的粗略想法。

将源字符串复制到新字符串中，
而复制字符串仍然有数据并且仍然有子字符串
一个。获取一个子字符串，如果 copy.contains(substr) copy.remove(substr)
如果副本现在为空，那么是的，您可以构造字符串
如果副本不为空，则抛出从字符串中删除的第一个子字符串，然后重复。
如果所有子字符串都消失了并且 copy 仍然不为空，那么不，您无法构造它。

编辑：
可能改进这一点的一种方法是首先迭代所有子字符串并丢弃主字符串中未包含的任何子字符串。然后执行上述步骤。

回复收藏 0 原文

好倦 2024-11-14 11:03:13

我建议使用后缀树（使用 Ukkonen 的在线算法来构建它），它似乎适合搜索两个文本中的公共子字符串。
您可以在维基百科/特殊来源中找到更多信息。
任务是

Find all z occurrences of the patterns P1..Pn of total length m
enter code hereas substrings in O(m + z) time.

让你看到存在非常酷的解决方案。希望这对你有用。
这实际上更适合重复扫描，而不是单次扫描。

Let me suggest using Suffix Trees (using Ukkonen's online algorithm to build it) which seems to be suitable in terms of searching common substrings in two texts.
You could find more information in wikipedia/special sources.
The task is

Find all z occurrences of the patterns P1..Pn of total length m
enter code hereas substrings in O(m + z) time.

so you see there exists very cool solution. Hope this will work for you.
This is actually more suitable for repeating scans, rather than a single scan.

回复收藏 0 原文

尬尬 2024-11-14 11:03:13

如果每个子字符串只能使用一次，但不必使用所有子字符串...

对于大小等于原始字符串的子字符串中的每个大小为 N 的排列，请检查它，如果没有，则进行 N+1 的排列项，依此类推，直到穷尽所有排列。

当然，NP 完全，慢得要命，但我认为不存在正常的解决方案。

要解释为什么从原始字符串中删除子字符串的解决方案永远不起作用：

有一个字符串“1234123”和数组“12”，“34”，“123”。如果您从开头删除“123”，则会出现漏报。一个类似的例子，从末尾删除将是：“1234123”：“23，”41“，”123“。

使用贪婪回溯：（m字符串长度7，n num元素3）
- 最长：123
- 从第一次出现中删除它 O(3)
- 尝试其他两个：no go + O((n-1)*(m-3))
- 回溯 O(1)
- 从第二个中删除：O(m-3)
- 尝试其他两个 O((n-1)*m-3)
= O(30)

1 + 2 + 3 的排列 = O(3) + O(4) + O(6) = O(13)。
因此，对于小子集长度，排列实际上比回溯更快。如果您要求查找大量子字符串（在大多数情况下但不是全部），情况将会改变。

您可以仅从数组中删除不存在的子字符串，以将每个删除的不存在子字符串的排列数从 n^n 减少到 n^(n-1)。

回复收藏 0 原文

愿得七秒忆 2024-11-14 11:03:13

您正在寻找的是一个解析器。解析器将检查某个单词是否属于某种语言。我不确定你的问题的确切计算复杂性。上面的一些似乎是正确的（根本不需要穷举搜索）。可以肯定的一件事是，它不是 NP 完全的。

你的语言的字母表将是所有的小子串。
您要查找的单词就是您拥有的字符串。
正则表达式可以是一个简单的 Kleene 星形，也可以是一个非常简单的上下文无关语法，只不过是 Or 的语法。

算法中的主要问题是：如果某些子字符串实际上是其他子字符串的子字符串……也就是说，如果我们有子字符串：“ab”，“abc”，“abcd”，...，怎么办？在这种情况下，检查子字符串的顺序将改变复杂性。为此，我们有 LR 解析器。我想他们是解决此类问题的最佳人选。

我会尽快为您找到确切的解决方案。

回复收藏 0 原文

冬天旳寂寞 2024-11-14 11:03:13

我使用动态规划原理为此提出了一种线性方案。
函数 foo 接受一个字符串和一个子字符串列表来决定是否可以从给定列表构造给定字符串

foo = lambda s, arr:  not s or any(foo(s[len(n):], arr) for n in arr if s[:len(n)]==n)

为了可读性而分解的一个行：

def foo(s, arr):
if not s: return True
for sub in arr:
    if s[:len(sub)] == sub:
        if foo(s[len(sub):], arr):
            return True
return False

I have come up with a one liner for this using dynamic programming principles.
The function foo takes a string and a list of substrings to decide if it's possible to construct the given string from the given list

foo = lambda s, arr:  not s or any(foo(s[len(n):], arr) for n in arr if s[:len(n)]==n)

The one liner broken down for readability:

def foo(s, arr):
if not s: return True
for sub in arr:
    if s[:len(sub)] == sub:
        if foo(s[len(sub):], arr):
            return True
return False

回复收藏 0 原文

九命猫 2024-11-14 11:03:13

这又如何呢？

substrings = ["a", "bb", "c"]

def checkString(string):
    for substring in substrings:
        if substring in string:
            string = string.replace(substring, "")
    if string == "":
        return True
    else:
        print(string)
        return False

它本质上只是遍历所有允许的子字符串，如果它们存在于字符串中，它将删除它们，然后检查结果字符串是否为 ""。

What about this?

substrings = ["a", "bb", "c"]

def checkString(string):
    for substring in substrings:
        if substring in string:
            string = string.replace(substring, "")
    if string == "":
        return True
    else:
        print(string)
        return False

It esentially just goes through all the allowed substrings and if they're present in the string it removes them and then checks if the resulting string is "".

回复收藏 0 原文

~没有更多了~