How do I break a given text into words found in a dictionary?

Posted on 2024-12-26 07:20:23


This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break text into substrings such that each substring is found in the dictionary?

For example, you can break "thisisatext" into ["this", "is", "a", "text"] using /usr/share/dict/words.

I believe backtracking can solve this problem (in pseudo-Java):

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}

List<String> solution = new List<String>()
solve(text, dict, solution)

Does it make sense? Would you optimize the step of searching the prefixes in the dictionary? What data structures would you recommend?
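As a sanity check, here is a runnable Python sketch of the same backtracking idea (Python stands in for the pseudo-Java; note that the base case records the accumulated words, a step the pseudo-code leaves implicit):

```python
def solve(s, dictionary, prefix_words, solutions):
    """Collect every way to split s into dictionary words."""
    if not s:
        # The whole string has been consumed: record this split.
        solutions.append(list(prefix_words))
        return
    # Try every prefix of s that is a dictionary word.
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in dictionary:
            solve(s[i:], dictionary, prefix_words + [prefix], solutions)

dictionary = {"this", "is", "a", "text", "ate"}
solutions = []
solve("thisisatext", dictionary, [], solutions)
print(solutions)
```

Without memoization this can revisit the same suffix many times, which is what the answers below address.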


Comments (4)

小瓶盖 2025-01-02 07:20:24


This solution assumes a trie built over the dictionary. Further, for each node in the trie, it assumes the following functions:

  1. node.IsWord(): returns true if the path to that node spells a word.
  2. node.IsChild(char x): returns true if there exists a child with label x.
  3. node.GetChild(char x): returns the child node with label x.

Function annotate(String str, int start, int end, int root[], TrieNode node):
i = start
while i <= end:
    if node.IsChild(str[i]):
        node = node.GetChild(str[i])
        if node.IsWord():
            root[i+1] = start
        i += 1
    else:
        break

end = len(str) - 1
root = [-1 for i in range(len(str) + 1)]
for start in range(0, end + 1):
    if start == 0 or root[start] >= 0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

I will leave it to you to recover the words that make up the string by traversing root backwards.

Time complexity is O(nk) where n is the length of the string and k is the length of the longest word in the dictionary.

PS: I am assuming following words in the dictionary: this, is, a, text, ate.
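A runnable Python sketch of this annotation scheme, using a nested-dict trie (the build_trie/split_words helper names and the "$" end marker are my own, not from the answer):

```python
END = "$"  # marker key for end-of-word in the nested-dict trie

def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return trie

def annotate(s, start, root, trie):
    """Walk the trie from position start, marking where words end."""
    node = trie
    for i in range(start, len(s)):
        if s[i] not in node:
            break
        node = node[s[i]]
        if END in node:
            root[i + 1] = start  # a dictionary word spans s[start:i+1]

def split_words(s, words):
    trie = build_trie(words)
    root = [-1] * (len(s) + 1)
    for start in range(len(s)):
        # Only annotate from positions reachable as a word boundary.
        if start == 0 or root[start] >= 0:
            annotate(s, start, root, trie)
    if root[len(s)] < 0:
        return None  # no partition exists
    # Recover the words by walking root backwards from the end.
    result, i = [], len(s)
    while i > 0:
        j = root[i]
        result.append(s[j:i])
        i = j
    return result[::-1]

print(split_words("thisisatext", {"this", "is", "a", "text", "ate"}))
```

On the example above, traversing root backwards from index 11 gives root[11] = 7 ("text"), root[7] = 6 ("a"), root[6] = 4 ("is"), root[4] = 0 ("this").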

缺⑴份安定 2025-01-02 07:20:24


There is a very thorough writeup of the solution to this problem in a blog post.

The basic idea is just to memoize the function you've written and you'll have an O(n^2) time, O(n) space algorithm.
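A minimal memoized sketch of that idea in Python, assuming O(1) dictionary lookups (functools.lru_cache does the memoization; the word_break name is my own):

```python
from functools import lru_cache

def word_break(s, dictionary):
    """Return one split of s into dictionary words, or None if impossible."""
    @lru_cache(maxsize=None)
    def split(pos):
        # Memoized: each suffix position is solved at most once.
        if pos == len(s):
            return []
        for end in range(pos + 1, len(s) + 1):
            if s[pos:end] in dictionary:
                rest = split(end)
                if rest is not None:
                    return [s[pos:end]] + rest
        return None
    return split(0)

print(word_break("thisisatext", frozenset({"this", "is", "a", "text", "ate"})))
```

Each of the n suffix positions scans at most n candidate prefixes once, giving the O(n^2) time and O(n) space the answer states.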

埋情葬爱 2025-01-02 07:20:24


Approach 1 -
A trie looks like a close fit here. Build a trie of the words in the English dictionary; building it is a one-time cost. Once the trie is built, your string can be compared letter by letter. If at any point you reach a node in the trie that ends a word, you can assume you found a word: add it to a list and move on with your traversal. Continue until you reach the end of your string. The list is the output.

Time Complexity for search - O(word_length).

Space Complexity - O(charsize * word_length * no_words). Size of your dictionary.

Approach 2 - I have heard of suffix trees but have never used them; they might be useful here.

Approach 3 - is more pedantic and a lousy alternative; you have already suggested it.

You could try the other way around: run through the dict and check for substring matches. Here I am assuming the keys in dict are the words of the English dictionary /usr/share/dict/words. So the pseudo code looks something like this -

(list) splitIntoWords(String str, dict d)
{
    words = []
    for (word in d)
    {
        if word in str
            words.append(word);
    }
    return words;
}

Complexity - O(n) running through entire dict + O(1) for substring match.

Space - worst case O(n) if len(words) == len(dict)

As others have pointed out, this does require backtracking.
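For completeness, a runnable version of that dict-scan sketch. Note that it only collects dictionary words occurring somewhere in the string; it does not arrange them into a partition, which is why, as noted, backtracking is still required:

```python
def split_into_words(s, dictionary):
    # Collect every dictionary word that occurs as a substring of s.
    # This does NOT produce an ordered partition of s.
    return [word for word in dictionary if word in s]

found = split_into_words("thisisatext", {"this", "is", "a", "text", "ate", "zebra"})
print(sorted(found))
```

Here "ate" is reported even though no valid split of "thisisatext" uses it, and "zebra" is correctly skipped.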

多像笑话 2025-01-02 07:20:24


You can solve this problem using Dynamic Programming and Hashing.

Calculate the hash of every word in the dictionary. Use the hash function you like the most. I would use something like (a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.

The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.
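A small Python sketch of the constant-time hash extension described above (the base B and prime P are illustrative choices, not values prescribed by the answer):

```python
B = 131            # polynomial base (illustrative choice)
P = 1_000_000_007  # large prime modulus (illustrative choice)

def poly_hash(s):
    # (a1 * B^(n-1) + a2 * B^(n-2) + ... + an * B^0) % P
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h

# Extending the hash of "tex" by one character yields the hash of "text"
# in constant time, without rescanning the string.
h = poly_hash("tex")
h = (h * B + ord("t")) % P
print(h == poly_hash("text"))
```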

Then, use a DP function like this:

   bool vis[LENGTH_OF_STRING];
   bool go(char str[], int length, int position)
   {
      int i;

      // You found a set of words that can solve your task.
      if (position == length) {
          return true;
      }

      // You have already visited this position. You had no luck before, and you won't have luck this time.
      if (vis[position]) {
         return false;
      }
      // Mark this position as visited.
      vis[position] = true;

      // A possible improvement is to stop this loop when the length of substring(position, i) exceeds the length of the longest word in the dictionary.
      for (i = position; i < length; i++) {
         // Calculate the hash value of the substring str(position, i);
         if (hashValue is in dict) {  // pseudocode: dictionary lookup by hash
            // The substring str(i + 1, length) may partition into dictionary words.
            if (go(str, length, i + 1)) {
               // Use the corresponding word for hashValue at this position; a partition of str(position, length) exists.
               return true;
            }
         }
      }

      return false;
   }

The complexity of this algorithm is O(N * M), where N is the length of the string and M is the length of the longest word in the dictionary, or O(N ^ 2), depending on whether you coded the improvement.

So the total complexity of the algorithm is O(N1 * M) + O(N2 * M) (or O(N2 ^ 2)), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary, and N2 is the length of the string.

If you can't think of a good hash function (one without collisions), another possible solution is to use a trie or a Patricia trie (if the size of the normal trie is very large); I couldn't post links for these topics because my reputation is not high enough to post more than 2 links. But if you use this, the complexity of your algorithm will be O(N * M) * O(time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.

I hope it helps, and I apologize for my poor English.
