如何将给定文本分解为字典中的单词?
这是一道面试题。假设您有一个字符串 text
和一个 dictionary
(一组字符串)。如何将文本
分解为子字符串,以便在字典
中找到每个子字符串。
例如,您可以使用 /usr/share/ 将
。"thisisatext"
分解为 ["this", "is", "a", "text"]
字典/单词
我相信回溯可以解决这个问题(在伪Java中):
void solve(String s, Set<String> dict, List<String> solution) { if (s.length == 0) return for each prefix of s found in dict solve(s without prefix, dict, solution + prefix) } List<String> solution = new List<String>() solve(text, dict, solution)
这有意义吗?您会优化在字典中搜索前缀的步骤吗?您会推荐哪些数据结构?
This is an interview question. Suppose you have a string text
and a dictionary
(a set of strings). How do you break down text
into substrings such that each substring is found in the dictionary
.
For example you can break down "thisisatext"
into ["this", "is", "a", "text"]
using /usr/share/dict/words
.
I believe backtracking can solve this problem (in pseudo-Java):
void solve(String s, Set<String> dict, List<String> solution) { if (s.length == 0) return for each prefix of s found in dict solve(s without prefix, dict, solution + prefix) } List<String> solution = new List<String>() solve(text, dict, solution)
Does it make sense? Would you optimize the step of searching the prefixes in the dictionary? What data structures would you recommend?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
该解决方案假设字典存在 Trie 数据结构。此外,对于 Trie 中的每个节点,假设以下函数:
我将把通过反向遍历根来列出组成字符串的单词的部分留给您。
时间复杂度为 O(nk),其中 n 是字符串的长度,k 是字典中最长单词的长度。
PS:我假设字典中有以下单词:this,is,a,text,ate。
This solution assumes the existence of Trie data structure for the dictionary. Further, for each node in Trie, assumes the following functions:
I will leave the part for you to list the words that make up the string by reverse traversing the root.
Time complexity is O(nk) where n is the length of the string and k is the length of the longest word in the dictionary.
PS: I am assuming following words in the dictionary: this, is, a, text, ate.
中有关于此问题解决方案的非常详尽的文章博客文章。
基本思想只是记住您编写的函数,然后您将拥有 O(n^2) 时间、O(n) 空间算法。
There is a very thorough writeup for the solution to this problem in this blog post.
The basic idea is just to memoize the function you've written and you'll have an O(n^2) time, O(n) space algorithm.
方法 1-
Trie 看起来非常适合这里。生成英语词典中单词的 trie。这个特里结构的构建是一次性成本。构建特里树后,您的
字符串
就可以轻松地逐个字母进行比较。如果在任何时候你在特里树中遇到一片叶子,你可以假设你找到了一个单词,将其添加到列表中并添加到列表中。继续你的旅行。进行遍历,直到到达字符串
的末尾。列表被输出。搜索的时间复杂度 - O(word_length)。
空间复杂度 - O(charsize * word_length * no_words)。字典的大小。
方法 2 - 我听说过后缀树,但从未使用过它们它在这里可能有用。
方法 3 - 更加迂腐且简单。一个糟糕的选择。你已经建议过这个。
你可以试试相反的方法。运行 dict 来检查子字符串匹配。这里我假设
dict
中的键是英语词典/usr/share/dict/words
的words
。所以伪代码看起来像这样 -复杂性 - O(n) 运行整个字典 + O(1) 进行子字符串匹配。
空间 - 最坏情况 O(n) if len(words) == len(dict)
正如其他人指出的那样,这确实需要回溯。
Approach 1-
Trie looks to be a close fit here. Generate trie of the words in english dictionary. This trie building is one time cost. After trie is built then your
string
can be easily compared letter by letter. if at any point you encounter a leaf in the trie you can assume you found a word, add this to a list & move on with your traversal. Do the traversal till you have reached the end of yourstring
. The list is output.Time Complexity for search - O(word_length).
Space Complexity - O(charsize * word_length * no_words). Size of your dictionary.
Approach 2 - I have heard of Suffix Trees, never used them but it might be useful here.
Approach 3 - is more pedantic & a lousy alternative. you have already suggested this.
You could try the other way around. Run through the
dict
is check for sub-string match. Here I am assuming the keys indict
are thewords
of the english dictionary/usr/share/dict/words
. So psuedo code looks something like this -Complexity - O(n) running through entire dict + O(1) for substring match.
Space - worst case O(n) if
len(words) == len(dict)
As others have pointed out, this does require backtracking.
您可以使用动态编程和哈希。
计算字典中每个单词的哈希值。使用您最喜欢的哈希函数。我会使用 (a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P 之类的东西,其中 a1a2...an 是一个字符串,n是字符串的长度,B 是多项式的底,P 是一个大素数。如果你有字符串 a1a2...an 的哈希值,你可以在常数时间内计算字符串 a1a2...ana(n+1) 的哈希值: (hashValue(a1a2...an) * B + a (n+1)) % P。
这部分的复杂度是O(N * M),其中N是字典中的单词数,M是字典中最长单词的长度。
然后,使用这样的 DP 函数:
该算法的复杂度为 O(N * M),其中 N 是字符串的长度,M 是字典中最长单词的长度或 O(N ^ 2),取决于您是否对改进进行了编码。
因此算法的总复杂度为:O(N1 * M) + O(N2 * M)(或 O(N2 ^ 2)),其中 N1 是字典中单词的数量,M 是单词的长度字典中最长的单词,N2 是字符串的长度)。
如果您想不出一个好的哈希函数(不存在任何冲突),其他可能的解决方案是使用 Tries 或 Patricia trie(如果普通 trie 的大小非常大)(我无法发布链接对于这些主题,因为我的声誉不够高,无法发布超过 2 个链接)。但是当你使用这个时,你的算法的复杂度将是 O(N * M) * O(在 trie 中找到一个单词所需的时间),其中 N 是字符串的长度,M 是最长单词的长度在字典里。
我希望它能有所帮助,对于我糟糕的英语,我深表歉意。
You can solve this problem using Dynamic Programming and Hashing.
Calculate the hash of every word in the dictionary. Use the hash function you like the most. I would use something like (a1 * B ^ (n - 1) + a2 * B ^ (n - 2) + ... + an * B ^ 0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.
The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.
Then, use a DP function like this:
The complexity of this algorithm is O(N * M), where N is the length of the string and M is the length of the longest word in the dictionary or O(N ^ 2), depending if you coded the improvement or not.
So the total complexity of the algorithm will be: O(N1 * M) + O(N2 * M) (or O(N2 ^ 2)), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary and N2 is the lenght of the string).
If you can't think of a nice hash function (where there are not any collision), other possible solution is to use Tries or a Patricia trie (if the size of the normal trie is very large) (I couldn't post links for these topics because my reputation is not high enough to post more than 2 links). But in you use this, the complexity of your algorithm will be O(N * M) * O(Time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.
I hope it helps, and I apologize for my poor english.