将字符串拆分为单词
我正在寻找最有效的算法来从字符串中形成所有可能的单词组合。例如:(
Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
所有单词都应来自字典)。
我可以想到一种暴力方法。 (找到所有可能的子字符串并匹配)但是什么是更好的方法呢?
I am looking for the most efficient algorithm to form all possible combinations of words from a string. For example:
Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
(All words should be from a dictionary).
I can think of a brute force approach. (find all possible substrings and match) but what would be better ways?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
使用前缀树作为已知单词列表。可能像
myspell
这样的库已经这样做了。尝试使用现成的。一旦找到匹配项(例如“car”),就分割计算:一个分支开始寻找新单词(“rot”),另一个分支继续探索当前开头的变体(“carrot”)。
实际上,每次分割计算时,您都会在字符串中维护一个偏移量对
(start_position, current_position)
的队列。多个线程可以并行地从此队列中弹出,并尝试继续从start_position
开始且已知到该对的current_position
的单词,但不会在那里结束。当找到一个单词时,就会报告该单词,并从队列中弹出另一对单词。当不可能时,不会产生任何结果。当分裂发生时,一个新的对被添加到队列的末尾。最初,队列包含一个(0,0)
。Use a prefix tree for your list of known words. Probably libs like
myspell
already do so. Try using a ready-made one.Once you found a match (e.g. 'car'), split your computation: one branch starts to look for a new word ('rot'), another continues to explore variants of current beginning ('carrot').
Effectively you maintain a queue of pairs
(start_position, current_position)
of offsets into your string every time you split the computation. Several threads can pop from this queue in parallel and try to continue a word that starts fromstart_position
and is already known up tocurrent_position
of the pair, but does not end there. When a word is found, it is reported and another pair is popped from the queue. When it's impossible, no result is generated. When a split occurs, a new pair is added to the end of the queue. Initially the queue contains a(0,0)
.请参阅这个问题,它有更好的答案。这是一个标准的动态规划问题:
如何将字符串拆分为单词。例如:“stringintowords”-> “串成文字”?
See this question which has even better answers. It's a standard dynamic programming problem:
How to split a string into words. Ex: "stringintowords" -> "String Into Words"?
伪代码实现,利用字符串的每个部分都需要是单词的事实,我们不能跳过任何内容。我们从字符串的开头向前工作,直到第一位是单词,然后生成字符串其余部分的所有可能组合。一旦我们做到了这一点,我们就会继续下去,直到我们找到第一个单词的任何其他可能性,依此类推。
这段代码中的问题是您最终将重复计算 - 在您的示例中,您最终必须计算
allPossibleWords("carrot")
两次 - 一次在["forever ", allPossibleWords["carrot"]]
和一次["for", "ever", allPossibleWords["carrot"]]
。所以记住这一点是需要考虑的事情。A psuedocode implementation, exploiting the fact that every part of the string needs to be a word, we can't skip anything. We work forward from the start of the string until the first bit is a word, and then generate all possible combinations of the rest of the string. Once we've done that, we keep going along until we find any other possibilities for the first word, and so on.
The bugbear in this code is that you'll end up repeating calculations - in your example, you'll end up having to calculate
allPossibleWords("carrot")
twice - once in["forever", allPossibleWords["carrot"]]
and once in["for", "ever", allPossibleWords["carrot"]]
. So memoizing this is something to consider.输入字符串:forevercarrot
输出:
forevercarrot
永远的车腐烂
永远的胡萝卜
永远的汽车腐烂
计划:
Input String: forevercarrot
Output:
forever carrot
forever car rot
for ever carrot
for ever car rot
program :