最有效的单词划分算法?
我一直在寻找一种有效的单词划分算法,但没有取得太大成功。例如,给定单词 hello,我想获取该单词的所有可能分区:{h,e,l,l,o},{h,e,l,lo},{h,e,llo},。 ..,{你好}。我发现的所有内容都在谈论分词,但这不是我的意思。
先感谢您!
I've been looking for an efficient word partitioning algorithm but without much success. For example, given the word hello I want to obtain all the possible partitions of that word: {h,e,l,l,o},{h,e,l,lo},{h,e,llo},...,{hello}. Everything I found talks about word splitting which isn't what I mean.
Thank you in advance!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您展示了一些示例,我们可以在其中集中讨论逗号。
要么有逗号,要么没有。
所以很明显,在 4 个位置上,可能有逗号,也可能没有,彼此独立。你需要 4 位来编码分区,这是 2^4 种可能性,我猜是 16。
所以你可以形成一个循环:
并在迭代 i 的二进制表示的位的同时迭代你的单词。例如,对于 11,您设置了位:8+2+1 = 1011。这意味着 {h,el,l,o}。
You show some examples, where we can concentrate on the commas.
Either there is a comma or not.
So it seems obvious, that at 4 positions, there may be a comma or not, independently from each other. You need 4 Bit to encode the partitions, which is 2^4 possibilities, I guess that is 16.
So you can form a loop:
and iterate through your word while iterating over the bits of the binary representation of i. For example for 11, you have the bits: 8+2+1 = 1011 set. That means {h,el,l,o}.
该问题是NP完全问题,需要通过回溯来解决。
这个想法是在每个级别,您决定该角色是否属于当前分区或应该转到新分区。递归地执行此操作,每次到达单词末尾时,就会有一个分区。
The problem is NP complete and needs to be solved by backtracking.
The idea is at each level, you decide whether this character belongs to current partition or should go to a new one. Do this recursively and every time you reach end of the word, you have one partition.
您很可能想构建一个后缀特里树。
Most likey you want to construct a suffix-trie.