Detect the most likely words from text without spaces / combined words
How could I detect and split words from a combined string?
Example:
"cdimage" -> ["cd", "image"]
"filesaveas" -> ["file", "save", "as"]
Comments (5)
Here's a dynamic programming solution (implemented as a memoized function). Given a dictionary of words with their frequencies, it splits the input text at the positions that give the overall most likely phrase. You'll have to find a real wordlist, but I included some made-up frequencies for a simple test.
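The answer's original code is not reproduced here. As a rough sketch of the approach it describes (a memoized function that picks the most likely split), something like the following could work. The `WORD_FREQ` table, the function names, and the use of summed log-probabilities as the score are all assumptions made for illustration, and the frequencies are made up.

```python
from functools import lru_cache
from math import log

# Made-up frequencies for illustration only; a real word list is needed in practice.
WORD_FREQ = {
    "cd": 50, "image": 100, "file": 200, "save": 150, "as": 500,
    "a": 1000, "i": 900, "mage": 5, "files": 40, "ave": 2,
}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def word_log_prob(word):
    """Log-probability of a known word, or -inf for an unknown one."""
    freq = WORD_FREQ.get(word)
    return log(freq / TOTAL) if freq else float("-inf")


def split_text(text):
    """Return (log_prob, words) for the most likely segmentation of text."""

    @lru_cache(maxsize=None)
    def best_from(start):
        # Best segmentation of text[start:], memoized on the start index.
        if start == len(text):
            return 0.0, ()
        best = (float("-inf"), ())
        last_end = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            head = text[start:end]
            head_score = word_log_prob(head)
            if head_score == float("-inf"):
                continue  # not a known word, skip this split point
            tail_score, tail_words = best_from(end)
            score = head_score + tail_score
            if score > best[0]:
                best = (score, (head,) + tail_words)
        return best

    score, words = best_from(0)
    return score, list(words)


print(split_text("cdimage")[1])     # ['cd', 'image']
print(split_text("filesaveas")[1])  # ['file', 'save', 'as']
```

If no segmentation exists for the input, this sketch returns a score of negative infinity and an empty word list. Very long inputs may need an iterative formulation to avoid hitting Python's recursion limit.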
I don't know of any library for it, but it shouldn't be hard to implement basic functionality.
words
Example:
I don't know a library that does this, but it's not too hard to write if you have a list of words:
This will return all possible ways to split the string into the given words.
Example:
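The code from this answer did not survive on this page. A minimal sketch of a function with the behavior described (every possible way to split the string into words from a given list) might look like this. The name `all_splits` and the sample word list are hypothetical.

```python
def all_splits(text, words):
    """Yield every way to split text into items from words, as lists."""
    if not text:
        # Nothing left to split: one valid (empty) segmentation.
        yield []
        return
    for word in words:
        # Skip empty strings so the recursion always consumes input.
        if word and text.startswith(word):
            for rest in all_splits(text[len(word):], words):
                yield [word] + rest


# Hypothetical word list, for illustration only.
words = ["cd", "image", "i", "mage", "file", "save", "as", "a"]

print(list(all_splits("cdimage", words)))
# [['cd', 'image'], ['cd', 'i', 'mage']]
print(list(all_splits("filesaveas", words)))
# [['file', 'save', 'as']]
```

Because it enumerates every segmentation, the running time can grow exponentially with the input length, so a sketch like this is only practical for short strings.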
See this example, although it is written in Scala.
It can split a sentence into words when the sentence contains no spaces in between.
Nonspaced-Sentence-Tokenizer
I know this question is marked for Python but I needed a JavaScript implementation. Going off of the previous answers I figured I'd share my code. Seems to work decently.
Note: "_dictionary" is expected to be an array of words sorted by frequency. I am using a wordlist from Project Gutenberg.
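The JavaScript code itself is not reproduced on this page. To illustrate how a frequency-sorted word array can drive the split, here is a rough sketch in Python (the question's language) rather than the author's JavaScript. It assumes a word's cost grows with its rank in the array, using `log(rank + 2)` as an arbitrary stand-in for a real frequency model. The function name `segment` and the sample list are hypothetical.

```python
from math import inf, log


def segment(text, dictionary):
    """Split text into words, preferring words that appear early in dictionary.

    dictionary is a list of words sorted from most to least frequent.
    """
    # Assumption: cost rises with rank, so more frequent words are cheaper.
    costs = {}
    for rank, word in enumerate(dictionary):
        if word:
            costs.setdefault(word, log(rank + 2))
    max_len = max(map(len, costs), default=0)

    # best[i] = (lowest total cost of text[:i], start index of its last word)
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + costs[text[j:i]], j)
            for j in range(max(0, i - max_len), i)
            if text[j:i] in costs and best[j][0] != inf
        ]
        best.append(min(candidates, default=(inf, -1)))

    if best[-1][0] == inf:
        return []  # no segmentation with the given dictionary

    # Walk the stored start indices backwards to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]


# Hypothetical frequency-sorted list, for illustration only.
_dictionary = ["a", "as", "i", "file", "save", "image", "cd", "mage"]

print(segment("cdimage", _dictionary))     # ['cd', 'image']
print(segment("filesaveas", _dictionary))  # ['file', 'save', 'as']
```

This version fills the table bottom-up instead of recursing, so input length is not limited by the recursion depth.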