We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
接受
或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
发布评论
评论(3)
我假设我们在这里讨论的不是微不足道的情况(即不仅仅是在空格周围分割字符串,因为这只是一个基本的分词器问题) - 相反,我们正在讨论那里的一些东西不是一个明确的单词分隔符,因此我们必须“猜测” string->words 的最佳匹配是什么 - 例如,一组没有空格的串联单词的情况,例如将其转换
为
:在这种情况下,动态编程方法可能会计算出一个表,其中一个维度对应于序列中的第 M 个单词,另一个维度对应于每个 <输入字符串中的第 n 个字符。那么您为表中每个方格填写的值就是“如果我们在位置
NM
个单词,我们可以获得的最佳匹配分数代码>.I'm going to assume that we're not talking about the trivial case here (i.e. not just splitting a string around spaces, since that'd just be a basic tokenizer problem) - but instead, we're talking about something were there isn't a clear word delimiter character, and thus we're having to "guess" what the best match for string->words would be - for instance, the case of a set of concatenated words w/o spaces, such as transforming this:
into this:
In this case, the dynamic programming approach would probably be to calculate out a table where one dimension corresponds to the
M
th word in the sequence, and the other dimension corresponds to eachN
th character in the input string. Then the value that you fill in for each square of the table is "the best match score we can get if we end (or instead, begin) theM
th word at positionN
.Python Wordsegment 模块有这样的算法。它使用递归和记忆来实现动态编程。
源代码可在 Github 上找到,以下是相关片段:
注意如何< code>memo 缓存对
search
的调用,而search
又从candidates
中选择最大值。The Python wordsegment module has such an algorithm. It uses recursion and memoization to implement dynamic programming.
The source is available on Github, here's the relevant snippet:
Note how
memo
caches calls tosearch
which in turn selects the max fromcandidates
.这是迭代风格的以下解决方案(主要思想是将问题分解为:在输入的特定范围内找到恰好具有 1,2,3..n 个分段单词的分段。如果有任何小的索引错误,请原谅,这些天我的头脑很模糊,但这对你来说是一个迭代的解决方案。):
Here is the followings solution in iterative style (The main idea is breaking up problem into: finding segmentation having exactly 1,2,3..n segmented words within a certain range of the input. Excuse me if there are any minor indexing mistakes, my head is very fuzzy these days. But this is an iterative solution for you.):