句子的词级编辑距离
是否有一种算法可以让您找到两个句子之间的单词级编辑距离? 例如,“A Big Fat Dog”和“The Big House with the Fat Dog”有 1 个替补,3 个插入
Is there an algorithm that lets you find the word-level edit distance between 2 sentences?
For eg., "A Big Fat Dog" and "The Big House with the Fat Dog" have 1 substitute, 3 insertions
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
一般来说,这称为序列比对问题。实际上,对齐什么实体(位、字符、单词或 DNA 碱基)并不重要,只要算法适用于一种类型的项目,它就适用于其他所有项目。重要的是您想要全局对齐还是局部对齐。
全局比对尝试比对每个序列中的每个残基,当序列相似且大小大致相等时最有用。通用的全局对齐技术是 Needleman-Wunsch 算法 算法,该算法是基于动态规划。当人们谈论莱文斯坦距离时,他们通常指的是全局对齐。该算法非常简单,以至于几个人独立发现了它,有时您可能会遇到 Wagner-Fischer 算法本质上是同一件事,但在两个字符串之间的编辑距离的上下文中更常被提及。
局部比对对于怀疑在较大序列上下文中包含相似区域或相似序列基序的不相似序列更有用。 Smith-Waterman 算法是一种基于动态的通用局部对齐方法编程。它很少用于自然语言处理,而更经常用于生物信息学。
In general, this is called the sequence alignment problem. Actually it does not matter what entities you align - bits, characters, words, or DNA bases - as long as the algorithm works for one type of items it will work for everything else. What matters is whether you want global or local alignment.
Global alignment, which attempt to align every residue in every sequence, is most useful when the sequences are similar and of roughly equal size. A general global alignment technique is the Needleman-Wunsch algorithm algorithm, which is based on dynamic programming. When people talk about Levinstain distance they usually mean global alignment. The algorithm is so straightforward, that several people discovered it independently, and sometimes you may come across Wagner-Fischer algorithm which is essentially the same thing, but is mentioned more often in the context of edit distance between two strings of characters.
Local alignment is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method also based on dynamic programming. It is quite rarely used in natural language processing, and more often - in bioinformatics.
您可以使用与查找字符串中的编辑距离相同的算法来查找句子中的编辑距离。您可以将句子视为从字母表中提取的字符串,其中每个字符都是英语中的一个单词(假设使用空格来标记一个“字符”的开始位置和下一个“字符”的结束位置)。用于计算编辑距离的任何标准算法,例如用于计算 Levenshtein 距离的标准动态编程方法 ,可以用来解决这个问题。
You can use the same algorithms that are used for finding edit distance in strings to find edit distances in sentences. You can think of a sentence as a string drawn from an alphabet where each character is a word in the English language (assuming that spaces are used to mark where one "character" starts and the next ends). Any standard algorithm for computing edit distance, such as the standard dynamic programming approach for computing Levenshtein distance, can be adapted to solve this problem.
这是 @templatetypedef 的想法在 ActionScript 中的示例实现(它对我来说非常有用),它计算归一化 Levenshtein 距离(或者换句话说,给出 [0..1] 范围内的值)
我希望这会有所帮助!
Here is a sample implementation of the @templatetypedef's idea in ActionScript (it worked great for me), which calculates the normalized Levenshtein distance (or in other words gives a value in the range [0..1])
I hope this will help!
D 中的实现可以推广到任何范围,因此也可以推广到数组。因此,通过将句子分割成字符串数组,它们可以通过算法运行,并提供编辑编号。
https://dlang.org/library/std/algorithm/comparison/levenshtein_distance。 html
The implementation in D is generalized over any range, and thus array. So by splitting your sentences into arrays of strings they can be run through the algorithm and an edit number will be provided.
https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html
这是使用动态规划方法的句子编辑距离算法的 Java 实现。
Here is the Java implementation of edit distance algorithm for sentences using dynamic programming approach.
从 nltk 包中查看 python 中的 AlignedSent 函数。它在单词级别对齐句子。
https://www.nltk.org/api/nltk.align.html
check out the AlignedSent function in python from the nltk package. It aligns sentences at the word level.
https://www.nltk.org/api/nltk.align.html