Hamming Distance vs. Levenshtein Distance
For the problem I'm working on, finding distances between two sequences to determine their similarity, sequence order is very important. However, the sequences I have are not all the same length, so I pad any shorter string with empty points until both sequences are the same length, in order to satisfy the Hamming distance requirement. Is there any major problem with doing this, since all I care about is the number of transpositions (not insertions or deletions, which Levenshtein also counts)?
I've found that Hamming distance is much, much faster than Levenshtein as a distance metric for longer sequences. When should one use Levenshtein distance (or derivatives of Levenshtein distance) instead of the much cheaper Hamming distance? For two sequences of equal length, Hamming distance can be considered an upper bound on the Levenshtein distance between them, so if I am comparing two sequences for an order-biased similarity metric rather than for the absolute minimal number of moves needed to match them, there isn't an apparent reason for me to choose Levenshtein over Hamming as a metric, is there?
That question really depends on the types of sequences you are matching, and what result you want.
If it's not a problem that "1234567890" and "0123456789" are considered totally different, indeed Hamming distance is fine.
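To make the example concrete, here is a minimal sketch in Python that computes both distances for those two strings. The function names `hamming` and `levenshtein` are just illustrative, and the Levenshtein routine is the textbook dynamic-programming version with unit costs, not a tuned implementation.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions (unit cost)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(hamming("1234567890", "0123456789"))      # 10: every position differs
print(levenshtein("1234567890", "0123456789"))  # 2: drop the last "0", insert a leading "0"
```

A single shift makes every position mismatch under Hamming, while Levenshtein sees it as two edits. The price is running time: Hamming is linear in the sequence length, whereas this Levenshtein routine takes time proportional to the product of the two lengths.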
In addition to Johan's correct answer, the padding itself can be problematic.
For example, when you compare "123" to "123456", the result depends on whether you pad at the start or at the end of the string. The similarity of "___123" with "123456" is 0, but the similarity of "123___" with "123456" is 3.
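The effect of the padding side can be checked with a short Python sketch. The pad character `_` and the helper names `similarity` and `hamming` are assumptions made for illustration. Here "similarity" means the number of positions where both strings hold the same non-pad character.

```python
PAD = "_"  # assumed pad symbol; it must not occur in the real data


def similarity(a, b):
    """Number of positions where a and b hold the same non-pad character."""
    return sum(x == y and x != PAD for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions where a and b differ (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b))


short, full = "123", "123456"
left_padded = short.rjust(len(full), PAD)   # "___123"
right_padded = short.ljust(len(full), PAD)  # "123___"

print(similarity(left_padded, full), hamming(left_padded, full))    # 0 6
print(similarity(right_padded, full), hamming(right_padded, full))  # 3 3
```

The same pair of sequences yields a Hamming distance of either 6 or 3 depending only on where the padding goes, so the padding convention becomes part of the metric.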