Hamming distance vs. Levenshtein distance

Posted on 2024-10-10 01:24:35

For the problem I'm working on, I compute distances between pairs of sequences to determine their similarity, and sequence order is very important. However, my sequences are not all the same length, so I pad the shorter one with empty points until both sequences are the same length, which satisfies the equal-length requirement of Hamming distance. Is there any major problem with doing this, given that all I care about is the number of transpositions (not insertions or deletions, which Levenshtein counts)?

I've found that Hamming distance is much, much faster than Levenshtein as a distance metric for longer sequences. When should one use Levenshtein distance (or a variant of it) instead of the much cheaper Hamming distance? For two sequences of equal length, Hamming distance is an upper bound on the Levenshtein distance. So if I am comparing two sequences for an order-biased similarity metric, rather than for the absolute minimal number of moves needed to match them, there isn't an apparent reason for me to choose Levenshtein over Hamming as a metric, is there?


Comments (2)

暮光沉寂 2024-10-17 01:24:35

That really depends on the types of sequences you are matching and on what result you want.

If it's not a problem that "1234567890" and "0123456789" are considered totally different, then Hamming distance is indeed fine.
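To make the shifted-string case concrete, here is a minimal Python sketch. The function names `hamming` and `levenshtein` are my own and were not part of the original answer, and the Levenshtein routine is the textbook two-row dynamic program rather than any particular library's implementation.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # substitute, or free match
            ))
        prev = curr
    return prev[-1]


a, b = "1234567890", "0123456789"
print(hamming(a, b))      # 10: every position differs
print(levenshtein(a, b))  # 2: drop the trailing "0", insert "0" at the front
```

A single shift makes Hamming report the maximum possible distance, while Levenshtein sees the two strings as nearly identical. The price is cost: this Levenshtein loop does work proportional to len(a) * len(b), whereas Hamming is a single pass.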

浅浅淡淡 2024-10-17 01:24:35

In addition to Johan's correct answer, the padding itself can be problematic.

For example, when you compare 123 to 123456, the result depends on whether you pad at the start or at the end of the string. The similarity of ___123 with 123456 is 0, but the similarity of 123___ with 123456 is 3.
