如何在计算两个字符串的编辑距离时找到它们的共同部分

发布于 2024-10-05 15:54:29 字数 535 浏览 0 评论 0原文

我必须在源字符串和一组模式字符串之间执行模糊匹配。这种匹配由公式给出 1 - D(I,P) / max(长度(I),长度(P))
其中

  • I 是输入字符串
  • P 是模式字符串
  • D(I,P) 是 I 和 P 之间的编辑距离。

一旦我找到使该分数最大化的 P,我希望获得 I 和 P 的公共部分之间的映射P

例如:如果 I="sunday" 且 P="saturday",则映射将类似于以下对的列表:
{{0, 0}, {1, 3}, {3, 5}, {4, 6}, {5, 7}}
因为常见字符是

中的 's'、'u'、'd'、'a' 和 'y'这篇维基百科文章,人们可以轻松找到一种计算编辑距离的实现,但我并不完全清楚如何从它描述的过程中构建的矩阵中获取映射。谁能启发我吗?

谢谢

I have to perform a fuzzy matching between a source string and a set of pattern strings. This matching is given by the formula
1 - D(I,P) / max(length(I),length(P))
where

  • I is the input string
  • P is a pattern string
  • D(I,P) is the levenshtein distance between I and P.

Once I have found P that maximizes this score, I would like to have the mapping between the common parts of I and P

For instance: if I="sunday" and P="saturday", the mapping would be sth like a list of the following pairs:
{{0, 0}, {1, 3}, {3, 5}, {4, 6}, {5, 7}}
as the common characters are 's', 'u', 'd', 'a' and 'y'

In this wikipedia article, one can easily find an implementation to compute the levenshtein distance but it isn't completely clear to me how I could get the mapping from the matrix built during the process it described. Can anyone enlighten me?

thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

北恋 2024-10-12 15:54:29

您作为示例给出的映射根本不包含我所看到的编辑距离,因为它只是寻找常见字符。也许我误解了你,但你不需要编辑距离矩阵来映射常见字符;您唯一一次查看编辑距离是在 D(I,P) 计算期间确定得分最高的模式字符串。要生成您作为示例提供的映射,只需迭代两个字符串即可确定用于识别对的字符索引。

The mapping you give as an example doesn't incorporate the edit distance at all from what I can see, as it is just looking for common characters. Maybe I am misunderstanding you, but you do not need the edit distance matrix for your mapping of common characters; the only time you would look at the edit distance would be during your D(I,P) calculation to determine the highest scoring pattern string. To generate the mapping you gave as an example, it would be a simple matter of iterating through both strings to determine the character indices for identifying the pairs.

蓝色星空 2024-10-12 15:54:29

从同一数组的两个副本开始,称为“源”和“目标”,它们是枚举的源字符串中的位置。删除操作会从两个数组中删除相应的元素,并减少目标数组中的后续值。插入会增加目标数组中的以下值。然后只需压缩两个数组并生成地图即可。

Start with two copies of the same array called "source" and "destination", which are the positions in the source string enumerated. Deletions remove the corresponding element from both arrays and decrement following values in the destination array. Insertions increment following values in the destination array. Then just zip both arrays and generate your map.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文