How to measure similarity between strings?

Posted on 2024-11-08 04:45:02

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check whether any two strings in a vector are similar.

For example:

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")

I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

Comments (3)

谎言 2024-11-15 04:45:02

This can be done based on, e.g., the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers to related questions.

But most often agrep will do what you want:

> sapply(pres, agrep, pres)
$` Obama, B.`
[1] 1 3

$`Bush, G.W.`
[1] 2

$`Obama, B.H.`
[1] 1 3

$`Clinton, W.J.`
[1] 4
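
To make the measure itself concrete, here is a minimal sketch of the Levenshtein distance written as a dynamic-programming table. The function name levenshtein is hypothetical and only for illustration; with its default unit costs, base R's adist() should give the same values and is the better choice in practice.

```r
# Hypothetical helper (not from any package): classic dynamic-programming
# Levenshtein distance with unit costs for insertion, deletion, substitution.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # delete x[i]
                             d[i + 1, j] + 1L,  # insert y[j]
                             d[i, j] + cost)    # substitute (or keep)
    }
  }
  d[n + 1, m + 1]
}

levenshtein("kitten", "sitting")
# [1] 3
levenshtein(" Obama, B.", "Obama, B.H.")
# [1] 3
```

The second call needs three edits: drop the leading space, then append "H" and ".".
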
云淡月浅 2024-11-15 04:45:02

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.

lapply(pres, agrep, pres, value = TRUE)

[[1]]
[1] " Obama, B."  "Obama, B.H."

[[2]]
[1] "Bush, G.W."

[[3]]
[1] " Obama, B."  "Obama, B.H."

[[4]]
[1] "Clinton, W.J."
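
How loose the matching is depends on agrep's max.distance argument, which defaults to 0.1 (a fraction of the pattern length). The sketch below assumes the same pres vector as in the question and compares a strict and a looser setting; the variable names strict and loose are only illustrative, and the right threshold will depend on your data.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# max.distance = 0 allows no edits, so only entries containing the
# pattern exactly are returned
strict <- agrep(" Obama, B.", pres, max.distance = 0, value = TRUE)

# A value of 1 or more is read as an absolute number of allowed edits
loose <- agrep(" Obama, B.", pres, max.distance = 3, value = TRUE)

print(strict)
print(loose)
```

Raising max.distance catches more spelling variants but also increases the risk of merging names that are genuinely different.
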
晨曦÷微暖 2024-11-15 04:45:02

Add another duplicate to show it works with more than one duplicate.

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")

adist() computes the string (edit) distance between two character vectors:

adist(" Obama, B.", pres)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    9    3   10    7

For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:

d <- adist(" Obama, B.", pres)
# index by the position of the smallest non-zero distance, not by its value
pres[which(d == min(d[d > 0]))]
# [1] "Obama, B.H."
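
The same idea can be applied to every name at once. The following is a rough sketch, assuming the five-element pres vector above; the names d, idx and closest are only illustrative. It builds the full pairwise distance matrix with adist() and picks, for each string, the nearest other string.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.", "Bush, G.")

d <- adist(pres)               # 5 x 5 matrix of pairwise edit distances
diag(d) <- NA                  # ignore each string's distance to itself
idx <- apply(d, 1, which.min)  # per row, position of the smallest remaining distance
closest <- pres[idx]

data.frame(name = pres,
           closest = closest,
           distance = d[cbind(seq_along(pres), idx)])
```

Note that every string gets a "closest" partner here, even when nothing is really similar to it, so the distance column still needs to be checked against a threshold.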

To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().

keepunique <-  function(previousones, x){
    if(any(adist(x, previousones)<5)){
        x <- NULL
    }
    return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B."    "Bush, G.W."    "Clinton, W.J."
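
As an alternative to the sequential Reduce() approach (a different technique, not the one used above), the names could also be grouped by hierarchical clustering on the adist() matrix. This is only a sketch: the cut height of 4 is an assumed threshold chosen for this toy vector, and the names hc, groups and representatives are illustrative.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.", "Bush, G.")

# Cluster on pairwise edit distances (hclust uses complete linkage by default)
hc <- hclust(as.dist(adist(pres)))

# Strings that merge below the assumed height of 4 end up in the same group
groups <- cutree(hc, h = 4)
split(pres, groups)

# Keep the first spelling seen in each group as its representative
representatives <- unname(vapply(split(pres, groups),
                                 function(g) g[1], character(1)))
representatives
# [1] " Obama, B."    "Bush, G.W."    "Clinton, W.J."
```

Unlike keepunique(), the grouping does not depend on the order of the input, and split(pres, groups) shows which variants were merged.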