How can I measure the similarity between strings?
I have a bunch of names, and I want to obtain the unique ones. However, due to spelling errors and inconsistencies in the data, the names might be written down wrong. I am looking for a way to check whether any two strings in a vector are similar.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?
This can be done based on, for example, the Levenshtein distance. There are multiple implementations of it in different packages, and some solutions and packages can be found in the answers to related questions. But most often agrep will do what you want.

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
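The code that originally accompanied these answers is not preserved here, so the following is only a minimal sketch of how agrep might be used on the vector from the question. The search pattern "Obama, B." and the use of value = TRUE are illustrative choices, not necessarily what the answerers wrote.

```r
# Names from the question (note the leading space in the first element).
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")

# agrep() looks for elements that approximately contain the pattern,
# using a generalized Levenshtein edit distance. With value = TRUE it
# returns the matching strings rather than their indices.
# The pattern below is an assumption made for illustration.
matches <- agrep("Obama, B.", pres, value = TRUE)

print(matches)
```

The tolerance can be tuned through the max.distance argument; a looser value catches more misspellings but also raises the risk of merging names that are genuinely different.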

Add another duplicate to show that it works with more than one duplicate.
adist gives the string distance between two character vectors.
For example, to select the closest string to " Obama, B.", you can take the one with the minimal distance. To avoid matching the identical string, I took only distances greater than zero.
To obtain unique names, taking spelling errors and inconsistencies into account, you can compare each string to all previous ones and drop it if a similar one already exists. I created a keepunique() function that does this. keepunique() is then applied successively to all elements of the vector with Reduce().
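The code for this answer is missing from the page, so what follows is a reconstruction sketch of the steps described above, not the answerer's exact code. The extra duplicate " Bush, G.W.", the argument names, and the max_dist threshold of 5 are assumptions made for illustration.

```r
pres <- c(" Obama, B.", "Bush, G.W.", "Obama, B.H.", "Clinton, W.J.")
# Add a second near-duplicate (assumed example value).
pres <- c(pres, " Bush, G.W.")

# adist() returns a matrix of edit distances; here it has one row.
d <- adist(" Obama, B.", pres)

# Closest string that is not identical: smallest distance above zero.
closest <- pres[which(d == min(d[d > 0]))]

# Keep x only if it is not within max_dist edits of any name kept so far.
# max_dist = 5 is a hypothetical threshold; tune it for your data.
keepunique <- function(kept, x, max_dist = 5) {
  if (any(adist(x, kept) < max_dist)) kept else c(kept, x)
}

# Reduce() feeds the elements in one at a time, carrying the kept names.
unique_names <- Reduce(keepunique, pres)

print(closest)
print(unique_names)
```

Because each name is compared only with the names already kept, the first spelling encountered is the one that survives, so the order of the input vector affects the result.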