r pattern-matching string-matching fuzzy

Order-insensitive fuzzy matching of "Firstname Lastname" / "Lastname Firstname" in R?

Posted on 2025-01-02 07:25:19

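I have two lists of names for the same set of students, compiled separately. There are many typos, and I have been using fuzzy matching to link the two lists. I am 99+% of the way there with agrep and similar, but I am stuck on the following basic problem: how can I match (for example) the names "Adrian Bruce" and "Bruce Adrian"? The Levenshtein edit distance is no good for this particular case, since it counts single-character edits and so treats the two swapped words as a large number of substitutions.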

This must be a very common problem, but I can't find any standard R package or routine that addresses it. I presume I am missing something obvious...?


东京女 2025-01-09 07:25:19

Well, one fairly easy way is to swap the words and match again...

y <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
# Swap the text before and after the last space
y2 <- sub("(.*) (.*)", "\\2 \\1", y)

agrep("Bruce Lee", y)  # No match
agrep("Bruce Lee", y2) # Match!
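
A variation on the same idea, as a rough sketch only: instead of matching twice, put every name into a canonical form (lower case, punctuation dropped, words sorted alphabetically) and fuzzy-match the canonical forms. The helper name `normalise_name` is made up for this illustration, and it assumes names are separated by whitespace. Note that a typo in the first letter of a word can change the sort order, so this does not cover every case.

```r
# Hypothetical helper: canonicalise names so that word order and
# punctuation no longer matter.
normalise_name <- function(x) {
  x <- tolower(gsub("[[:punct:]]", " ", x))        # replace commas etc. with spaces
  tokens <- strsplit(trimws(x), "[[:space:]]+")    # split each name into words
  # sort the words of each name and glue them back together
  vapply(tokens, function(tk) paste(sort(tk), collapse = " "), character(1))
}

y   <- c("Bruce Almighty", "Lee, Bruce", "Leroy Brown")
key <- normalise_name(y)

# agrep() still tolerates small typos after normalisation
hit <- agrep(normalise_name("Bruse Lee"), key)
y[hit]
```

With this approach "Lee, Bruce" and "Bruce Lee" reduce to the same key, so a single agrep call covers both orderings.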


旧夏天 2025-01-09 07:25:19

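The technique I usually use is pretty robust and relatively insensitive to ordering, punctuation, etc. It's based on objects called "n-grams"; with n=2 they are called "bigrams". For instance: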

"Adrian Bruce" --> ("Ad","dr","ri","ia","an","n "," B","Br","ru","uc","ce")
"Bruce Adrian" --> ("Br","ru","uc","ce","e "," A","Ad","dr","ri","ia","an")

Each string has 11 bigrams, 9 of which are in common. The similarity score is therefore very high: 9/11, or 0.818, where 1.000 is a perfect match.

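I am not very familiar with R, but if a package does not exist, this technique is very easy to code: loop through the bigrams of string 1 and tally how many are contained in string 2.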
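
For what it's worth, the tallying described above might look roughly like this in base R. The function names `bigrams` and `bigram_similarity` are invented for this sketch. Dividing by the larger of the two bigram counts is one possible choice, made here so that the score is symmetric; other normalisations are equally reasonable.

```r
# Split a string into its overlapping two-character pieces (bigrams).
bigrams <- function(s) {
  n <- nchar(s)
  if (n < 2) return(character(0))
  substring(s, 1:(n - 1), 2:n)
}

# Hypothetical helper: count the distinct bigrams of `a` that also occur
# in `b`, divided by the larger number of distinct bigrams (an assumption
# of this sketch, so that swapping a and b gives the same score).
bigram_similarity <- function(a, b) {
  ga <- unique(bigrams(a))
  gb <- unique(bigrams(b))
  if (length(ga) == 0 || length(gb) == 0) return(0)
  sum(ga %in% gb) / max(length(ga), length(gb))
}

score <- bigram_similarity("Adrian Bruce", "Bruce Adrian")
round(score, 3)
```

For the two names in the question this gives 9 shared bigrams out of 11. The comparison is case-sensitive as written; lower-casing both strings first would be a natural extension.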
