How can I reuse fuzzy-matching code (for addresses) built with apply/sapply in a data.frame group_by setup, i.e. match within groups?

Posted 2025-01-17 08:12:24

I'm trying to group a list of addresses for a set of individuals. An individual can have more than one address mapped to them, and the addresses were captured in the system with all kinds of manual inconsistencies, e.g. typos or additional info (such as a title) in some versions of the same address.

library(tidyverse)
df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2),
  addresses = c(
    'king st toronto',
    'queen st',
    'king toronto',
    'broadway st',
    'broadway ave',
    'attn: broadway ave'
  )
)

It doesn't matter which variation of an address I end up choosing; all that is required is to group/recognize the variations as ONE and the same address, say, in a new column.

I used Levenshtein edit distance (via agrepl), along with base R's apply and sapply as shown below, to do the fuzzy matching, and then mapped each address to one unique address (in the fuzzy sense) per individual. Here I picked the variation with the fewest characters, but any one representation is okay.

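# matches[i, j] is TRUE when address j, used as the pattern,
# is found approximately inside address i
matches <-
  sapply(df[['addresses']], function(pattern)
    agrepl(pattern, df[['addresses']], max.distance = 0.3))

# for each address (row), keep the shortest of its fuzzy matches
apply(matches, 1, function(arg)
  df[['addresses']][arg][which.min(nchar(df[['addresses']][arg]))])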

This code works stand-alone for one group, but I'm not able to generalize it to the entire data.frame with multiple groups, say in a dplyr/group_by setup. I tried using plyr::ddply(data.frame, .(groupby_var), <FUNCTION>) but ran into the error 'Error in apply() dim(X) must have a positive length'.

Expected Output:

individuals	addresses
1	king toronto
1	queen st
1	king toronto
1	broadway st
2	broadway ave
2	broadway ave

梦境 2025-01-24 08:12:24

Here's one option that might help you out: I create a new column that holds, for each address, all the addresses that match it within a certain distance, pasted together into one string. If you only want one address, you can subset that string up to the first comma-space separator. You can then group by the combined name and run your normal tidyverse functions on each group of addresses, or group by individual and paste together all of their possible addresses.

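Interestingly, the matching is not symmetric, so you may run into problems with it. The Levenshtein edit distance itself is symmetric, but agrep() searches for the pattern as an approximate substring of each target and scales a fractional max.distance by the length of the pattern. For example, "broadway st" matches inside "attn: broadway ave" at 0.3, but "attn: broadway ave" does NOT match inside "broadway st" at 0.3. I bumped your distance up to 0.5 in my code, but you'll likely run into this problem elsewhere.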
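To make the asymmetry concrete, here is a small check in base R on the two strings mentioned above. The variable names are only illustrative. It compares both matching directions at 0.3 and, for contrast, uses adist() to compute the plain edit distance in each direction.

```r
# Two of the addresses from the question
short_addr <- "broadway st"
long_addr  <- "attn: broadway ave"

# Short string as the pattern, searched inside the long string
short_in_long <- agrepl(short_addr, long_addr, max.distance = 0.3, fixed = TRUE)

# Long string as the pattern, searched inside the short string
long_in_short <- agrepl(long_addr, short_addr, max.distance = 0.3, fixed = TRUE)

# Whole-string edit distance in each direction, for comparison
dist_forward  <- adist(short_addr, long_addr)[1, 1]
dist_backward <- adist(long_addr, short_addr)[1, 1]

print(c(short_in_long = short_in_long, long_in_short = long_in_short))
print(c(forward = dist_forward, backward = dist_backward))
```

The two agrepl() calls should disagree while the two adist() values agree, which suggests the asymmetry comes from which string plays the role of the pattern rather than from the distance measure.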

df %>%
  rowwise() %>%
  mutate(agrep_matches=list(agrep(addresses, df$addresses, max.distance = 0.5))) %>%
  mutate(other_names=paste(df$addresses[unlist(agrep_matches)], collapse = ", "))

# A tibble: 6 x 4
# Rowwise: 
  individuals addresses          agrep_matches other_names                                  
        <dbl> <chr>              <list>        <chr>                                        
1           1 king st toronto    <int [2]>     king st toronto, king toronto                
2           1 queen st           <int [1]>     queen st                                     
3           1 king toronto       <int [2]>     king st toronto, king toronto                
4           1 broadway st        <int [3]>     broadway st, broadway ave, attn: broadway ave
5           2 broadway ave       <int [3]>     broadway st, broadway ave, attn: broadway ave
6           2 attn: broadway ave <int [3]>     broadway st, broadway ave, attn: broadway ave
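If the matching should stay within each individual, as the question asks, one possible approach is to wrap the original logic in a function of a character vector and call it from a grouped mutate(). The sketch below is an assumption-laden illustration, not a tested general solution: shortest_match and address_group are hypothetical names, the 0.3 threshold is taken from the question, and the addresses are assumed to contain no NA or empty strings (agrepl() rejects those as patterns). It uses vapply() instead of sapply() plus apply(), because sapply() returns a plain vector rather than a matrix when a group has a single row, which is one plausible cause of the 'dim(X) must have a positive length' error.

```r
library(dplyr)

df <- tibble(
  individuals = c(1, 1, 1, 1, 2, 2),
  addresses = c(
    'king st toronto',
    'queen st',
    'king toronto',
    'broadway st',
    'broadway ave',
    'attn: broadway ave'
  )
)

# Hypothetical helper: for every address in x, collect the addresses in x
# that are found approximately inside it and return the shortest of them.
# Same direction of matching as the sapply()/apply() code in the question.
shortest_match <- function(x, max_dist = 0.3) {
  vapply(x, function(address) {
    hits <- vapply(x, function(pattern) {
      agrepl(pattern, address, max.distance = max_dist, fixed = TRUE)
    }, logical(1))
    candidates <- x[hits]  # never empty: every address matches itself
    candidates[which.min(nchar(candidates))]
  }, character(1), USE.NAMES = FALSE)
}

# Run the helper separately for each individual
result <- df %>%
  group_by(individuals) %>%
  mutate(address_group = shortest_match(addresses)) %>%
  ungroup()

print(result)
```

Because the helper only ever sees the addresses of one individual, "broadway st" (individual 1) and "broadway ave" (individual 2) are never compared with each other. The asymmetry discussed above still applies within a group, so the threshold may need tuning on real data.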
