In R, is there a way to identify similar string values across two columns of a data frame?

Posted 2025-01-11 09:53:05

I have a large data frame with 70,000 observations, where column A and column B hold pairs of nurses and physicians who worked the same shift together. Unfortunately, in some rows (I can't quite gauge how many, but it's a minority) column A and column B contain the same person, with the name spelled slightly differently because a middle name or nickname appears in one column but not the other. I want to create a data frame that contains ONLY those rows. Is there a way to use %like% with which(), or something similar, to identify all of these rows?

Here is an example of what I have:

A             B
Jimmy Fallon  Harry Potter
Jimmy Fallon  James Fallon
Harry Potter  John Oliver
Harry Potter  Harold Potter

What I want:

A             B
Jimmy Fallon  James Fallon
Harry Potter  Harold Potter

Comments (1)

注定孤独终老 2025-01-18 09:53:05

One option is to compute the edit distance between the two columns with adist, then filter to the rows where that distance is low. This approach assumes the two names in a matching row share a common element (e.g., the last name), so that only a few characters differ.

library(tidyverse)

df %>% 
  rowwise() %>% 
  filter(adist(x=A,y=B,ignore.case=TRUE) <= 3)

Output

  A            B            
  <chr>        <chr>        
1 Jimmy Fallon James Fallon 
2 Harry Potter Harold Potter

Or with base R:

df[mapply(adist, df$A, df$B) <= 3, ]

Data

df <- structure(list(A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter", 
"Harry Potter"), B = c("Harry Potter", "James Fallon", "John Oliver", 
"Harold Potter")), class = "data.frame", row.names = c(NA, -4L
))
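
Since the approach above leans on the two names sharing a last name, that assumption can also be checked directly. The following is only a sketch under that assumption: last_token is a hypothetical helper (not part of any package) that takes the last whitespace-separated word of each name, and the rule "same last name, different full string" is an illustration rather than a tested matching rule. It would also flag two different people who happen to share a surname, so it may be worth combining with the adist filter.

```r
# Same example data as above.
df <- data.frame(
  A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter", "Harry Potter"),
  B = c("Harry Potter", "James Fallon", "John Oliver", "Harold Potter")
)

# Hypothetical helper: last whitespace-separated token of each name.
# A single-word name is returned unchanged.
last_token <- function(x) sub("^.*\\s", "", trimws(x))

# Rows where the last names agree (ignoring case) ...
same_last <- tolower(last_token(df$A)) == tolower(last_token(df$B))
# ... but the full strings are not simply identical.
not_identical <- tolower(df$A) != tolower(df$B)

likely_same <- df[same_last & not_identical, ]
likely_same
```

Both comparisons are vectorised, so this avoids row-by-row work on a 70,000-row data frame.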

Determining Cutoff

You might need to change the cutoff value depending on your data. To choose it, compute the distance for every row, sort by it, and look for where the likely matches end; this matters most when names are also slightly misspelled, as in the example below.

df2 <- data.frame(A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter", "Hary Poter"), 
                 B = c("Harry Potter", "James Fallo", "John Oliver", "Harold Potter"))

# Use df2 (the misspelled example), not df
df2 %>% 
  rowwise() %>% 
  mutate(dist = adist(x=A,y=B,ignore.case=TRUE)) %>%
  as.data.frame() %>% 
  arrange(dist)

             A             B dist
1 Jimmy Fallon   James Fallo    4
2   Hary Poter Harold Potter    4
3 Harry Potter   John Oliver    9
4 Jimmy Fallon  Harry Potter   10

So, now we know that 4 would be a better cutoff for filtering.
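
Because a fixed cutoff such as 3 or 4 means something different for short and long names, one variation is to scale the distance by the length of the longer name. This is a sketch, not part of the original answer: the rel_dist column and the idea of thresholding it are assumptions for illustration, and any threshold would still need to be checked against your own data.

```r
df2 <- data.frame(
  A = c("Jimmy Fallon", "Jimmy Fallon", "Harry Potter", "Hary Poter"),
  B = c("Harry Potter", "James Fallo", "John Oliver", "Harold Potter")
)

# Row-wise edit distance: one value per row, not the full A-by-B matrix.
dist <- mapply(
  function(a, b) drop(adist(a, b, ignore.case = TRUE)),
  df2$A, df2$B,
  USE.NAMES = FALSE
)

# Divide by the longer name's length so the score falls between 0 and 1.
rel_dist <- dist / pmax(nchar(df2$A), nchar(df2$B))

df2$dist <- dist
df2$rel_dist <- round(rel_dist, 2)

# Sort so the most similar pairs come first.
df2[order(df2$rel_dist), ]
```

Sorting by rel_dist and scanning for the gap between likely matches and clear non-matches works the same way as with the raw distance, but the scale no longer depends on how long the names are.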
