匹配R数据框列中的部分数值与另一列中的值匹配

发布于 2025-02-07 22:58:43 字数 1148 浏览 0 评论 0原文

我有一个r数据框 df1 ，看起来像下面：

Product       new_ID
Prod1         129000007
Prod2         7432309490
Prod3         1708289014
Prod4         4741602975
Prod5         906485301012

另一个， df2 ，看起来像：

Brand         old_ID
Brand1        13554998333
Brand2        17432309490
Brand3        14300012960
Brand4        14741602975
Brand5        2710420383988

给出某种上下文，数据来自两个不同的数据库代码（列new_id和old_id）的表示略有不同。例如，与new_id中的值相比，old_id列值中的prod 2和Brand 2在old_id列值中相同。对于产品4和Brand 4。请注意，new_id中的所有代码都是 old_id。

编辑：另请注意，new_id和old_id值之间的差异并非总是领先数字的差异。有时，old_id值的第一个也是最后一个数字被删除以获取new_id值。

因此，我想在 df2 中找到所有的行，这些行使用字段old_id和new_id df1 中的产品。我可以想到使用old_id使用grepl中的new_id值。但是我认为这一次只能完成。

是否有更好的方法可以在new_id值的向量匹配old_id列中？

原文

I have a R data frame df1 that looks like below:

Product       new_ID
Prod1         129000007
Prod2         7432309490
Prod3         1708289014
Prod4         4741602975
Prod5         906485301012

And another one, df2, which looks like:

Brand         old_ID
Brand1        13554998333
Brand2        17432309490
Brand3        14300012960
Brand4        14741602975
Brand5        2710420383988

To give some context, the data comes from two different databases where product codes (columns new_ID and old_ID respectively) are represented slightly differently. For example, Prod 2 and Brand 2 are the same with one extra digit in the old_ID column value compared to the value in new_ID. Same for Prod 4 and Brand 4. Note that all the codes in new_ID are not in old_ID.

Edit: Also note that the difference between the new_ID and old_ID value is not always that of the leading digit. Sometimes the first and last digit of an old_ID value is dropped to get new_ID value.

So I want to find all the rows in df2 which contain the products in df1 using the fields old_ID and new_ID. I could think of matching the new_ID value in old_ID using grepl. But I think that can be done only one at a time.

Is there a better way to find a match of a vector of new_ID values in the old_ID column?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

雨后咖啡店 2025-02-14 22:58:43

如果确定差异只是一个额外的指导数字，那么您可以删除领先的额外数字并使用合并来查找匹配项：

df1 <- data.frame(Product = c("Prod1",  
                              "Prod2",   
                              "Prod3",   
                              "Prod4",   
                              "Prod5"),
                  new_ID = c("129000007",
                             "7432309490",
                             "1708289014",
                             "4741602975",
                             "906485301012"))
df2 <- data.frame(Brand = c("Brand1",
                            "Brand2",
                            "Brand3",
                            "Brand4",
                            "Brand5"),
                  old_ID = c("13554998333",
                             "17432309490",
                             "14300012960",
                             "14741602975",
                             "2710420383988"))
df2$new_ID <- sub("^.", "", df2$old_ID)
matches <- merge(df1, df2, by = "new_ID")

编辑

matches
#      new_ID Product  Brand      old_ID
#1 4741602975   Prod4 Brand4 14741602975
#2 7432309490   Prod2 Brand2 17432309490

：
如果总是有额外的数字和可能的其他东西，那么FuzzyJoin可以完成这项工作。
我添加了prod6和brand6和另一个最后的数字

df1 <- data.frame(Product = c("Prod1",  
                              "Prod2",   
                              "Prod3",   
                              "Prod4",   
                              "Prod5",   
                              "Prod6"),
                  new_ID = c("129000007",
                             "7432309490",
                             "1708289014",
                             "4741602975",
                             "906485301012",
                             "123456789"))
df2 <- data.frame(Brand = c("Brand1",
                            "Brand2",
                            "Brand3",
                            "Brand4",
                            "Brand5",
                            "Brand6"),
                  old_ID = c("13554998333",
                             "17432309490",
                             "14300012960",
                             "14741602975",
                             "2710420383988",
                             "11234567892"))
#comment next line if leading digit is not always added...
df2$new_ID <- sub("^.", "", df2$old_ID)

require(fuzzyjoin)

#put max_dist = 2 if not always first extra digit 
matches <- stringdist_inner_join(df1, df2, by = "new_ID", method = "lcs", max_dist = 1) 
matches
  Product   new_ID.x  Brand      old_ID   new_ID.y
1   Prod2 7432309490 Brand2 17432309490 7432309490
2   Prod4 4741602975 Brand4 14741602975 4741602975
3   Prod6  123456789 Brand6 11234567892 1234567892

If for sure the difference is only an extra leading digit then you can just remove the leading extra digit and use merge to find matches:

df1 <- data.frame(Product = c("Prod1",  
                              "Prod2",   
                              "Prod3",   
                              "Prod4",   
                              "Prod5"),
                  new_ID = c("129000007",
                             "7432309490",
                             "1708289014",
                             "4741602975",
                             "906485301012"))
df2 <- data.frame(Brand = c("Brand1",
                            "Brand2",
                            "Brand3",
                            "Brand4",
                            "Brand5"),
                  old_ID = c("13554998333",
                             "17432309490",
                             "14300012960",
                             "14741602975",
                             "2710420383988"))
df2$new_ID <- sub("^.", "", df2$old_ID)
matches <- merge(df1, df2, by = "new_ID")

which gives:

matches
#      new_ID Product  Brand      old_ID
#1 4741602975   Prod4 Brand4 14741602975
#2 7432309490   Prod2 Brand2 17432309490

EDIT:
If there s always a leading extra digit and potentially other stuff then fuzzyjoin can do the job.
I added Prod6 and Brand6 with another final digit

df1 <- data.frame(Product = c("Prod1",  
                              "Prod2",   
                              "Prod3",   
                              "Prod4",   
                              "Prod5",   
                              "Prod6"),
                  new_ID = c("129000007",
                             "7432309490",
                             "1708289014",
                             "4741602975",
                             "906485301012",
                             "123456789"))
df2 <- data.frame(Brand = c("Brand1",
                            "Brand2",
                            "Brand3",
                            "Brand4",
                            "Brand5",
                            "Brand6"),
                  old_ID = c("13554998333",
                             "17432309490",
                             "14300012960",
                             "14741602975",
                             "2710420383988",
                             "11234567892"))
#comment next line if leading digit is not always added...
df2$new_ID <- sub("^.", "", df2$old_ID)

require(fuzzyjoin)

#put max_dist = 2 if not always first extra digit 
matches <- stringdist_inner_join(df1, df2, by = "new_ID", method = "lcs", max_dist = 1) 
matches
  Product   new_ID.x  Brand      old_ID   new_ID.y
1   Prod2 7432309490 Brand2 17432309490 7432309490
2   Prod4 4741602975 Brand4 14741602975 4741602975
3   Prod6  123456789 Brand6 11234567892 1234567892

回复收藏 0 原文

~没有更多了~