匹配R数据框列中的部分数值与另一列中的值匹配
我有一个r数据框 df1 ,看起来像下面:
Product new_ID
Prod1 129000007
Prod2 7432309490
Prod3 1708289014
Prod4 4741602975
Prod5 906485301012
另一个, df2 ,看起来像:
Brand old_ID
Brand1 13554998333
Brand2 17432309490
Brand3 14300012960
Brand4 14741602975
Brand5 2710420383988
给出某种上下文,数据来自两个不同的数据库代码(列new_id
和old_id
)的表示略有不同。例如,与new_id
中的值相比,old_id
列值中的prod 2和Brand 2在old_id
列值中相同。对于产品4和Brand 4。请注意,new_id
中的所有代码都是 old_id
。
编辑:另请注意,new_id
和old_id
值之间的差异并非总是领先数字的差异。有时,old_id
值的第一个也是最后一个数字被删除以获取new_id
值。
因此,我想在 df2 中找到所有的行,这些行使用字段old_id
和new_id
df1 中的产品。我可以想到使用old_id
使用grepl
中的new_id
值。但是我认为这一次只能完成。
是否有更好的方法可以在new_id
值的向量匹配old_id列
中?
I have a R data frame df1 that looks like below:
Product new_ID
Prod1 129000007
Prod2 7432309490
Prod3 1708289014
Prod4 4741602975
Prod5 906485301012
And another one, df2, which looks like:
Brand old_ID
Brand1 13554998333
Brand2 17432309490
Brand3 14300012960
Brand4 14741602975
Brand5 2710420383988
To give some context, the data comes from two different databases where product codes (columns new_ID
and old_ID
respectively) are represented slightly differently. For example, Prod 2 and Brand 2 are the same with one extra digit in the old_ID
column value compared to the value in new_ID
. Same for Prod 4 and Brand 4. Note that all the codes in new_ID
are not in old_ID
.
Edit: Also note that the difference between the new_ID
and old_ID
value is not always that of the leading digit. Sometimes the first and last digit of an old_ID
value is dropped to get new_ID
value.
So I want to find all the rows in df2 which contain the products in df1 using the fields old_ID
and new_ID
. I could think of matching the new_ID
value in old_ID
using grepl
. But I think that can be done only one at a time.
Is there a better way to find a match of a vector of new_ID
values in the old_ID column
?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
如果确定差异只是一个额外的指导数字,那么您可以删除领先的额外数字并使用合并来查找匹配项:
编辑
:
如果总是有额外的数字和可能的其他东西,那么FuzzyJoin可以完成这项工作。
我添加了prod6和brand6和另一个最后的数字
If for sure the difference is only an extra leading digit then you can just remove the leading extra digit and use merge to find matches:
which gives:
EDIT:
If there s always a leading extra digit and potentially other stuff then fuzzyjoin can do the job.
I added Prod6 and Brand6 with another final digit