Match two columns based on string distance in R
I have two very large dataframes containing names of people. The two dataframes report different information on these people (df1 reports data on health status, df2 on socio-economic status). A subset of people appears in both dataframes. This is the sample I am interested in.
I need to create a new dataframe that includes only the people appearing in both datasets. There are, however, small differences in the names between the two, mostly due to typos.
My data looks like this:
df1
name | smoker | age
"Joe Smith" | Yes | 43
"Michael Fagin" | Yes | 35
"Ellen McFarlan" | No | 55
...
...
df2
name | occupation | location
"Joe Smit" | Postdoc | London
"Joan Evans" | IT consultant | Bristol
"Michael Fegin" | Lawyer | Liverpool
...
...
What I need is a third dataframe, df3, with the following information:
df3
name1 | name2 | distance | smoker | age | occupation | location
"Joe Smith" | "Joe Smit" | a measure of their Jaro distance | Yes | 43 | Postdoc | London
"Michael Fagin" | "Michael Fegin" | a measure of their Jaro distance | Yes | 35 | Lawyer | Liverpool
...
...
So far I have used the stringdist package to get a vector of possible matches, but I am struggling to turn that into a new dataframe with the columns shown above. Many thanks in advance if anyone has an idea for this!
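One possible way to get from pairwise distances to df3 is sketched below. It is only a sketch under some assumptions: the two small dataframes are rebuilt from the sample rows in the question, the cutoff `max_dist = 0.1` is a hypothetical value that would need tuning on the real data, and each name in df1 is paired with at most one name in df2 (its nearest neighbour). In stringdist, `method = "jw"` with `p = 0` gives the plain Jaro distance.

```r
library(stringdist)

# Sample rows copied from the question
df1 <- data.frame(
  name   = c("Joe Smith", "Michael Fagin", "Ellen McFarlan"),
  smoker = c("Yes", "Yes", "No"),
  age    = c(43, 35, 55),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  name       = c("Joe Smit", "Joan Evans", "Michael Fegin"),
  occupation = c("Postdoc", "IT consultant", "Lawyer"),
  location   = c("London", "Bristol", "Liverpool"),
  stringsAsFactors = FALSE
)

# Pairwise Jaro distances: rows index df1, columns index df2
d <- stringdistmatrix(df1$name, df2$name, method = "jw", p = 0)

# For every df1 name, find the closest df2 name and its distance
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_len(nrow(df1)), best)]

# Hypothetical cutoff: pairs further apart than this are treated as non-matches
max_dist <- 0.1
keep     <- best_dist <= max_dist

# Assemble df3 from the matched rows of both dataframes
df3 <- data.frame(
  name1    = df1$name[keep],
  name2    = df2$name[best[keep]],
  distance = best_dist[keep],
  df1[keep, c("smoker", "age")],
  df2[best[keep], c("occupation", "location")],
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(df3)
```

On the sample rows this keeps the "Joe Smith" and "Michael Fagin" pairs and drops "Ellen McFarlan", who has no close counterpart in df2. Note that the full distance matrix has `nrow(df1) * nrow(df2)` entries, so for very large dataframes it may not fit in memory; `stringdist::amatch()` with its `maxDist` argument returns the index of the nearest match directly and may be a better fit in that case.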
Created on 2022-03-01 by the reprex package (v2.0.0)