不同模糊比率的模糊匹配
我有两个大数据集。 df1约为1m行,df2约为10m行。我需要从 df2 中查找 df1 中的行的匹配项。
我已经单独发布了这个问题的原始版本。请参阅此处 。 @laurent 回答得很好,但我现在有一些额外的细节。我现在想要:
获取最终匹配数据帧的列中每个 fname 和 lname 的模糊比率
编写代码,将 fname 的模糊比率设置为 > 60,而 lname 的模糊比率设置为 > 75。换句话说,如果fname的fuzz_ratio>60并且lname的fuzz_ratio>75,则发生真正的匹配;否则不是真正的匹配。如果 fname 的模糊比率 == 80 而 lname 的模糊比率 == 60,则匹配不正确。虽然我知道这可以从 (1) 中作为事后过滤来完成,但在不同匹配的编码阶段执行此操作是有意义的。
我在这里发布了我的数据示例。 @laurent 针对原始问题的解决方案可以在上面的链接中找到。
import pandas as pd
df1 = pd.DataFrame(
{
"ein": {0: 1001, 1: 1500, 2: 3000},
"ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
}
)
df2 = pd.DataFrame(
{
"lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
"fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
"score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
}
)
预期输出为:
df3 = pd.DataFrame(
{
"df1_ein": {0: 1001, 1: 1500, 2: 3000},
"df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
"fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
"fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
"df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
"df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
"df2_score": {0: 2.5, 1: 3.5, 2: NA},
}
)
上述预期输出的注释:根据我指定的模糊比率,Bradley Cupper 与 Bradley Cooper 不匹配。与布拉德利·库珀 (Bradley Cooper) 更匹配的是 M·布拉德·库珀 (M Brad Couper)。同样,托马斯·克鲁斯与汤姆·克鲁斯匹配,而不是与托马斯·克鲁兹匹配。
我主要是Stata的用户(哈哈),reclink2 ado文件理论上可以做到上述,即如果Stata可以处理数据的大小。然而,考虑到我拥有的数据量,几小时后甚至没有任何启动。
I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. Well answered by @laurent but I have some added specificities now. I would now like to:
Get the fuzz ratios for each of fname and lname in a column in my final matched dataframe
Write the code such that fuzz ratio for fname is set to >60, while fuzz ratio for lname is set to >75. In other words, a true match occurs if fuzz_ratio for fname>60 and fuzz ratio for lname>75; otherwise not a true match. A match would not be true if fuzz ratio for fname==80 while fuzz ratio for lname==60. While I understand that this can be done from (1) as a post-hoc filtering, it would make sense to do this at the stage of coding for a different matching.
I post here an example of my data. The solution by @laurent for the original problem can be found in the above link.
import pandas as pd
df1 = pd.DataFrame(
{
"ein": {0: 1001, 1: 1500, 2: 3000},
"ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
}
)
df2 = pd.DataFrame(
{
"lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
"fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
"score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
}
)
Expected output is:
df3 = pd.DataFrame(
{
"df1_ein": {0: 1001, 1: 1500, 2: 3000},
"df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
"df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
"df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
"fuzz_ratio_lname": {0: 83, 1: 100, 2: NA},
"fuzz_ratio_fname": {0: 62, 1: 67, 2: NA},
"df2_lname": {0: "Couper", 1: "Cruise", 2: "NA"},
"df2_fname": {0: "M Brad", 1: "Tom", 2: "NA"},
"df2_score": {0: 2.5, 1: 3.5, 2: NA},
}
)
Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios that I assigned. The better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches with Tom Cruise rather than with Thomas Cruz.
I am a user of Stata primarily (haha) and the reclink2 ado file can do the above in theory, i.e. if Stata can handle the size of the data. However, with the size of data I have, nothing even starts after hours.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这是一种方法:
Here is one way to do it: