需要帮助清洁数据-Python多重组
我有一个巨大的数据框,它们是名称拼写的许多错别字。
我一直在处理的数据框架
First Last Location ID1 ID2
John Smith Calgary JohnCalgary SmithCalgary
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Victor Jones Toronto VictorToronto JonesToronto
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
Stacy Lee Calgary StacyCalgary LeeCalgary
是代码,
data = {'First':['John', 'John', 'Joh', 'Steph','Steph','Victor','Stacy','Stac','Stacy'],
'Last':['Smith','Smith','Smith','Sax','Saxe','Jones','Lee','Lee','Lee'],
'Location':['Caglary','Toronto','Toronto','Vancouver','Vancouver','Toronto','Markham','Markham','Calgary'],
'ID1': ['JohnCalgary', 'JohnToronto', 'JohToronto', 'StephVancouver','StephVancouver',VictorToronto', 'StacyMarkham', 'StacMarkham',StacyCalgary'],
'ID2':['SmithCalgary','SmithToronto', 'SmithToronto', 'SaxeVancouver', 'SaVancouver', 'JonesToronto', 'LeeMarkham', 'LeeMarkham', 'LeeCalgary']
}
甚至尝试使用ID1和ID2进行组比,如果ID2匹配另一个ID2,而名字则不同,并且vice vice vice vice vice vice -vice -vice vice -vice vice vice仍然很多 - 我需要过滤更多。
我该怎么做,以便在
- id1-匹配另一个ID1,最后3个是同一
- ID2-与另一个ID2匹配,而前3个是相同的
所需的,请缩小数据集,
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
这就是我到目前为止尝试的
m1 = df10.groupby('ID1')['ID2'],('Last3').transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1']('First3').transform('nunique').gt(1)
out10 = df10[m1|m2]
I have a huge dataframe and they are many typos for the spelling of names.
This dataframe I been working on
First Last Location ID1 ID2
John Smith Calgary JohnCalgary SmithCalgary
John Smith Toronto JohnToronto SmithToronto
Joh Smith Toronto JohToronto SmithToronto
Steph Sax Vancouver StephVancouver SaxVancouver
Steph Sa Vancouver StephVancouver SaVancouver
Victor Jones Toronto VictorToronto JonesToronto
Stacy Lee Markham StacyMarkham LeeMarkham
Stac Lee Markham StacMarkham LeeMarkham
Stacy Lee Calgary StacyCalgary LeeCalgary
This is the code
data = {'First':['John', 'John', 'Joh', 'Steph','Steph','Victor','Stacy','Stac','Stacy'],
'Last':['Smith','Smith','Smith','Sax','Saxe','Jones','Lee','Lee','Lee'],
'Location':['Caglary','Toronto','Toronto','Vancouver','Vancouver','Toronto','Markham','Markham','Calgary'],
'ID1': ['JohnCalgary', 'JohnToronto', 'JohToronto', 'StephVancouver','StephVancouver',VictorToronto', 'StacyMarkham', 'StacMarkham',StacyCalgary'],
'ID2':['SmithCalgary','SmithToronto', 'SmithToronto', 'SaxeVancouver', 'SaVancouver', 'JonesToronto', 'LeeMarkham', 'LeeMarkham', 'LeeCalgary']
}
Even trying to do a groupby using ID1 and ID2 where if the ID2 matches another ID2 and the first name is different and vice verse - there are still so many typos so I need to filter more.
How can I make it so that where
- ID1 - Matches another ID1 and the Last3 are the same
- ID2 - Matches another ID2 and the First 3 are the same
Desired, narrowed down dataset
First Last Location ID1 ID2 First3 Last3
John Smith Toronto JohnToronto SmithToronto Joh Smi
Joh Smith Toronto JohToronto SmithToronto Joh Smi
Steph Sax Vancouver StephVancouver SaxVancouver Ste Sax
Steph Sa Vancouver StephVancouver SaxeVancouver Ste Sax
Stacy Lee Markham StacyMarkham LeeMarkham Sta Lee
Stac Lee Markham StacMarkham LeeMarkham Sta Lee
This is what I was trying so far
m1 = df10.groupby('ID1')['ID2'],('Last3').transform('nunique').gt(1)
m2 = df10.groupby('ID2')['ID1']('First3').transform('nunique').gt(1)
out10 = df10[m1|m2]
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论