How do I filter this DataFrame using pandas groupby?

Posted on 2025-01-29 13:39:25

Using Python, how can I use a groupby to filter this dataset?

Start

First     Last     Location      ID1              ID2             First3   Last3
John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee
Stac      Wong     Aurora        StacAurora       LeeAurora       Sta      Won
Stac      Lee      Newmarket     StacNewmarket    LeeNewmarket    Sta      Lee
Steve     Smith    Toronto       SteveToronto     SmithToronto    Ste      Smi
John      Jones    Toronto       JohnToronto      JonesToronto    Joh      Jon

How can I keep every row that satisfies either of the two conditions below and filter out everything that meets neither?

  • ID1 - matches another row's ID1 and the Last3 values are the same
  • ID2 - matches another row's ID2 and the First3 values are the same

End

 First     Last     Location      ID1              ID2             First3   Last3
 John      Smith    Toronto       JohnToronto      SmithToronto    Joh      Smi
 Joh       Smith    Toronto       JohToronto       SmithToronto    Joh      Smi
 Steph     Sax      Vancouver     StephVancouver   SaxVancouver    Ste      Sax
 Steph     Sa       Vancouver     StephVancouver   SaxeVancouver   Ste      Sax 
 Stacy     Lee      Markham       StacyMarkham     LeeMarkham      Sta      Lee
 Stac      Lee      Markham       StacMarkham      LeeMarkham      Sta      Lee

Comments (2)

蓝天白云 2025-02-05 13:39:26

You may use:

import pandas as pd

# Note: ID1 at rows 1 and 5 and ID2 at row 3 differ from the question's table.
df = pd.DataFrame({
    'First':['John', 'Joh', 'Steph', 'Steph', 'Stacy', 'Stac', 'Stac', 'Stac', 'Steve', 'John'],
    'Last':['Smith', 'Smith', 'Sax', 'Sa', 'Lee', 'Lee', 'Wong', 'Lee', 'Smith', 'Jones'],
    'Location':['Toronto', 'Toronto', 'Vancouver', 'Vancouver', 'Markham',
                'Markham', 'Aurora', 'Newmarket', 'Toronto', 'Toronto'],
    'ID1':['JohnToronto', 'JohnToronto', 'StephVancouver', 'StephVancouver', 'StacyMarkham',
          'StacyMarkham','StacAurora', 'StacNewmarket','SteveToronto','JohnToronto'],
    'ID2':['SmithToronto','SmithToronto','SaxVancouver','SaxVancouver',
          'LeeMarkham','LeeMarkham','LeeAurora','LeeNewmarket','SmithToronto','JonesToronto'],
    'First3':['Joh','Joh','Ste','Ste','Sta','Sta','Sta','Sta','Ste','Joh'],
    'Last3':['Smi','Smi','Sax','Sax','Lee','Lee','Won','Lee','Smi','Jon']
})
# m1: rows whose (ID1, Last3) pair occurs more than once
m1 = df.duplicated(subset=['ID1','Last3'], keep=False)
# m2: among the m1 rows, those whose (ID2, First3) pair also occurs more than once;
# reindexed to the full frame so both masks share the same index
m2 = df[m1].duplicated(subset=['ID2','First3'], keep=False).reindex(df.index, fill_value=False)
# keep only the rows that satisfy both conditions
df = df[m1 & m2]

黯然 2025-02-05 13:39:25

Based on the comment clarifying the problem statement:

Trying to group by ID1 or ID2, and then, depending on which ID is used, filter on whether the Last3 or First3 column, respectively, is the same.

Try this approach:

# Group by ID1 and flag rows whose Last3 is duplicated within the group,
# then extract the original index labels of the flagged rows
c1 = df.groupby('ID1').apply(pd.DataFrame.duplicated, subset=['Last3'], keep=False)
c1_idx = c1[c1].droplevel(0).index

# Group by ID2 and flag rows whose First3 is duplicated within the group,
# then extract the original index labels of the flagged rows
c2 = df.groupby('ID2').apply(pd.DataFrame.duplicated, subset=['First3'], keep=False)
c2_idx = c2[c2].droplevel(0).index

# Take the union of the two sets of index labels and select those rows;
# .loc is used because these are labels, not positions
output = df.loc[c1_idx.union(c2_idx)]
print(output)
   First   Last   Location             ID1            ID2 First3 Last3
0   John  Smith    Toronto     JohnToronto   SmithToronto    Joh   Smi
1    Joh  Smith    Toronto      JohToronto   SmithToronto    Joh   Smi
2  Steph    Sax  Vancouver  StephVancouver   SaxVancouver    Ste   Sax
3  Steph     Sa  Vancouver  StephVancouver  SaxeVancouver    Ste   Sax
4  Stacy    Lee    Markham    StacyMarkham     LeeMarkham    Sta   Lee
5   Stac    Lee    Markham     StacMarkham     LeeMarkham    Sta   Lee

EDIT: Modifying the above answer provided by @SomeDude, you can also run this as two independent conditions without a groupby and take an OR between them:

m1 = df.duplicated(subset=['ID1','Last3'],keep=False)
m2 = df.duplicated(subset=['ID2','First3'],keep=False)
df[m1 | m2]
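
For anyone who wants to run this end to end, here is a minimal, self-contained sketch. It rebuilds the table exactly as it appears in the question (so ID1 is 'JohToronto' at row 1 and 'StacMarkham' at row 5, and ID2 is 'SaxeVancouver' at row 3) and expresses the two conditions with groupby(...).transform('size'), which is one possible way to keep an actual groupby in the solution. The names g1, g2, d1, d2, result and masks_agree are only illustrative, and the comparison against duplicated is there as a sanity check under the assumption that the key columns contain no missing values.

```python
import pandas as pd

# Data copied from the question's "Start" table.
df = pd.DataFrame({
    "First": ["John", "Joh", "Steph", "Steph", "Stacy", "Stac", "Stac", "Stac", "Steve", "John"],
    "Last": ["Smith", "Smith", "Sax", "Sa", "Lee", "Lee", "Wong", "Lee", "Smith", "Jones"],
    "Location": ["Toronto", "Toronto", "Vancouver", "Vancouver", "Markham",
                 "Markham", "Aurora", "Newmarket", "Toronto", "Toronto"],
    "ID1": ["JohnToronto", "JohToronto", "StephVancouver", "StephVancouver", "StacyMarkham",
            "StacMarkham", "StacAurora", "StacNewmarket", "SteveToronto", "JohnToronto"],
    "ID2": ["SmithToronto", "SmithToronto", "SaxVancouver", "SaxeVancouver", "LeeMarkham",
            "LeeMarkham", "LeeAurora", "LeeNewmarket", "SmithToronto", "JonesToronto"],
    "First3": ["Joh", "Joh", "Ste", "Ste", "Sta", "Sta", "Sta", "Sta", "Ste", "Joh"],
    "Last3": ["Smi", "Smi", "Sax", "Sax", "Lee", "Lee", "Won", "Lee", "Smi", "Jon"],
})

# Condition 1: at least one other row shares the same (ID1, Last3) pair.
g1 = df.groupby(["ID1", "Last3"])["ID1"].transform("size") > 1
# Condition 2: at least one other row shares the same (ID2, First3) pair.
g2 = df.groupby(["ID2", "First3"])["ID2"].transform("size") > 1

# Keep the rows that satisfy either condition.
result = df[g1 | g2]

# Sanity check: the same masks built with duplicated(keep=False).
d1 = df.duplicated(subset=["ID1", "Last3"], keep=False)
d2 = df.duplicated(subset=["ID2", "First3"], keep=False)
masks_agree = (g1 | g2).tolist() == (d1 | d2).tolist()

print(result)
print("groupby and duplicated masks agree:", masks_agree)
```

On this data the groupby masks and the duplicated masks select the same rows, so the choice between them is mostly a matter of readability; duplicated is shorter, while the transform version makes the group size available if you later need a threshold other than "more than one".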