An efficient way to get unique values by comparing columns in two pandas DataFrames

Posted on 2025-02-14 01:49:06

I have two dataframes, something like below:

  df1:

   date             col1          col2         col3
 15-5-2022          ABC            1            PQR
 16-5-2022          BCD            2            ABC
 17-5-2022          CDE            4            XYZ


  df2:

   date           col1          col2         col3
 5-4-2022          XYZ            1           ABC
 6-4-2022          PQR            2           ABC
 7-4-2022          BCD            4           PQR

My task is to get the total number of unique values that are in df2.col1 but not in df1.col1. I currently build a list of the unique col1 values from df1, another from df2, and then a third list containing what exists in the second list but not in the first. Since I need the count of items in the final list, I take len of the third list. My code is below:

list1 = df1.col1.unique()    
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
 

This gets the task done, but it takes a very long time to run, probably because my dataframes are quite big. Is there a smarter and more efficient way of doing this? I would appreciate any help.

Comments (2)

爱人如己 2025-02-21 01:49:06

Use:

df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].unique()

output: array(['XYZ', 'PQR'], dtype=object)

Or, with sets:

set(df2['col1']) - set(df1['col1'])

output: {'PQR', 'XYZ'}
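
Since the question asks for a count rather than the values themselves, either approach above can be reduced to a single number. The following is a minimal sketch, assuming only the col1 values from the question's sample data; the names count_isin and count_set are illustrative and not from the original post.

```python
import pandas as pd

# Only col1 from the question's sample frames is rebuilt here (assumption:
# the other columns do not matter for this count).
df1 = pd.DataFrame({'col1': ['ABC', 'BCD', 'CDE']})
df2 = pd.DataFrame({'col1': ['XYZ', 'PQR', 'BCD']})

# Count the distinct df2.col1 values that never occur in df1.col1.
count_isin = df2.loc[~df2['col1'].isin(df1['col1']), 'col1'].nunique()

# Same count via a set difference.
count_set = len(set(df2['col1']) - set(df1['col1']))

print(count_isin, count_set)
```

The list comprehension in the question is probably slow because `x not in list1` scans the whole NumPy array for every element of list2, whereas isin and set lookups rely on hashing. Note that nunique() skips missing values by default, so the two counts may differ if col1 contains NaN.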

傲影 2025-02-21 01:49:06

I ran into a similar but harder problem: comparing unique combinations of columns and getting the differences between df1 and df2. I am posting both solutions here in case you need them.

Key idea: use concat with groupby, or use merge.

To find the values shared by the two dataframes, take the unique values of df1 and df2 (call them u1 and u2), concat them, then use groupby to count the occurrences of each value. A count greater than 1 means the value appears in both u1 and u2; a count of 1 means it appears in only one of them.

If you only want the values unique to df1 or to df2, use merge with the option indicator=True.

Data set for replication:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id':['a','a','b','b','c','d','e'],
                   'val': [1,2,1,3,4,5,6]})
df2 = pd.DataFrame({'id':['a','a','b','b','c','c','e','f','f','d'],
                   'val': [1,2,1,3,4,5,5,7,8,9]})

Problem 1: a single column comparison

# Getting unique in each dataframe and concat
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1,u2])

# Groupby, count the number of occurrence with function `size`:
u.groupby('id').size().reset_index()

# You can do the rest by your choice if you want the joint or the not joint part
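
One possible way to finish "the rest" for the single-column case is to filter on the group sizes. This is only a sketch under the same sample data; the names counts, in_both, and in_one are hypothetical and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c', 'd', 'e'],
                    'val': [1, 2, 1, 3, 4, 5, 6]})
df2 = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c', 'c', 'e', 'f', 'f', 'd'],
                    'val': [1, 2, 1, 3, 4, 5, 5, 7, 8, 9]})

# Unique ids per frame, stacked into one frame.
u1 = pd.DataFrame(df1['id'].unique(), columns=['id'])
u2 = pd.DataFrame(df2['id'].unique(), columns=['id'])
u = pd.concat([u1, u2])

# Each id occurs at most once per frame, so a size of 2 means "in both"
# and a size of 1 means "in exactly one frame".
counts = u.groupby('id').size()
in_both = counts[counts > 1].index.tolist()
in_one = counts[counts == 1].index.tolist()

print(in_both)
print(in_one)
```

A size of 1 does not say which frame the value came from; for that, the merge with indicator=True is the better fit.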

Problem 2: multiple columns combination

# Getting unique combination of `id` and `val` by using trick of `size()` in `groupby`:

u1 = df1.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u2 = df2.groupby(['id', 'val']).size().reset_index().drop(columns=0)
u = pd.concat([u1,u2])

# Group by both columns and count the number of occurrences with `size`:
u.groupby(['id', 'val']).size().reset_index()

# You can do the rest by your choice if you want the joint or the not joint part

If you need the unique combinations of df1 or df2 only

# This tells you whether each unique combination belongs to df1, df2, or both:
pd.merge(u1, u2, how='outer', on=['id', 'val'], indicator=True)
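
To keep only the combinations that exist in one frame, the `_merge` column produced by indicator=True can be filtered. The sketch below assumes the same sample data and merges on both id and val; it uses drop_duplicates in place of the groupby/size trick to build u1 and u2, and the names merged, only_df2, and only_df1 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c', 'd', 'e'],
                    'val': [1, 2, 1, 3, 4, 5, 6]})
df2 = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c', 'c', 'e', 'f', 'f', 'd'],
                    'val': [1, 2, 1, 3, 4, 5, 5, 7, 8, 9]})

# Unique (id, val) combinations per frame.
u1 = df1[['id', 'val']].drop_duplicates()
u2 = df2[['id', 'val']].drop_duplicates()

# The outer merge labels every combination as left_only, right_only, or both.
merged = pd.merge(u1, u2, how='outer', on=['id', 'val'], indicator=True)

# Combinations present only in df2, and only in df1.
only_df2 = merged.loc[merged['_merge'] == 'right_only', ['id', 'val']]
only_df1 = merged.loc[merged['_merge'] == 'left_only', ['id', 'val']]

print(len(only_df2), len(only_df1))
```

len(only_df2) then gives the multi-column analogue of the count asked for in the question.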

This should speed up your old code.

Hope this helps.
