通过比较两个dataframes pandas中的列来获取唯一值的有效方法
我有两个数据范围,类似于下面:
df1:
date col1 col2 col3
15-5-2022 ABC 1 PQR
16-5-2022 BCD 2 ABC
17-5-2022 CDE 4 XYZ
df2:
date col1 col2 col3
5-4-2022 XYZ 1 ABC
6-4-2022 PQR 2 ABC
7-4-2022 BCD 4 PQR
我的任务是获取DF2.COL1中的唯一值总数,但在df1.col1中没有。我这样做的方式是首先创建DF1的所有COL1唯一值的列表,然后从DF2创建一个唯一值,然后比较这两个列表,并创建第三个列表与第二列表中的内容,但不是第一个列表。由于我需要最终列表中的项目计数,因此我在第三列表上进行了LEN。我的代码如下:
list1 = df1.col1.unique()
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
这正在完成我的任务,但是花很长时间运行,可能是因为我的DFS很大。我想知道是否有一种更聪明,更有效的方法。感谢任何帮助
I have two dataframes, something like below:
df1:
date col1 col2 col3
15-5-2022 ABC 1 PQR
16-5-2022 BCD 2 ABC
17-5-2022 CDE 4 XYZ
df2:
date col1 col2 col3
5-4-2022 XYZ 1 ABC
6-4-2022 PQR 2 ABC
7-4-2022 BCD 4 PQR
My task is to get total number of unique values that are in df2.col1 but not in df1.col1. The way I am doing this is by creating first a list of all col1 unique values from df1 and then from df2 and then comparing these two lists and creating a third list with what exists in second list but not the first. Since I need the count of items in the final list, I am doing a len on third list. My code is like below:
list1 = df1.col1.unique()
list2 = df2.col1.unique()
list3 = [x for x in list2 if x not in list1]
num_list3 = len(list3)
This is getting my task done, but taking a very long time to run, probably because my dfs are quite big. I was wondering if there is a smarter and more efficient way of doing this please. I would appreciate any help
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
使用:
输出:
数组(['xyz','pqr'],dtype = object)
or sets:oumpts:
output:
{'pqr','xyz'}
>Use:
output:
array(['XYZ', 'PQR'], dtype=object)
Or, with sets:
output:
{'PQR', 'XYZ'}
我遇到了类似的问题,但更困难。我的问题是比较唯一组合并获得
df1
和df2
之间的差异。在这里,我发布了两个解决方案,以防您需要它们。解决方案密钥想法:使用
concat
,groupby
或MERGE
的技巧。如果您想获得两个数据范围的关节或互助。通过获取
df1
的唯一,df2
,称为u1
和u2
。您concatu1
和u2
,然后使用groupby
来计数发生数量。如果超过1,则在u1
和u2
中出现。如果是1,则显示在两个u1
或u2
中。如果仅要从
df1
或df2
获得唯一内容,则使用Merge
trick with optionindoce
indistator = true
复制数据集:
问题1:单列比较
问题2:多列组合
df1
或df2
仅这应该加快您的旧代码。
希望这有所帮助
I encounter a similar problem but more difficult. My problem is to compare the unique combination and get the differences between
df1
anddf2
. Here I post the 2 solutions in case you need them.Solution key idea: Using a trick of
concat
,groupby
ormerge
.If you want to get the joint or the mutual of the two dataframes. By getting unique of
df1
, anddf2
, calledu1
andu2
. You concatu1
, andu2
, then usegroupby
to count for the number of occurance. If more than 1 then it appears in bothu1
andu2
. If it is 1, then it appears in one of the twou1
oru2
.If you want to get the uniques from
df1
ordf2
only, then usemerge
trick with optionindicator=True
Data set for replication:
Problem 1: a single column comparison
Problem 2: multiple columns combination
If you need uique of
df1
ordf2
onlyThis should speed up your old code.
Hope this help