Compare dfs, including detailed insights into the data

Posted on 2025-01-13 01:58:17


I'm working on a Python project:

df_testR with columns={'Name', 'City','Licence', 'Amount'}

df_testF with columns={'Name', 'City','Licence', 'Amount'}

I want to compare both df's. The result should be a df where I can see Name, City, Licence and Amount. Normally, df_testR and df_testF should be exactly the same.
If they are not, I want to see the difference as Amount_R vs Amount_F.

I referred to: Diff between two dataframes in pandas

But I receive a table with TRUE and FALSE only:

Name   City   Licence   Amount
True   True   True      False
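
I assume that boolean table comes from an element-wise comparison, something along these lines (a minimal sketch; it only says whether each cell matches, not what the differing values are):

equal_mask = df_testR.eq(df_testF)   # element-wise True/False, one value per cell
print(equal_mask)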

But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data, for example like this:

Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500

Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as Amount and table F contains 500. I want my analysis to return a table that captures only the lines where such differences occur.

Could someone help?


Comments (1)

各空 2025-01-20 01:58:17

import copy
import pandas as pd

# Sample data: df2 differs from df1 in one Amount (row 0: 100 vs 500) and is missing row 1.
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)  # drop the row with index label 1 ('B') from df2
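
Printing the two frames at this point should show something like:

print(df1)
#   Name City Licence  Amount
# 0    A   SF     YES     100
# 1    B   LA      NO     200
# 2    C   NY      NO     300
print(df2)
#   Name City Licence  Amount
# 0    A   SF     YES     500
# 2    C   NY      NO     300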

First find the missing rows and print them:

# Element-wise check: which cells of df1 also appear at the same label in df2.
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns].copy()  # copy to avoid a SettingWithCopyWarning
# A row of df1 counts as present in df2 only if all of its metadata columns match.
metadata_match['check'] = metadata_match.apply(all, axis=1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.loc[missing_rows, :])  # missing_rows holds index labels, so use .loc
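
With the sample data above, row 1 ('B') is absent from df2, so this should print something like:

# Some rows are missing from df2:
#   Name City Licence  Amount
# 1    B   LA      NO     200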

Then drop these rows and merge:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)  # overlapping 'Amount' columns get the default '_x'/'_y' suffixes
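
If you prefer the column names from the question (Amount_R / Amount_F) over the default _x / _y, the suffixes argument of pd.merge can set them; assuming df2 plays the role of df_testF (the left frame here) and df1 the role of df_testR:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns, suffixes=('_F', '_R'))
# the filter below would then compare 'Amount_F' and 'Amount_R' instead of 'Amount_x'/'Amount_y'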

Now remove the rows that have the same amount:

df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]  # keep only rows where the two Amounts disagree

I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
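
As a side note, when both frames contain exactly the same key rows in the same order (the "normally identical" case from the question), a sketch using the built-in DataFrame.compare (pandas >= 1.1; the result_names argument needs >= 1.5) gives a similar difference table directly, with the R/F naming assumed here:

keys = ['Name', 'City', 'Licence']
diff = df_testR.set_index(keys).compare(df_testF.set_index(keys), result_names=('R', 'F'))
# 'diff' keeps only the rows and columns that differ, with sub-columns 'R' and 'F' under 'Amount'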
