Compare dfs, including detailed insights into the data

Posted on 2025-01-13 01:58:17


I'm working on a Python project:

df_testR with columns={'Name', 'City','Licence', 'Amount'}

df_testF with columns={'Name', 'City','Licence', 'Amount'}

I want to compare both df's. The result should be a df where I can see Name, City, Licence and Amount. Normally, df_testR and df_testF should be exactly the same.
If they are not, I want to see the difference as Amount_R vs Amount_F.

I referred to: Diff between two dataframes in pandas

But I receive a table with TRUE and FALSE only:

Name   City   Licence   Amount
True   True   True      False
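
I assume that boolean table comes from an element-wise comparison, something along these lines (a minimal sketch; it only says whether each cell matches, not what the differing values are):

equal_mask = df_testR.eq(df_testF)   # element-wise True/False, one value per cell
print(equal_mask)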

But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data, for example like this:

Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500

Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as Amount and table F contains 500. I want my analysis to return a table that captures only the lines where such differences occur.

Could someone help?


Comments (1)

各空 2025-01-20 01:58:17

import copy
import pandas as pd

# Sample data: df2 differs from df1 in one Amount (row 0: 100 vs 500) and is missing row 1.
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)  # drop the row with index label 1 ('B') from df2
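
Printing the two frames at this point should show something like:

print(df1)
#   Name City Licence  Amount
# 0    A   SF     YES     100
# 1    B   LA      NO     200
# 2    C   NY      NO     300
print(df2)
#   Name City Licence  Amount
# 0    A   SF     YES     500
# 2    C   NY      NO     300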

First find the missing rows and print them:

# Element-wise check: which cells of df1 also appear at the same label in df2.
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns].copy()  # copy to avoid a SettingWithCopyWarning
# A row of df1 counts as present in df2 only if all of its metadata columns match.
metadata_match['check'] = metadata_match.apply(all, axis=1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.loc[missing_rows, :])  # missing_rows holds index labels, so use .loc
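
With the sample data above, row 1 ('B') is absent from df2, so this should print something like:

# Some rows are missing from df2:
#   Name City Licence  Amount
# 1    B   LA      NO     200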

Then drop these rows and merge:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)  # overlapping 'Amount' columns get the default '_x'/'_y' suffixes
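
If you prefer the column names from the question (Amount_R / Amount_F) over the default _x / _y, the suffixes argument of pd.merge can set them; assuming df2 plays the role of df_testF (the left frame here) and df1 the role of df_testR:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns, suffixes=('_F', '_R'))
# the filter below would then compare 'Amount_F' and 'Amount_R' instead of 'Amount_x'/'Amount_y'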

Now remove the rows that have the same amount:

df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]  # keep only rows where the two Amounts disagree

I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
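
As a side note, when both frames contain exactly the same key rows in the same order (the "normally identical" case from the question), a sketch using the built-in DataFrame.compare (pandas >= 1.1; the result_names argument needs >= 1.5) gives a similar difference table directly, with the R/F naming assumed here:

keys = ['Name', 'City', 'Licence']
diff = df_testR.set_index(keys).compare(df_testF.set_index(keys), result_names=('R', 'F'))
# 'diff' keeps only the rows and columns that differ, with sub-columns 'R' and 'F' under 'Amount'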
