两个 DataFrame 之间的不匹配行

发布于 01-11 13:24 字数 237 浏览 5 评论 0原文

我是 Python 新手,想寻求您的帮助。 我想找出 2 个数据帧(即 df1 和 df2)之间有数千行的不匹配行。它们都包含相同数量且具有相同名称的列。

df2 有 10 个比 df1 少的条目,我正在尝试找出它们是什么。 我已尝试 pd.concat([df1,df2]).drop_duplicates(keep=False) 但它返回零结果。

可能是什么原因? 任何帮助/建议将不胜感激。 多谢。

I am new to Python and would like to seek you help on this please.
I would like to find out the non-matching rows between 2 dataframes i.e. df1 and df2 with thousands of rows. They both contain the same number of columns with the same name.

df2 has 10 entries lesser than df1 which I am trying to find out what they are.
I have tried pd.concat([df1,df2]).drop_duplicates(keep=False) but it returns zero result.

What could be the reason?
Any help/advice would be much appreciated.
Thanks a lot.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

我的奇迹2025-01-18 13:24:45

以下代码将删除 df1 中存在于 df2 中的行

df1[~df1.isin(df2)]

The following code will remove rows in df1 that are present in df2

df1[~df1.isin(df2)]
攒眉千度2025-01-18 13:24:45

concat 组合两个框架。您正在尝试找出两个框架之间的差异。这可以通过比较来完成。正如文档示例所示,给定这两个框架:

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

您可以使用 compare 找到不同的行:

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

在这种情况下,compare 为每一行有差异的行返回一行,并且仅实际上不同的列。

compare 也可以返回相等的值或所有原始值:

df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0

concat combines two frames. You're trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

You can find the different rows with compare :

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

In this case compare returns one row for every row that has a difference and only the columns that are actually different.

compare can return the equal values or all original values as well:

df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

or

df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
娇柔作态2025-01-18 13:24:45

仅仅因为您的条目少了 10 个,并不意味着您会找到重复项。您可能在第一个数据框中已经有重复项。

演示:

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})

# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

您的代码:

>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []

尝试:

>>> len(df1.drop_duplicates())
5

>>> len(df2.drop_duplicates())
5

Just because you have 10 less entries doesn't mean you're going to find duplicates. You probably already have duplicates inside the first dataframe.

Demo:

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})

# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

Your code:

>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []

Try:

>>> len(df1.drop_duplicates())
5

>>> len(df2.drop_duplicates())
5
旧时模样2025-01-18 13:24:45

假设 df1 和 df2 都是 Pandas Dataframe,以下代码对于匹配的行返回 True,对于另一行返回 false:

print((df1 == df2).any(1))

如果需要检查所有行中的每一列,请尝试以下操作:

print((df1 == df2).stack())

Assuming both df1 and df2 are Pandas Dataframe, the following code returns True for matching rows and false for the other:

print((df1 == df2).any(1))

If needed to check each and every columns in all the rows, try this:

print((df1 == df2).stack())
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文