pandas: append rows to another dataframe under similar rows based on multiple columns


I asked quite a similar question here, but was wondering if there is a way to tackle the issue if one has to rely on multiple columns to perform the append.
So the dataframes look as follows:

import pandas as pd

d1 = {'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
      'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
      'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
      'col4': ['min', 'max', 'mid', 'min'],
      'col5': ['min', 'max', 'max', 'max']}

d2 = {'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
      'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
      'col3': ['I went out', ' the house was amazing', 'the house is nice'],
      'col4': ['min', 'max', 'max'],
      'col5': ['max', 'min', 'mid']}

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

So this time, I would like to append the rows from df2 under the similar rows in df1, but only if the rows match in all of col1, col2 and col3. So the output should be:

                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

So I tried the following:

df = pd.concat(df1[df1.set_index(['col1','col2','col3']).index.isin(df2.set_index(['col1','col2','col3']).index)]).sort_values(df1.set_index(['col1','col2','col3']).index, ignore_index=True)

but I get this error:

 TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"


Comments (3)

梦里寻她 2025-01-21 16:11:53


Another solution is to use pd.merge and pd.wide_to_long:

out = (
    pd.wide_to_long(
        pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index(),
        stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
      .sort_index().reset_index(drop=True)[df1.columns]
)

Output:

>>> out
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid

Step by step

# Step 1: merge
>>> out = pd.merge(df1, df2, how='inner', on=['col1', 'col2', 'col3']).reset_index()
   index                      col1                 col2               col3 col4_x col5_x col4_y col5_y
0      0              I ate dinner         I ate dinner         I went out    min    min    min    max
1      1  the play was inetresting  the gold is shining  the house is nice    min    max    max    mid

# Step 2: wide_to_long
>>> out = pd.wide_to_long(out, stubnames=['col4', 'col5'], i='index', j='val', sep='_', suffix=r'[xy]')
                        col3                 col2                      col1 col4 col5
index val                                                                            
0     x           I went out         I ate dinner              I ate dinner  min  min
1     x    the house is nice  the gold is shining  the play was inetresting  min  max
0     y           I went out         I ate dinner              I ate dinner  min  max
1     y    the house is nice  the gold is shining  the play was inetresting  max  mid

# Step 3: reorder dataframe
>>> out = out.sort_index().reset_index(drop=True)[df1.columns]
                       col1                 col2               col3 col4 col5
0              I ate dinner         I ate dinner         I went out  min  min
1              I ate dinner         I ate dinner         I went out  min  max
2  the play was inetresting  the gold is shining  the house is nice  min  max
3  the play was inetresting  the gold is shining  the house is nice  max  mid
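
If wide_to_long feels opaque, roughly the same interleaving can be sketched with a plain merge followed by a stable sort on the shared index. This is only an illustrative variant of the idea above, not a different method:

# illustrative sketch: split the suffixed value columns back into two frames
keys = ['col1', 'col2', 'col3']
merged = pd.merge(df1, df2, how='inner', on=keys)

left = merged[keys + ['col4_x', 'col5_x']].rename(columns={'col4_x': 'col4', 'col5_x': 'col5'})
right = merged[keys + ['col4_y', 'col5_y']].rename(columns={'col4_y': 'col4', 'col5_y': 'col5'})

# a stable sort on the shared RangeIndex interleaves each df1 row with its df2 match
out = pd.concat([left, right]).sort_index(kind='stable').reset_index(drop=True)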
陌伤浅笑 2025-01-21 16:11:53


OK, I realized my own mistake and will post an answer here in case it is interesting for anyone (the answer is based on the link in the question):

print(pd.concat([df1, df2[df2.set_index(['col1','col2','col3']).index.isin(df1.set_index(['col1','col2','col3']).index)]]).sort_values(['col1','col2','col3'], ignore_index=True))
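
For readability, the same logic can be split into steps (just a sketch of the one-liner above). It also shows where the original TypeError came from: pd.concat expects a list (or another iterable) of pandas objects, not a bare DataFrame.

keys = ['col1', 'col2', 'col3']

# rows of df2 whose (col1, col2, col3) combination also appears in df1
mask = df2.set_index(keys).index.isin(df1.set_index(keys).index)

# pd.concat takes a *list* of frames; passing a single DataFrame raises the TypeError above
result = pd.concat([df1, df2[mask]]).sort_values(keys, ignore_index=True)
print(result)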
音栖息无 2025-01-21 16:11:53


I strongly advise you to provide a minimal numeric example instead of a text-based one: it is easier to read and easier to understand. That being said, if I understand correctly, for each row of df1 you want to:

  1. Check whether some rows of df2 have the same values in certain columns.
  2. Append those rows to df1, right after the row in question.

Of course, we can discuss how duplicates in df1 should be handled. Then, we can write two solutions: one using a for loop and another using pandas' functional style (which one you prefer depends on your skills, habits, and other preferences).

A for-loop approach

Let's assume there are no duplicates in df1; then:

df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

cols = ["col1", "col2", "col3"]

## create a new dataframe
new_df = pd.DataFrame()

## loop over all rows of df1
for _, row1 in df1.iterrows():

    ## check for equality with df2: all elements of cols must match to select those rows
    is_eq_to_row_1 = df2.eq(row1.loc[cols]).loc[:, cols].all(axis=1)

    ## if at least one row of df2 equals row1, append them
    if is_eq_to_row_1.any():
        ## first append row1 itself
        new_df = new_df.append(row1)
        ## then all rows of df2 equal to row1
        new_df = new_df.append(df2[is_eq_to_row_1])
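
Note that DataFrame.append was removed in pandas 2.0, so on a current pandas the loop above needs a small adjustment. A possible sketch that collects the pieces in a list and concatenates once at the end:

pieces = []
for _, row1 in df1.iterrows():
    ## rows of df2 equal to row1 on the key columns
    is_eq_to_row_1 = df2[cols].eq(row1[cols]).all(axis=1)
    if is_eq_to_row_1.any():
        pieces.append(row1.to_frame().T)    ## row1 itself, as a one-row frame
        pieces.append(df2[is_eq_to_row_1])  ## all matching rows of df2

new_df = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame(columns=df1.columns)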

A functional approach

I haven't had time to write a proper solution yet, but I guess it involves groupby, apply and a handful of related pandas functions. For each row x, we still use df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)] to select the rows of df2 equal to x.

We just need to "loop" over all rows. A tool designed for that is groupby. Then we no longer have to care about duplicates.

new_df = df1.groupby(cols).apply(
    lambda x: pd.concat(
        [x, df2[df2.eq(x.iloc[0, :].loc[cols]).loc[:, cols].all(axis=1)]]
    )
)

There is still some work to do so that rows are not appended when no matching df2 row is found, and to clean up the output.
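
As a sketch of that cleanup (the helper name attach_matches is made up here): return an empty frame for groups without a match in df2, then drop the extra index levels added by apply.

## hypothetical helper, only a sketch
def attach_matches(x):
    mask = df2[cols].eq(x.iloc[0][cols]).all(axis=1)
    if not mask.any():
        return x.iloc[0:0]               ## no match in df2: drop this group entirely
    return pd.concat([x, df2[mask]])     ## df1 rows followed by their df2 matches

new_df = (df1.groupby(cols, sort=False)
             .apply(attach_matches)
             .reset_index(drop=True)[df1.columns])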
