pandas: append rows to another dataframe under the similar rows based on multiple columns
I asked quite a similar question here, but was wondering how to tackle the issue when one has to rely on multiple columns to perform the append.
The dataframes look as follows:
import pandas as pd
d1 ={'col1': ['I ate dinner','I ate dinner', 'the play was inetresting','the play was inetresting'],
'col2': ['I ate dinner','I went to school', 'the play was inetresting for her','the gold is shining'],
'col3': ['I went out','I did not stay at home', 'the play was inetresting for her','the house is nice'],
'col4': ['min', 'max', 'mid','min'],
'col5': ['min', 'max', 'max','max']}
d2 ={'col1': ['I ate dinner',' the glass is shattered', 'the play was inetresting'],
'col2': ['I ate dinner',' the weather is nice', 'the gold is shining'],
'col3': ['I went out',' the house was amazing', 'the house is nice'],
'col4': ['min', 'max', 'max'],
'col5': ['max', 'min', 'mid']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
This time, I would like to append the rows from df2 under the similar rows in df1 only if the rows match in all of col1, col2, and col3. The expected output is:
   col1                      col2                 col3               col4  col5
0  I ate dinner              I ate dinner         I went out         min   min
1  I ate dinner              I ate dinner         I went out         min   max
2  the play was inetresting  the gold is shining  the house is nice  min   max
3  the play was inetresting  the gold is shining  the house is nice  max   mid
So I tried the following:
df = pd.concat(df1[df1.set_index(['col1','col2','col3']).index.isin(df2.set_index(['col1','col2','col3']).index)]).sort_values(df1.set_index(['col1','col2','col3']).index, ignore_index=True)
but I get this error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Comments (3)
Another solution is to use pd.merge and pd.wide_to_long:
Output:
Step by step
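The code for this answer did not survive on the page, so what follows is only a sketch of how pd.merge and pd.wide_to_long could be combined here, not the answerer's actual solution. The helper column names pos and src and the suffixes _1 and _2 are assumptions made for illustration, and the sketch assumes each row of df1 matches at most one row of df2 (otherwise pos no longer identifies rows uniquely and pd.wide_to_long raises an error).

```python
import pandas as pd

cols = ['col1', 'col2', 'col3']
df1 = pd.DataFrame({
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max']})
df2 = pd.DataFrame({
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid']})

# Keep df1's row position in a helper column (hypothetical name 'pos')
# so the final result can be ordered by it.
left = df1.rename_axis('pos').reset_index()

# The inner merge keeps only df1 rows that have a df2 match on all key columns.
# The value columns become col4_1/col5_1 (from df1) and col4_2/col5_2 (from df2).
wide = pd.merge(left, df2, on=cols, suffixes=('_1', '_2'))

# Stack the suffixed value columns: one output row per (pos, src) pair.
long = pd.wide_to_long(wide, stubnames=['col4', 'col5'], i='pos', j='src', sep='_')

# Sort by (pos, src) so each df2 row lands directly under its df1 row.
result = long.sort_index().reset_index(drop=True)[cols + ['col4', 'col5']]
print(result)
```

The inner merge does the filtering and pd.wide_to_long does the interleaving, so no explicit loop is needed.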
OK, I realized my own mistake and will post an answer here in case it is useful to anyone (the answer is based on the link in the question).
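The posted code is missing from the page, so the following is a guess at the kind of fix meant, not the asker's actual answer. The error message in the question says that pd.concat received a single DataFrame rather than an iterable of them, so this sketch passes a list containing the matching rows of both frames. The helper column _src is an assumption added here to keep each df1 row above its df2 counterpart after sorting.

```python
import pandas as pd

cols = ['col1', 'col2', 'col3']
df1 = pd.DataFrame({
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max']})
df2 = pd.DataFrame({
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid']})

# Build one composite key per row from the three key columns.
k1 = pd.MultiIndex.from_frame(df1[cols])
k2 = pd.MultiIndex.from_frame(df2[cols])

# Keep only rows whose key appears in the other frame, and tag their origin
# (hypothetical helper column '_src': 0 for df1, 1 for df2).
from_df1 = df1[k1.isin(k2)].assign(_src=0)
from_df2 = df2[k2.isin(k1)].assign(_src=1)

# pd.concat expects a list (or other iterable) of frames, not a single frame.
result = (pd.concat([from_df1, from_df2])
          .sort_values(cols + ['_src'], ignore_index=True)
          .drop(columns='_src'))
print(result)
```

Note that sort_values orders the groups by key value, which matches df1's row order for this data but not necessarily in general.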
I strongly advise you to propose a minimal numeric example instead of a text-based one: it is easier to read and easier to understand. That being said, if I understand correctly, for each row of df1 you want:
Of course, we can discuss the case of duplicates in df1 and how you want to process them. Then we can write two solutions, one using a for loop and another using functional programming with pandas (depending on your skills, habits, and other preferences).
A for-loop approach
Let's assume there are no duplicates in df1, then:
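The loop itself is missing from the page. A minimal sketch of what such a loop might look like is given here, assuming that df1 rows without any match in df2 should be left out, as in the expected output of the question. The row selection compares only the key columns, a slight variation on the expression quoted further down in this answer.

```python
import pandas as pd

cols = ['col1', 'col2', 'col3']
df1 = pd.DataFrame({
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max']})
df2 = pd.DataFrame({
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid']})

pieces = []
for _, x in df1.iterrows():
    # Rows of df2 whose key columns all equal those of the current df1 row.
    matches = df2[df2[cols].eq(x[cols]).all(axis=1)]
    if not matches.empty:
        pieces.append(x.to_frame().T)  # the df1 row as a one-row frame
        pieces.append(matches)         # its df2 matches directly underneath

# pd.concat raises on an empty list, so fall back to an empty frame.
result = pd.concat(pieces, ignore_index=True) if pieces else df1.iloc[0:0]
print(result)
```

Because the loop walks df1 in order, the output keeps df1's row order without any sorting.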
A functional approach
I haven't had time yet to write a proper solution, but I guess it involves groupby, apply, and a bunch of pandas functions. For each row x, we still use
df2[df2.eq(x.loc[cols]).loc[:, cols].all(axis=1)]
to select the rows of df2 equal to x. We just need to "loop" over all rows. A tool designed for this could be groupby. Then we don't care about duplicates anymore.
There is still some work to do to avoid appending rows when no matching row of df2 was found, and to clean up the output.
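One possible way to carry the groupby idea through is sketched below; it is an illustration rather than the solution the answer had in mind. It iterates over the groups instead of calling apply, since how apply treats the grouping columns has changed across pandas versions, and it skips groups with no match in df2.

```python
import pandas as pd

cols = ['col1', 'col2', 'col3']
df1 = pd.DataFrame({
    'col1': ['I ate dinner', 'I ate dinner', 'the play was inetresting', 'the play was inetresting'],
    'col2': ['I ate dinner', 'I went to school', 'the play was inetresting for her', 'the gold is shining'],
    'col3': ['I went out', 'I did not stay at home', 'the play was inetresting for her', 'the house is nice'],
    'col4': ['min', 'max', 'mid', 'min'],
    'col5': ['min', 'max', 'max', 'max']})
df2 = pd.DataFrame({
    'col1': ['I ate dinner', ' the glass is shattered', 'the play was inetresting'],
    'col2': ['I ate dinner', ' the weather is nice', 'the gold is shining'],
    'col3': ['I went out', ' the house was amazing', 'the house is nice'],
    'col4': ['min', 'max', 'max'],
    'col5': ['max', 'min', 'mid']})

pieces = []
# sort=False keeps the groups in order of first appearance in df1.
for key, group in df1.groupby(cols, sort=False):
    # key is a tuple with one value per grouping column.
    key_row = pd.Series(key, index=cols)
    matches = df2[df2[cols].eq(key_row).all(axis=1)]
    if not matches.empty:
        # All df1 rows sharing this key, followed by the matching df2 rows.
        pieces.append(pd.concat([group, matches]))

result = pd.concat(pieces, ignore_index=True) if pieces else df1.iloc[0:0]
print(result)
```

Duplicated keys in df1 end up in the same group, so the matching df2 rows are appended once per key rather than once per duplicate.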