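Splitting a grouped-key column into separate columns
I group by two columns and compute the element-wise sum of an array column.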
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    [
        [1, 'IT', np.array([2, 5, 3])],
        [1, 'IT', np.array([2, 5, 3])],
        [1, 'Sport', np.array([2, 5, 3, 5, 3])],
        [2, 'Sport', np.array([2, 5, 3])],
        [2, 'IT', np.array([2, 5, 3])],
        [2, 'Sport', np.array([2, 5, 3, 5, 3])],
    ],
    columns=['doc_id', 'type', 'topic_dist'],
)

# Group by both key columns; each group key k is a (doc_id, type) tuple.
grouped = df2.groupby(['doc_id', 'type'])
# Expand each array into columns, then sum element-wise within the group.
aggregate = [(k, v['topic_dist'].apply(pd.Series).sum().to_list()) for k, v in grouped]
df_results = pd.DataFrame(aggregate, columns=['grouped_columns', 'topic_dist'])
This gives me the following result:
  grouped_columns                  topic_dist
0         (1, IT)                  [4, 10, 6]
1      (1, Sport)             [2, 5, 3, 5, 3]
2         (2, IT)                   [2, 5, 3]
3      (2, Sport)  [4.0, 10.0, 6.0, 5.0, 3.0]
Expected result:
   doc_id   type                  topic_dist
0       1     IT                  [4, 10, 6]
1       1  Sport             [2, 5, 3, 5, 3]
2       2     IT                   [2, 5, 3]
3       2  Sport  [4.0, 10.0, 6.0, 5.0, 3.0]
Any ideas on how to split the grouped_columns column into separate columns?
You can achieve this by setting the index using pd.MultiIndex.from_tuples, built from the tuples in grouped_columns.
Or, if you would like them to be regular columns instead of a MultiIndex, the index levels can be turned back into columns.
Note that I am using df as shorthand here for df_results to improve readability.
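A minimal sketch of how the two approaches might look, not necessarily the answerer's exact code. It assumes the question's data and aggregation, uses df for df_results as the answer does, and introduces df_indexed and df_flat as hypothetical names for illustration. The level names doc_id and type are taken from the expected result in the question.

```python
import numpy as np
import pandas as pd

# Same sample data and aggregation as in the question.
df2 = pd.DataFrame(
    [
        [1, "IT", np.array([2, 5, 3])],
        [1, "IT", np.array([2, 5, 3])],
        [1, "Sport", np.array([2, 5, 3, 5, 3])],
        [2, "Sport", np.array([2, 5, 3])],
        [2, "IT", np.array([2, 5, 3])],
        [2, "Sport", np.array([2, 5, 3, 5, 3])],
    ],
    columns=["doc_id", "type", "topic_dist"],
)
grouped = df2.groupby(["doc_id", "type"])
aggregate = [
    (k, v["topic_dist"].apply(pd.Series).sum().to_list()) for k, v in grouped
]
df = pd.DataFrame(aggregate, columns=["grouped_columns", "topic_dist"])

# Option 1: use the (doc_id, type) tuples as a two-level MultiIndex.
# df_indexed is a hypothetical name used only in this sketch.
df_indexed = df.drop(columns="grouped_columns")
df_indexed.index = pd.MultiIndex.from_tuples(
    df["grouped_columns"].tolist(), names=["doc_id", "type"]
)

# Option 2: turn the index levels back into regular columns.
# df_flat is likewise a hypothetical name.
df_flat = df_indexed.reset_index()

print(df_flat)
```

With option 2, df_flat has the columns doc_id, type, and topic_dist, matching the layout of the expected result in the question.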