Filter a DataFrame to get the leaves of a tree structure

Posted 2025-02-10 00:57:41

I have a DataFrame with one column per level of a hierarchy (similar to a directory path). I want to keep only the records at the deepest generation of each branch, i.e. the leaves of the hierarchy tree. I tried a couple of approaches with transform and groupby but could not get the desired output.

Code

import numpy as np
import pandas as pd

df = pd.DataFrame({'lvl1':['aa','aa','aa','aa','bb','bb','bb','bb','cc','aa'],
                   'lvl2':[np.nan,'xx','xx','xx',np.nan,'yy','yy','zz',np.nan,'sa'],
                   'lvl3':[np.nan,np.nan,'ww','qq',np.nan,np.nan,'rr',np.nan,np.nan,'jj'],
                   'value':[12,4,7,22,76,0,18,47,10,2]})
result = pd.DataFrame({'lvl1':['aa','aa','bb','bb','cc','aa'],
                       'lvl2':['xx','xx','yy','zz',np.nan,'sa'],
                       'lvl3':['ww','qq','rr',np.nan,np.nan,'jj'],
                       'value':[7,22,18,47,10,2]})

Appreciate your help

七禾 2025-02-17 00:57:41

To keep every row that has the maximum number of unique values across the level columns, plus only the last row for each smaller count, use:

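# Number of distinct non-missing level values in each row (its depth)
s = df[["lvl1", "lvl2", "lvl3"]].nunique(axis=1)
# To count non-missing values per row instead, use count:
# s = df[["lvl1", "lvl2", "lvl3"]].count(axis=1)

# Keep the last row seen for each depth, plus every row at the maximum depth
df = df[~s.duplicated(keep='last') | s.eq(s.max())]
print(df)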
  lvl1 lvl2 lvl3  value
2   aa   xx   ww      7
3   aa   xx   qq     22
6   bb   yy   rr     18
7   bb   zz  NaN     47
8   cc  NaN  NaN     10
9   aa   sa   jj      2
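
The mask above reproduces the expected result for this sample, but it appears to depend on row order, since `duplicated(keep='last')` keeps whichever row of each depth happens to come last. As an alternative, here is a minimal sketch that treats each row as a path and keeps a row only when no other row extends that path. It is not the answerer's method; the names `levels`, `paths`, `parents` and `is_leaf` are made up for illustration, and it assumes missing values only appear as trailing levels (a missing `lvl2` implies a missing `lvl3`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'lvl1': ['aa', 'aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'bb', 'cc', 'aa'],
                   'lvl2': [np.nan, 'xx', 'xx', 'xx', np.nan, 'yy', 'yy', 'zz', np.nan, 'sa'],
                   'lvl3': [np.nan, np.nan, 'ww', 'qq', np.nan, np.nan, 'rr', np.nan, np.nan, 'jj'],
                   'value': [12, 4, 7, 22, 76, 0, 18, 47, 10, 2]})

levels = ['lvl1', 'lvl2', 'lvl3']  # hypothetical name for the hierarchy columns

# Each row becomes a tuple of its non-missing levels, e.g. ('aa', 'xx').
# Assumption: NaN only occurs in trailing levels, so dropping NaN keeps the path intact.
paths = [tuple(v for v in row if pd.notna(v))
         for row in df[levels].itertuples(index=False, name=None)]

# Every strict prefix of a path is an ancestor, so it cannot be a leaf.
parents = {p[:i] for p in paths for i in range(1, len(p))}

# A row is a leaf when its path is not an ancestor of any other row's path.
is_leaf = pd.Series([p not in parents for p in paths], index=df.index)
leaves = df[is_leaf]
print(leaves)
```

On the sample data this keeps the rows with index 2, 3, 6, 7, 8 and 9, the same rows as the expected `result`, and it should not change if the rows are reordered.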
