How to collapse near duplicates into one row using a modified bfill in pandas
I have a dataframe like the one shown below:
ID,F1,F2,F3,F4,F5,F6,L1,L2,L3,L4,L5,L6
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
2,X,,,X,,X,A,B,C,D,E
3,X,X,X,,,X,A
3,X,X,X,,,X,,B,,C
3,X,X,X,,,X,,D,C
4,X,X,,,,,A,B
4,,,,X,,X,G,H,I
4,,,X,,,,T
import pandas as pd
df = pd.read_clipboard(sep=',')
I would like to do the following:
a) Remove full duplicates (rows where the values in every column match), keeping the first occurrence. Example: ID=1.
b) Collapse near duplicates into one row. Example: ID=3 and ID=4. Near duplicates are rows where only the ID matches but the F-numbered and L-numbered columns differ.
I tried the code below, but it produces incorrect output.
It fails to copy L-numbered values from later rows when the earlier row already has a non-NA value in that column.
df = df.drop_duplicates(keep='first')  # this drops full duplicates, e.g. ID = 1
# a list (double brackets) is needed to select several columns from a groupby
df.groupby(['ID'])[['ID','F1','F2','F3','F4','F5','F6','L1','L2','L3','L4','L5','L6']].bfill().drop_duplicates(subset=['ID'],keep='first')
In the real data, there are 50 F columns and 50 L columns. For the F columns,
the position of X is important and must be preserved, whereas for the L columns a value can land in any L column as long as it is captured.
I expect my output to look like the one shown below.
Comments (1)
Use:
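One possible approach, offered as a sketch under assumptions rather than as the original answer's code: take the first non-null value per ID for each F column (which keeps every X in its own column), and gather the distinct non-null L values per ID, packing them to the left into L columns. Loading the sample through io.StringIO instead of the clipboard, the helper names f_cols, l_cols, packed, f_part and l_part, and the choice to drop repeated L values within an ID are all assumptions made for illustration.

```python
import io

import pandas as pd

# Sample data from the question, loaded from a string instead of the clipboard
# (an assumption made so the sketch is reproducible).
raw = """ID,F1,F2,F3,F4,F5,F6,L1,L2,L3,L4,L5,L6
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
1,X,,X,,,X,A,B,C
2,X,,,X,,X,A,B,C,D,E
3,X,X,X,,,X,A
3,X,X,X,,,X,,B,,C
3,X,X,X,,,X,,D,C
4,X,X,,,,,A,B
4,,,,X,,X,G,H,I
4,,,X,,,,T
"""
df = pd.read_csv(io.StringIO(raw))

# Column groups are picked by prefix, so the same code covers 50 F and 50 L columns.
f_cols = [c for c in df.columns if c.startswith("F")]
l_cols = [c for c in df.columns if c.startswith("L")]

# a) Drop rows that are identical in every column (NaN counts as equal to NaN).
df = df.drop_duplicates(keep="first")

# b) F columns: first non-null value per ID and column, so each X stays in place.
f_part = df.groupby("ID")[f_cols].first()

# b) L columns: collect distinct non-null values per ID in row-by-row order.
packed = {}
for key, group in df.groupby("ID"):
    seen = []
    for value in group[l_cols].to_numpy().ravel():
        if pd.notna(value) and value not in seen:
            seen.append(value)
    packed[key] = seen

# Ragged lists become a left-packed frame; shorter lists are padded with missing values.
l_part = pd.DataFrame.from_dict(packed, orient="index").rename_axis("ID")
l_part.columns = [f"L{i + 1}" for i in range(l_part.shape[1])]

result = f_part.join(l_part).reset_index()
print(result)
```

With the sample data this gives one row per ID: ID 3 ends up with the L values A, B, C, D, and ID 4 has X in F1, F2, F3, F4 and F6. If repeated L values should be kept instead, remove the `value not in seen` check; the number of L columns in the result grows to fit the longest list.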