Create a duplicate-flag field to mark duplicate rows
I have the following dataframe:
   A  B
0  1  1
1  1  2
2  1  1
3  1  1
4  2  2
I would like to create a column called "fl_dup" that shows the value '0' in case the row is unique or when it occurs for the first time. On the contrary, it should show the value '1' when the row is duplicated and occurs the second time onwards. Ideally the fl_dup column would look like this:
   A  B  FL_DUP
0  1  1       0
1  1  2       0
2  1  1       1
3  1  1       1
4  2  2       0
I tried with this code, but unfortunately sometimes the cast doesn't work and returns 'null' values. I also can't get the '0' value for duplicate rows that appear for the first time.
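from pyspark.sql import functions as f

# Attempt: count identical rows per group, then join the flag back onto df
df2 = df.join(
    df.groupBy(df.columns).agg((f.count("*") > 1).cast("int").alias("FL_DUP")),
    on=df.columns,
    how="left"
)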
Comments (3)
With the following approach, the original row order may be lost, since you have no column to order the rows by:
This should do what you are asking for: mark repeated rows with `DataFrame.duplicated` and map the resulting booleans to 0/1 with `numpy.where`.
See pandas.DataFrame.duplicated and numpy.where for more info.
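The code for this answer is missing from the page. As a sketch of what a pandas version could look like, assuming the data fits in a pandas DataFrame (the question itself uses PySpark): `DataFrame.duplicated()` with its default `keep="first"` returns `False` for the first occurrence of each row and `True` for later copies, and `numpy.where` turns that into 0/1.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"A": [1, 1, 1, 1, 2], "B": [1, 2, 1, 1, 2]})

# duplicated() compares whole rows; the first occurrence is not marked
df["FL_DUP"] = np.where(df.duplicated(), 1, 0)

print(df)
#    A  B  FL_DUP
# 0  1  1       0
# 1  1  2       0
# 2  1  1       1
# 3  1  1       1
# 4  2  2       0
```

Unlike the Spark approach, this keeps the original row order, because pandas evaluates the rows in index order.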