Create a duplicate field to count duplicate rows

I have the following dataframe:

    A  B        
0   1  1
1   1  2
2   1  1
3   1  1 
4   2  2

I would like to create a column called "fl_dup" that shows the value '0' when the row is unique or occurs for the first time. Conversely, it should show the value '1' when the row is a duplicate occurring the second time onwards. Ideally, the fl_dup column would look like this:

    A  B  FL_DUP      
0   1  1  0
1   1  2  0
2   1  1  1
3   1  1  1
4   2  2  0

I tried this code, but unfortunately the cast sometimes doesn't work and returns 'null' values. I also can't get the value '0' for duplicate rows that appear for the first time.

  from pyspark.sql import functions as f

  df2 = df.join(
    df.groupBy(df.columns).agg((f.count("*") > 1).cast("int").alias("FL_DUP")),
    on=df.columns,
    how="left"
  )
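
For context, here is a minimal sketch of what that aggregation produces on the sample data (assuming a SparkSession named spark and pyspark.sql.functions imported as f, as in the snippet above); it shows why the join alone cannot tell the first occurrence of a duplicated row from the later ones:

  # The aggregate keeps only one row per (A, B) key, so after the left join
  # every occurrence of a duplicated key, including the first, gets FL_DUP = 1.
  agg = df.groupBy(df.columns).agg((f.count("*") > 1).cast("int").alias("FL_DUP"))
  agg.show()
  # +---+---+------+   (row order may vary)
  # |  A|  B|FL_DUP|
  # +---+---+------+
  # |  1|  1|     1|
  # |  1|  2|     0|
  # |  2|  2|     0|
  # +---+---+------+

The 'null' values are most likely a separate issue: if any of the key columns contain nulls, the equi-join on df.columns does not match those rows, so the left join returns FL_DUP = null for them.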

3 Answers

念三年u 2025-02-02 08:38:01

The following way should work. Note that since you have no column for ordering the rows, the original row order may be lost:

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [(1, 1),
     (1, 2),
     (1, 1),
     (1, 1), 
     (2, 2)],
    ['A', 'B']
)

w = W.partitionBy('A', 'B').orderBy('A')
df = df.withColumn('fl_dup', F.when(F.row_number().over(w) == 1, 0).otherwise(1))

df.show()
# +---+---+------+
# |  A|  B|fl_dup|
# +---+---+------+
# |  1|  1|     0|
# |  1|  1|     1|
# |  1|  1|     1|
# |  1|  2|     0|
# |  2|  2|     0|
# +---+---+------+
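
A possible refinement (a sketch, not part of the original answer): if the original input order should decide which row counts as the first occurrence, you can tag rows with monotonically_increasing_id() before applying the window, instead of ordering by A, which is constant inside each (A, B) partition and therefore picks an arbitrary "first" row:

from pyspark.sql import functions as F, Window as W

# Generated ids follow the current partition/read order, which for a freshly
# created dataframe usually matches the input order.
df_id = df.withColumn('_id', F.monotonically_increasing_id())

w = W.partitionBy('A', 'B').orderBy('_id')
df2 = (df_id
       .withColumn('fl_dup', F.when(F.row_number().over(w) == 1, 0).otherwise(1))
       .drop('_id'))
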
℡Ms空城旧梦 2025-02-02 08:38:01

This should do what you are asking for:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1],[1,2],[1,1],[1,1],[2,2]], columns=("A", "B"))
df['FL_DUP'] = np.where(df.duplicated(['A', 'B'], keep='first'), 1, 0) 

Outputs:

   A  B  FL_DUP
0  1  1       0
1  1  2       0
2  1  1       1
3  1  1       1
4  2  2       0

See pandas.DataFrame.duplicated and numpy.where for more info.
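
As a small side note (a variation on the answer above, not from it): the boolean mask returned by duplicated can be cast directly, which drops the numpy dependency:

# Equivalent one-liner: cast the boolean duplicate mask to 0/1 integers.
df['FL_DUP'] = df.duplicated(['A', 'B'], keep='first').astype(int)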

鲜血染红嫁衣 2025-02-02 08:38:01
  1. Create a column with all values zero and add it to the data frame.
  2. Update the value of the column to 1 for duplicate rows.
In[0]:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1], [1, 2], [1, 1], [1, 1], [2, 2]], columns=("A", "B"))
df.insert(2, "fl_dup", list(np.zeros(df.shape[0], dtype=int)), True)
df.loc[df.duplicated(), 'fl_dup'] = '1'
df

Out[1]:

    A   B   fl_dup
0   1   1   0
1   1   2   0
2   1   1   1
3   1   1   1
4   2   2   0
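
One caveat on the snippet above (my observation, not part of the original answer): assigning the string '1' into a column created with integer zeros can leave fl_dup with object dtype (newer pandas versions also warn about the incompatible assignment). A small variant that keeps the column numeric:

import pandas as pd

df = pd.DataFrame([[1, 1], [1, 2], [1, 1], [1, 1], [2, 2]], columns=("A", "B"))

# Start with integer zeros, then set the integer 1 (not the string '1')
# on the rows that duplicated() marks as repeats.
df["fl_dup"] = 0
df.loc[df.duplicated(["A", "B"]), "fl_dup"] = 1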