Create a duplicate field to count duplicate rows

I have the following dataframe:

    A  B        
0   1  1
1   1  2
2   1  1
3   1  1 
4   2  2

I would like to create a column called "fl_dup" that shows the value '0' when the row is unique or occurs for the first time. Conversely, it should show the value '1' when the row is a duplicate occurring the second time onwards. Ideally, the fl_dup column would look like this:

    A  B  FL_DUP      
0   1  1  0
1   1  2  0
2   1  1  1
3   1  1  1
4   2  2  0

I tried this code, but unfortunately the cast sometimes doesn't work and returns 'null' values. I also can't get the value '0' for duplicate rows that appear for the first time.

  from pyspark.sql import functions as f

  df2 = df.join(
    df.groupBy(df.columns).agg((f.count("*") > 1).cast("int").alias("FL_DUP")),
    on=df.columns,
    how="left"
  )
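
For context, here is a minimal sketch of what that aggregation produces on the sample data (assuming a SparkSession named spark and pyspark.sql.functions imported as f, as in the snippet above); it shows why the join alone cannot tell the first occurrence of a duplicated row from the later ones:

  # The aggregate keeps only one row per (A, B) key, so after the left join
  # every occurrence of a duplicated key, including the first, gets FL_DUP = 1.
  agg = df.groupBy(df.columns).agg((f.count("*") > 1).cast("int").alias("FL_DUP"))
  agg.show()
  # +---+---+------+   (row order may vary)
  # |  A|  B|FL_DUP|
  # +---+---+------+
  # |  1|  1|     1|
  # |  1|  2|     0|
  # |  2|  2|     0|
  # +---+---+------+

The 'null' values are most likely a separate issue: if any of the key columns contain nulls, the equi-join on df.columns does not match those rows, so the left join returns FL_DUP = null for them.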

3 Answers

念三年u 2025-02-02 08:38:01

The following way should work. Note that since you have no column for ordering the rows, the original row order may be lost:

from pyspark.sql import functions as F, Window as W

df = spark.createDataFrame(
    [(1, 1),
     (1, 2),
     (1, 1),
     (1, 1), 
     (2, 2)],
    ['A', 'B']
)

w = W.partitionBy('A', 'B').orderBy('A')
df = df.withColumn('fl_dup', F.when(F.row_number().over(w) == 1, 0).otherwise(1))

df.show()
# +---+---+------+
# |  A|  B|fl_dup|
# +---+---+------+
# |  1|  1|     0|
# |  1|  1|     1|
# |  1|  1|     1|
# |  1|  2|     0|
# |  2|  2|     0|
# +---+---+------+
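
A possible refinement (a sketch, not part of the original answer): if the original input order should decide which row counts as the first occurrence, you can tag rows with monotonically_increasing_id() before applying the window, instead of ordering by A, which is constant inside each (A, B) partition and therefore picks an arbitrary "first" row:

from pyspark.sql import functions as F, Window as W

# Generated ids follow the current partition/read order, which for a freshly
# created dataframe usually matches the input order.
df_id = df.withColumn('_id', F.monotonically_increasing_id())

w = W.partitionBy('A', 'B').orderBy('_id')
df2 = (df_id
       .withColumn('fl_dup', F.when(F.row_number().over(w) == 1, 0).otherwise(1))
       .drop('_id'))
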
℡Ms空城旧梦 2025-02-02 08:38:01

This should do what you are asking for:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,1],[1,2],[1,1],[1,1],[2,2]], columns=("A", "B"))
df['FL_DUP'] = np.where(df.duplicated(['A', 'B'], keep='first'), 1, 0) 

Outputs:

   A  B  FL_DUP
0  1  1       0
1  1  2       0
2  1  1       1
3  1  1       1
4  2  2       0

See pandas.DataFrame.duplicated and numpy.where for more info.
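
As a small side note (a variation on the answer above, not from it): the boolean mask returned by duplicated can be cast directly, which drops the numpy dependency:

# Equivalent one-liner: cast the boolean duplicate mask to 0/1 integers.
df['FL_DUP'] = df.duplicated(['A', 'B'], keep='first').astype(int)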

鲜血染红嫁衣 2025-02-02 08:38:01
  1. Create a column with all values zero and add it to the data frame.
  2. Update the value of the column to 1 for duplicate rows.
In[0]:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1], [1, 2], [1, 1], [1, 1], [2, 2]], columns=("A", "B"))
df.insert(2, "fl_dup", list(np.zeros(df.shape[0], dtype=int)), True)
df.loc[df.duplicated(), 'fl_dup'] = '1'
df

Out[1]:

    A   B   fl_dup
0   1   1   0
1   1   2   0
2   1   1   1
3   1   1   1
4   2   2   0
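
One caveat on the snippet above (my observation, not part of the original answer): assigning the string '1' into a column created with integer zeros can leave fl_dup with object dtype (newer pandas versions also warn about the incompatible assignment). A small variant that keeps the column numeric:

import pandas as pd

df = pd.DataFrame([[1, 1], [1, 2], [1, 1], [1, 1], [2, 2]], columns=("A", "B"))

# Start with integer zeros, then set the integer 1 (not the string '1')
# on the rows that duplicated() marks as repeats.
df["fl_dup"] = 0
df.loc[df.duplicated(["A", "B"]), "fl_dup"] = 1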