How to remove duplicates and nulls when the group contains a non-null value?

Posted on 2025-02-12 09:54:16

The following DataFrame should be filtered based on the flag column. If a group defined by the columns id and cod doesn't have any row with a value different from None, just a single row should be kept; otherwise, the rows with a None value in the flag column should be removed.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, max

spark = SparkSession.builder.appName('Vazio').getOrCreate()

data = [('1', 10, 'A'),
        ('1', 10, 'A'),
        ('1', 10, None),
        ('1', 15, 'A'),
        ('1', 15, None),
        ('2', 11, 'A'),
        ('2', 11, 'C'),
        ('2', 12, 'B'),
        ('2', 12, 'B'),
        ('2', 12, 'C'),
        ('2', 12, 'C'),
        ('2', 13, None),
        ('3', 14, None),
        ('3', 14, None),
        ('3', 15, None),
        ('4', 21, 'A'),
        ('4', 21, 'B'),
        ('4', 21, 'C'),
        ('4', 21, 'C')]

df = spark.createDataFrame(data=data, schema=['id', 'cod', 'flag'])
df.show()

How could I obtain the next DataFrame from the one above using PySpark?

+---+---+----+
| id|cod|flag|
+---+---+----+
|  1| 10|   A|
|  1| 15|   A|
|  2| 11|   A|
|  2| 11|   C|
|  2| 12|   B|
|  2| 12|   C|
|  2| 13|null|
|  3| 14|null|
|  3| 15|null|
|  4| 21|   A|
|  4| 21|   C|
+---+---+----+
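
For reference, a minimal sketch (my addition, not part of the original question) of one way to express that rule in PySpark: keep the distinct (id, cod, flag) rows, and drop the null-flag rows only in groups that also contain a non-null flag. Note that this keeps every distinct non-null flag, so group (4, 21) would retain A, B and C, while the expected output above lists only A and C.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count the non-null flags in each (id, cod) group; count('flag') ignores nulls.
w = Window.partitionBy('id', 'cod')
result = (df
          .withColumn('n_non_null', F.count('flag').over(w))
          # Keep a null-flag row only when its group has no non-null flag at all.
          .filter(F.col('flag').isNotNull() | (F.col('n_non_null') == 0))
          .drop('n_non_null')
          .dropDuplicates(['id', 'cod', 'flag']))
result.show()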

Comments (3)

流心雨 2025-02-19 09:54:16

This could be a solution, using pandas:

import pandas as pd

# Start from an empty frame with the same columns and append rows one by one.
new_df = pd.DataFrame(columns=df.columns)

for index, row in df.iterrows():
    # Skip exact duplicates that have already been appended.
    if row.values.tolist() not in new_df.values.tolist():

        # Skip a None-flag row if its (id, cod) pair is already present.
        if row["flag"] is None and row.values.tolist()[:2] in new_df[list(new_df.columns)[:2]].values.tolist():
            continue

        new_df.loc[-1] = row.values.tolist()
        new_df.index += 1

At the end you could just add:

df = new_df.copy(deep=True)
del new_df
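
Note (my addition): this loop works on a pandas DataFrame, while df in the question is a Spark DataFrame, so a hedged usage sketch would convert back and forth, for example:

pdf = df.toPandas()                    # bring the Spark DataFrame into pandas
# ... run the loop above on pdf instead of df to build new_df ...
df = spark.createDataFrame(new_df)     # convert the filtered result back to Spark
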
暮年慕年 2025-02-19 09:54:16

With PySpark. Adapted from this answer (Spark)

window = Window.partitionBy(['id', 'cod']).orderBy(col('flag').desc())
out = (df.withColumn('row',  row_number().over(window))
         .filter(col('row') == 1).drop('row'))
out.show()

# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  2| 11|   A|
|  2| 12|   B|
|  1| 15|   A|
|  3| 14|null|
|  2| 13|null|
|  3| 15|null|
|  1| 10|   A|
+---+---+----+
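
A side note (my addition, not part of the original answer): this picks a non-null flag whenever the group has one because Spark's descending sort places nulls last by default. If you want to make that explicit, the window could be written as:

# Same window, with the null ordering spelled out.
window = Window.partitionBy(['id', 'cod']).orderBy(col('flag').desc_nulls_last())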

Setup

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

data = [['1', 10, 'A'],
        ['1', 10, 'A'],
        ['1', 10, None],
        ['1', 15, 'A'],
        ['1', 15, None],
        ['2', 11, 'A'],
        ['2', 12, 'B'],
        ['2', 12, 'B'],
        ['2', 13, None],
        ['3', 14, None],
        ['3', 14, None],
        ['3', 15, None]]

columns = ['id', 'cod', 'flag']
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data = data, schema = columns)
df.show()

# Output
+---+---+----+
| id|cod|flag|
+---+---+----+
|  1| 10|   A|
|  1| 10|   A|
|  1| 10|null|
|  1| 15|   A|
|  1| 15|null|
|  2| 11|   A|
|  2| 12|   B|
|  2| 12|   B|
|  2| 13|null|
|  3| 14|null|
|  3| 14|null|
|  3| 15|null|
+---+---+----+
说好的呢 2025-02-19 09:54:16

Answer to the edited question:

from pyspark.sql import functions as F

# collect_set drops nulls, so each (id, cod) group keeps its distinct non-null flags.
df = df.groupBy('id', 'cod').agg(F.collect_set('flag').alias('flag'))
# Explode the set back into rows; groups whose set is empty (only nulls) get one null row.
df = df.withColumn(
    'flag',
    F.explode(F.when(F.size('flag') != 0, F.col('flag')).otherwise(F.array(F.lit(None))))
)
df.show()
# +---+---+----+
# | id|cod|flag|
# +---+---+----+
# |  2| 11|   C|
# |  2| 11|   A|
# |  2| 12|   C|
# |  2| 12|   B|
# |  1| 15|   A|
# |  1| 10|   A|
# |  3| 14|null|
# |  4| 21|   C|
# |  4| 21|   B|
# |  4| 21|   A|
# |  2| 13|null|
# |  3| 15|null|
# +---+---+----+

Note that the group (4, 21) contains all three flags A, B and C. The description of the question doesn't mention keeping only 2 of the values, but in the example the B was somehow removed, so I assumed that was a mistake.


Answer to the original question:

# max() ignores nulls, so each group keeps a non-null flag when one exists.
df = df.groupBy('id', 'cod').agg(F.max('flag').alias('flag'))

You can't just remove duplicates based on the columns id and cod, because there's no guarantee that the row you keep will always be the one whose flag is not null.
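
To illustrate (my addition): dropping duplicates on just those two columns keeps an arbitrary row per group, so the null-flag row may be the one that survives:

# Non-deterministic: for (1, 10) this may keep the null row instead of 'A'.
df.dropDuplicates(['id', 'cod']).show()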
