How do I write this pandas logic for a pyspark.sql.dataframe.DataFrame without using the Pandas on Spark API?

Posted on 2025-02-13 07:09:49

I'm totally new to PySpark. Since PySpark doesn't have a loc feature, how can we write this logic? I tried specifying conditions but couldn't get the desired result; any help would be greatly appreciated!

df['Total'] = (df['level1']+df['level2']+df['level3']+df['level4'])/df['Number']
df.loc[df['level4'] > 0, 'Total'] += 4
df.loc[((df['level3'] > 0) & (df['Total'] < 1)), 'Total'] += 3
df.loc[((df['level2'] > 0) & (df['Total'] < 1)), 'Total'] += 2
df.loc[((df['level1'] > 0) & (df['Total'] < 1)), 'Total'] += 1
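
For reference, here is a self-contained, runnable version of that snippet on two sample rows (the same values used in the answer below), so the expected output is easy to check against the PySpark versions:

import pandas as pd

# Two sample rows matching the data used in the answer below.
df = pd.DataFrame({
    'level1': [1, 5], 'level2': [1, 5],
    'level3': [1, 5], 'level4': [1, 5],
    'Number': [10, 10],
})

df['Total'] = (df['level1'] + df['level2'] + df['level3'] + df['level4']) / df['Number']
df.loc[df['level4'] > 0, 'Total'] += 4
df.loc[((df['level3'] > 0) & (df['Total'] < 1)), 'Total'] += 3
df.loc[((df['level2'] > 0) & (df['Total'] < 1)), 'Total'] += 2
df.loc[((df['level1'] > 0) & (df['Total'] < 1)), 'Total'] += 1

print(df['Total'].tolist())  # [4.4, 6.0]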


Comments (1)

给妤﹃绝世温柔 2025-02-20 07:09:51

For data like the following:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

data_ls = [
    (1, 1, 1, 1, 10),
    (5, 5, 5, 5, 10)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['level1', 'level2', 'level3', 'level4', 'number'])

# +------+------+------+------+------+
# |level1|level2|level3|level4|number|
# +------+------+------+------+------+
# |     1|     1|     1|     1|    10|
# |     5|     5|     5|     5|    10|
# +------+------+------+------+------+

You're actually updating the total column in each statement, not in an if-then-else way. Your code can be replicated (as is) in PySpark using multiple withColumn() calls with when(), like the following.

data_sdf. \
    withColumn('total', (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')). \
    withColumn('total', func.when(func.col('level4') > 0, func.col('total') + 4).otherwise(func.col('total'))). \
    withColumn('total', func.when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).otherwise(func.col('total'))). \
    withColumn('total', func.when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).otherwise(func.col('total'))). \
    withColumn('total', func.when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).otherwise(func.col('total'))). \
    show()

# +------+------+------+------+------+-----+
# |level1|level2|level3|level4|number|total|
# +------+------+------+------+------+-----+
# |     1|     1|     1|     1|    10|  4.4|
# |     5|     5|     5|     5|    10|  6.0|
# +------+------+------+------+------+-----+
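
Incidentally, if more level columns are added later, these repeated withColumn() calls can also be generated in a loop. This is a minimal sketch, assuming the data_sdf and func imports from above; the add_increment helper is hypothetical, not part of the original answer:

from functools import reduce

# Start from the base ratio, then fold in the conditional increments
# in the same order as the original pandas .loc statements (4 down to 1).
base = data_sdf.withColumn(
    'total',
    (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')
)

def add_increment(sdf, inc):
    cond = func.col(f'level{inc}') > 0
    if inc < 4:  # the pandas code only guards levels 1-3 with Total < 1
        cond = cond & (func.col('total') < 1)
    return sdf.withColumn('total',
                          func.when(cond, func.col('total') + inc).otherwise(func.col('total')))

reduce(add_increment, [4, 3, 2, 1], base).show()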

We can merge all the withColumn() calls with when() into a single withColumn() with a chain of when() statements. This works here because the level and number values are non-negative: whenever an increment fires it pushes total to at least 1, so none of the later total < 1 conditions can fire for the same row.

data_sdf. \
    withColumn('total', (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')). \
    withColumn('total',
               func.when(func.col('level4') > 0, func.col('total') + 4).
               when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).
               when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).
               when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).
               otherwise(func.col('total'))
               ). \
    show()

# +------+------+------+------+------+-----+
# |level1|level2|level3|level4|number|total|
# +------+------+------+------+------+-----+
# |     1|     1|     1|     1|    10|  4.4|
# |     5|     5|     5|     5|    10|  6.0|
# +------+------+------+------+------+-----+

It's like numpy.where and SQL's CASE expression.
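
And indeed, the same logic can be written directly as a SQL CASE expression via func.expr(). A minimal sketch, again assuming the data_sdf from above:

data_sdf. \
    withColumn('total', func.expr('(level1 + level2 + level3 + level4) / number')). \
    withColumn('total', func.expr("""
        CASE
            WHEN level4 > 0 THEN total + 4
            WHEN level3 > 0 AND total < 1 THEN total + 3
            WHEN level2 > 0 AND total < 1 THEN total + 2
            WHEN level1 > 0 AND total < 1 THEN total + 1
            ELSE total
        END""")). \
    show()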
