Directly replace values in a PySpark column without a join

Posted on 2025-02-13 03:03:12

I have two data frames:

data1 = [('Andy', 'male'), ('Julie', 'female'), ('Danny', 'male')]
columns1 = ['name', 'gender']
df1 = spark.createDataFrame(data=data1, schema=columns1)


data2 = [('male', 1), ('female', 2)]
columns2 = ['gender', 'enum']
df2 = spark.createDataFrame(data=data2, schema=columns2)

+-----+------+
| name|gender|
+-----+------+
| Andy|  male|
|Julie|female|
|Danny|  male|
+-----+------+

+------+----+
|gender|enum|
+------+----+
|  male|   1|
|female|   2|
+------+----+

I am looking to replace column gender in df1 with the enum values from df2. I could do this by:

new_df = df1.join(df2, on='gender', how='inner')

Then I would drop the original gender column and rename the enum column in new_df to gender. This is cumbersome, and it relies on the gender column having the same name in both df1 and df2.

Is there a way to directly replace the values without these intermediate steps?

Comments (2)

迷爱 2025-02-20 03:03:12

Since df2 does not contain more than a few thousand elements, you can collect all the data and write a udf like this:

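from pyspark.sql import functions as f

sc = spark.sparkContext
# Collect the small lookup table on the driver and broadcast it as a dict
df2_list = df2.collect()
d = sc.broadcast(dict([(c[0], c[1]) for c in df2_list]))
# No return type is given, so the UDF returns strings (the default is StringType)
replace = f.udf(lambda x: d.value[x])

# Then you can use replace on any dataframe like this:
df1.withColumn("gender", replace("gender")).show()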
+-----+------+
| name|gender|
+-----+------+
| Andy|     1|
|Julie|     2|
|Danny|     1|
+-----+------+

I am not sure it is simpler, but it is another way to do it.

NB: the broadcast is not mandatory, but it allows the dictionary to be sent to each executor only once instead of with every single task.

变身佩奇 2025-02-20 03:03:12

Without a join, you would need to provide all the possible mappings yourself:

df = df.replace({'male':'1', 'female':'2'}, subset='gender')
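If typing the mapping by hand is not practical, the dictionary could also be built from the lookup data frame itself. The following is only a sketch under the assumption that the lookup table is small enough to collect on the driver; the helper names lookup_to_mapping and replace_from_lookup are hypothetical and not part of PySpark.

```python
def lookup_to_mapping(lookup_df, key_col, value_col):
    """Collect a small lookup DataFrame into a {key: str(value)} dict.

    Hypothetical helper. DataFrame.replace does not accept mixed-type
    replacements, so the values are cast to str to match the string keys.
    """
    return {row[key_col]: str(row[value_col]) for row in lookup_df.collect()}


def replace_from_lookup(df, lookup_df, target_col, key_col, value_col):
    """Replace values in df[target_col] using pairs taken from lookup_df.

    Hypothetical helper. The column names are passed separately, so the
    two data frames do not need to share a column name.
    """
    mapping = lookup_to_mapping(lookup_df, key_col, value_col)
    # Values that have no entry in the mapping are left unchanged.
    return df.replace(mapping, subset=[target_col])
```

With the data frames from the question, this might be called as replace_from_lookup(df1, df2, target_col='gender', key_col='gender', value_col='enum'). The gender column stays a string column afterwards, so a cast would still be needed if integers are required.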