Directly replace values in a PySpark column without a join
I have two data frames:
data1 = [('Andy', 'male'), ('Julie', 'female'), ('Danny', 'male')]
columns1 = ['name', 'gender']
df1 = spark.createDataFrame(data=data1, schema=columns1)
data2 = [('male', 1), ('female', 2)]
columns2 = ['gender', 'enum']
df2 = spark.createDataFrame(data=data2, schema=columns2)
+-----+------+
| name|gender|
+-----+------+
| Andy| male|
|Julie|female|
|Danny| male|
+-----+------+
+------+----+
|gender|enum|
+------+----+
| male| 1|
|female| 2|
+------+----+
I am looking to replace column gender in df1 with the enum values from df2. I could do this by:
new_df = df1.join(df2, on='gender', how='inner')
And then drop column gender, and rename column enum in new_df to gender. This is cumbersome and depends on column gender having the same name in both df1 and df2.
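
For reference, the full workaround might look like this (a sketch reusing df1 and df2 from the snippet above):

# Join on the shared column name, drop the original string column,
# then rename the looked-up enum column back to 'gender'.
new_df = (
    df1.join(df2, on='gender', how='inner')
       .drop('gender')
       .withColumnRenamed('enum', 'gender')
)
new_df.show()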
Is there a way to directly replace the values without these intermediate steps?
Since df2 does not contain more than a few thousand elements, you can collect all the data into a dictionary and use it in a UDF, as sketched below. I am not sure it is simpler, but it is another way to do it.
NB: the broadcast is not mandatory, but it allows the dictionary to be sent to each executor only once instead of once per task.
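
A minimal sketch of that collect-and-broadcast approach, reusing spark, df1 and df2 from the question (the UDF name to_enum is just for illustration):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Collect the small df2 into a plain Python dict and broadcast it,
# so the mapping is shipped to each executor only once.
mapping = {row['gender']: row['enum'] for row in df2.collect()}
b_mapping = spark.sparkContext.broadcast(mapping)

@F.udf(returnType=IntegerType())
def to_enum(gender):
    # Look up the enum for a gender string; unknown values become NULL.
    return b_mapping.value.get(gender)

new_df = df1.withColumn('gender', to_enum('gender'))
new_df.show()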
Without a join you would need to provide all the possible mappings yourself, as in the sketch below.
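
A hand-written mapping with chained when() expressions (assuming df1 from the question; the literals mirror df2):

from pyspark.sql import functions as F

# Spell out every mapping by hand; anything not listed becomes NULL
# (add .otherwise(...) for a default value).
new_df = df1.withColumn(
    'gender',
    F.when(F.col('gender') == 'male', 1)
     .when(F.col('gender') == 'female', 2)
)
new_df.show()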