PySpark join duplicates the common column
Good day,

I'm joining data in PySpark. Coming from SQL, I like to define the join by a common key like this:

data_want = data_1.join(data_2, data_1.common_key == data_2.common_key, 'left')
data_want.columns
[<normal_columns>, common_key, common_key]

I get duplicate entries of the common_key column. Very odd. When doing this with the shorter syntax:

data_want = data_1.join(data_2, 'common_key', 'left')
data_want.columns
[<normal_columns>, common_key]

all seems to be OK.

Can anyone explain what's going on here? Moreover, how would one go about writing the longer version, which I find more familiar? I can't seem to refer to the second column with the same name.

Running on Databricks with Spark 3.2.1 and Scala 2.12.
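For reference, a minimal, self-contained reproduction of what I'm seeing (the frames and values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data_1 = spark.createDataFrame([(1, "a")], ["common_key", "val_1"])
data_2 = spark.createDataFrame([(1, "x")], ["common_key", "val_2"])

# Expression form: both inputs' common_key columns survive the join.
data_1.join(data_2, data_1.common_key == data_2.common_key, 'left').columns
# ['common_key', 'val_1', 'common_key', 'val_2']

# String form: only a single common_key column remains.
data_1.join(data_2, 'common_key', 'left').columns
# ['common_key', 'val_1', 'val_2']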
Comments (1)
In Spark, each column has a unique ID used by the Catalyst/SQL engine. The ID is internal, more of a piece of metadata, and very hard to refer to. Join expressions do not eliminate the common keys: because both inputs keep their own common_key column, it shows up twice in the result. You can't drop one of them by name, and the internal ID is hidden too. A common approach is therefore to alias the DataFrames to create unique qualified names, and then use those names to drop the column. Code below.
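A minimal sketch of that alias-and-drop approach (the example frames and the "l"/"r" alias names are illustrative, not from the original post):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs sharing a common_key column.
data_1 = spark.createDataFrame([(1, "a"), (2, "b")], ["common_key", "val_1"])
data_2 = spark.createDataFrame([(1, "x"), (3, "y")], ["common_key", "val_2"])

# Alias each side so the two otherwise-identical columns become
# addressable as l.common_key and r.common_key.
data_want = (
    data_1.alias("l")
    .join(data_2.alias("r"), F.col("l.common_key") == F.col("r.common_key"), "left")
    .drop(F.col("r.common_key"))  # drop the right-hand copy via its qualified name
)

data_want.columns
# ['common_key', 'val_1', 'val_2']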
Another option is to rename the common columns before the join, so there is no name collision in the first place.
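A sketch of that rename-then-join variant (the name common_key_2 is just an illustration):

data_2_renamed = data_2.withColumnRenamed("common_key", "common_key_2")

data_want = data_1.join(
    data_2_renamed,
    data_1.common_key == data_2_renamed.common_key_2,
    "left",
)

# The renamed copy can now be dropped unambiguously by name.
data_want = data_want.drop("common_key_2")

data_want.columns
# ['common_key', 'val_1', 'val_2']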