Is there a difference between these two forms of sorting (orderBy) in PySpark?
Say we have the following DataFrame (borrowed from the 'PySpark by Examples' website):
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
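(The snippet above omits the DataFrame construction; presumably it is built roughly like this, with the column names inferred from the output shown further down and with col imported for the second variant below:)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Column names inferred from the output table shown below.
columns = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=columns)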
Then, if we run the following two sort (orderBy) commands:
df.sort("department","state").show(truncate=False)
or
df.sort(col("department"),col("state")).show(truncate=False)
We get the same result:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria |Finance |CA |90000 |24 |23000|
|Raman |Finance |CA |99000 |40 |24000|
|Jen |Finance |NY |79000 |53 |15000|
|Scott |Finance |NY |83000 |36 |19000|
|Jeff |Marketing |CA |80000 |25 |18000|
|Kumar |Marketing |NY |91000 |50 |21000|
|Robert |Sales |CA |81000 |30 |23000|
|James |Sales |NY |90000 |34 |10000|
|Michael |Sales |NY |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
I know the first one takes the DataFrame column name as a string and the second one takes columns as Column objects. But is there a difference between the two when it comes to processing or future use? Is one of them better than the other, or the standard PySpark form? Or are they just aliases?
PS: In addition to the above, one of the reasons I'm asking this question is that someone told me there is a 'standard' business form for using Spark. For example, 'alias' is more popular than 'withColumnRenamed' in the business. Of course, this doesn't sound right to me.
2 Answers
If you look at the explain plan, you'll see that both queries generate the same physical plan, so processing-wise they are identical.
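(A quick way to check this yourself, assuming the df and col from the question are in scope: both calls should print the same physical plan.)

# Compare the plans printed by explain(); they should be identical.
df.sort("department", "state").explain()
df.sort(col("department"), col("state")).explain()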
Businesses might have coding guidelines that specify what to use. If they exist, follow them. If not, and you're working on existing code, it's usually best to follow what is already there. Otherwise it's mainly preference; I'm not aware of a 'standard business form' of PySpark.
In the case of alias vs. withColumnRenamed, there is an argument to be made in favor of alias if you're renaming multiple columns: selecting with alias generates a single projection in the parsed logical plan, whereas multiple withColumnRenamed calls generate multiple projections.
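(To illustrate, here is a sketch of the two renaming styles on the question's df, with made-up target names; comparing the explain(extended=True) output of the two results shows a single Project node for the alias version and one per call for the withColumnRenamed chain, at least in the parsed plan.)

from pyspark.sql.functions import col

# One select with aliases -> a single projection in the parsed logical plan.
renamed1 = df.select(
    col("employee_name").alias("name"),
    col("department").alias("dept"),
    col("state").alias("st"),
    col("salary"),
    col("age"),
    col("bonus"),
)

# Chained withColumnRenamed calls -> one projection per call in the parsed plan.
renamed2 = (
    df.withColumnRenamed("employee_name", "name")
      .withColumnRenamed("department", "dept")
      .withColumnRenamed("state", "st")
)

renamed1.explain(extended=True)
renamed2.explain(extended=True)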
To be certain that the two versions do the same thing, we can have a look at the source code of dataframe.py. If you start from the signature of the sort method and follow the various method calls, you end up on the line that converts every passed column, whether it is a string or a Column (cf. the method signature), to a Java column. From then on, only the Java columns are used, regardless of how they were passed to the method, so the two versions of the sort method do the exact same thing with the exact same code.
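(As an illustration of that normalization step, here is a simplified, self-contained stand-in for the pattern; this is hypothetical code, not the actual PySpark source, where the conversion goes through a private helper named _to_java_column.)

# Hypothetical, simplified stand-in for the PySpark internals described above,
# only meant to illustrate how strings and Column objects are normalized.
class Column:
    def __init__(self, name):
        self.name = name

def _to_internal(c):
    # Accept either a plain string or a Column and return the same
    # internal representation for both.
    return c.name if isinstance(c, Column) else c

def sort(*cols):
    internal = [_to_internal(c) for c in cols]
    # From here on only 'internal' is used, so it no longer matters
    # which form the caller passed in.
    return internal

assert sort("department", "state") == sort(Column("department"), Column("state"))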