PySpark: How to use the salting technique for skewed aggregation

Posted on 2025-01-09 12:41:27

How can I use the salting technique for a skewed aggregation in PySpark?

Say we have skewed data like the table below. How do we create a salt column and use it in the aggregation?

city	state	count
Lachung	Sikkim	3,000
Rangpo	Sikkim	50,000
Gangtok	Sikkim	3,00,000
Bangalore	Karnataka	2,50,00,000
Mumbai	Maharashtra	2,90,00,000

随风而去 2025-01-16 12:41:27

To use the salting technique on skewed data, we create an extra column, say "salt", that holds a random integer in the range 0 to (spark.sql.shuffle.partitions - 1).
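To see why a salt in that range helps, here is a small pure-Python sketch that needs no Spark cluster. It is only an illustration under stated assumptions: NUM_PARTITIONS stands in for spark.sql.shuffle.partitions, the row counts are the question's numbers scaled down by 1000, and partition_of is a hypothetical CRC32-based stand-in for a hash partitioner (Spark itself hashes with Murmur3).

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # assumed value of spark.sql.shuffle.partitions

# Row counts from the question, scaled down by 1000 to keep the demo fast.
ROWS = {
    ("Lachung", "Sikkim"): 3,
    ("Rangpo", "Sikkim"): 50,
    ("Gangtok", "Sikkim"): 300,
    ("Bangalore", "Karnataka"): 25_000,
    ("Mumbai", "Maharashtra"): 29_000,
}


def partition_of(key):
    """Hypothetical stand-in for a hash partitioner (not Spark's Murmur3)."""
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_PARTITIONS


def max_partition_load(salted, seed=0):
    """Return the row count of the busiest partition."""
    rng = random.Random(seed)
    load = Counter()
    for key, n in ROWS.items():
        for _ in range(n):
            if salted:
                # floor(u * n) for u in [0, 1) gives an integer in 0..n-1
                salt = int(rng.random() * NUM_PARTITIONS)
                load[partition_of(key + (salt,))] += 1
            else:
                load[partition_of(key)] += 1
    return max(load.values())


unsalted_max = max_partition_load(salted=False)
salted_max = max_partition_load(salted=True)
print("largest partition without salt:", unsalted_max)
print("largest partition with salt:   ", salted_max)
```

Without the salt, every Mumbai row hashes to the same partition, so one partition carries at least 29,000 rows. With the salt, each hot key is split into up to 200 sub-keys that land on different partitions.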

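The table should look like the one below, where the "salt" column holds values from 0 to 199 (spark.sql.shuffle.partitions is 200 in this case). You can now groupBy on "city", "state", and "salt" to get partial counts, then group again on "city" and "state" alone and sum the partial counts to get one row per city.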

city	state	salt
Lachung	Sikkim	151
Lachung	Sikkim	102
Lachung	Sikkim	16
Rangpo	Sikkim	5
Rangpo	Sikkim	19
Rangpo	Sikkim	16
Rangpo	Sikkim	102
Gangtok	Sikkim	55
Gangtok	Sikkim	119
Gangtok	Sikkim	16
Gangtok	Sikkim	10
Bangalore	Karnataka	19
Mumbai	Maharashtra	0
Bangalore	Karnataka	199
Mumbai	Maharashtra	190
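Grouping on the salted key alone leaves several rows per city, one for each salt value that occurred, so the partial results still have to be merged. The following pure-Python sketch mimics that two-stage aggregation on a tiny, made-up sample (the row counts and the seed are assumptions chosen for illustration) and checks that it matches a direct count.

```python
import random
from collections import Counter

NUM_SALTS = 200  # assumed number of salt buckets

# Small illustrative sample: one tuple per input row.
rows = (
    [("Lachung", "Sikkim")] * 3
    + [("Rangpo", "Sikkim")] * 50
    + [("Gangtok", "Sikkim")] * 300
)

rng = random.Random(42)

# Stage 1: count per (city, state, salt). Each hot key is split into
# many smaller groups that can be processed in parallel.
partial = Counter()
for city, state in rows:
    salt = rng.randrange(NUM_SALTS)  # integer in 0..NUM_SALTS-1
    partial[(city, state, salt)] += 1

# Stage 2: drop the salt and sum the partial counts per (city, state).
final = Counter()
for (city, state, _salt), n in partial.items():
    final[(city, state)] += n

# Reference result computed without salting.
direct = Counter(rows)

print(len(partial), "partial groups merged into", len(final), "final groups")
print(final == direct)
```

This works for count and sum because partial results can simply be added. An aggregate such as an average would need the partial sums and counts carried separately before the final division.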

code:

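from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# One salt bucket per shuffle partition (200 by default).
num_salts = int(spark.conf.get("spark.sql.shuffle.partitions"))

# floor(rand() * n) yields a uniform integer in 0..n-1.
# (round(rand() * n - 1) could also produce -1.)
salt = f.floor(f.rand() * num_salts).cast("int")

# record_df is assumed to be the skewed input DataFrame with "city" and "state" columns.
result_df = (
    record_df.withColumn("salt", salt)
    # Stage 1: partial counts per (city, state, salt) spread the hot keys.
    .groupBy("city", "state", "salt")
    .agg(f.count("city").alias("partial_count"))
    # Stage 2: merge the partial counts into one row per (city, state).
    .groupBy("city", "state")
    .agg(f.sum("partial_count").alias("count"))
)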

output: