Encode a PySpark column to create another column of factorized values

Posted on 2025-01-17 09:19:06, 1048 characters, 0 views, 0 comments


I have the following PySpark DataFrame:

+----------------------+
|        Paths         |
+----------------------+
|[link1, link2, link3] |
|[link1, link2, link4] |
|[link1, link2, link3] |
|[link1, link2, link4] |
|         ...          |
+----------------------+

I want to encode the paths into a categorical variable and add this information to the dataframe. The result should be something like this:

+----------------------+----------------------+
|        Paths         |      encodedPaths    |
+----------------------+----------------------+
|[link1, link2, link3] |          1           |
|[link1, link2, link4] |          2           |
|[link1, link2, link3] |          1           |
|[link1, link2, link4] |          2           |
|         ...          |         ...          |
+----------------------+----------------------+

Looking around, I found this solution:

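from pyspark.sql import functions as F
# Give each distinct path an ID, then join the IDs back onto the original rows
indexer = pathsDF.select("Paths").distinct().withColumn("encodedPaths", F.monotonically_increasing_id())
pathsDF = pathsDF.join(indexer, "Paths")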

It should work, but the number of distinct paths differs between the original and the resulting DataFrame. In addition, some values in the encoded column are significantly higher than the number of distinct paths. I did not expect this, since I assumed monotonically_increasing_id would increment linearly.
Do you have other solutions?


一个人练习一个人 2025-01-24 09:19:06

You can use StringIndexer from MLlib (pyspark.ml) after casting the array column to a string:

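from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

# StringIndexer needs a string input, so cast the array column to a string first
df2 = df.withColumn("PathsStr", F.col("Paths").cast("string"))
# or: df2 = df.withColumn("PathsStr", F.concat_ws(",", "Paths"))

stringIndexer = StringIndexer(inputCol="PathsStr", outputCol="encodedPaths")

# Indices start at 0.0, so add 1 to start at 1; keep the original columns plus the encoding
out = (
    stringIndexer.fit(df2).transform(df2)
    .withColumn("encodedPaths", F.col("encodedPaths") + 1)
    .select(*df.columns, "encodedPaths")
)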

out.show(truncate=False)
+---------------------+------------+
|Paths                |encodedPaths|
+---------------------+------------+
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
|[link1, link2, link3]|1.0         |
|[link1, link2, link4]|2.0         |
+---------------------+------------+
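
As for the large values seen in the question: the Spark documentation states that monotonically_increasing_id is guaranteed to be increasing and unique, but not consecutive, because the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below imitates that layout in plain Python, without Spark. The helper name simulated_monotonic_id and the partition sizes are made up for illustration and are not part of any Spark API.

```python
# Pure-Python imitation of the documented ID layout of
# monotonically_increasing_id: partition index in the upper 31 bits,
# record number within the partition in the lower 33 bits.

def simulated_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Hypothetical helper (not a Spark API) that mimics the documented layout."""
    return (partition_index << 33) + row_in_partition

# Assumed example: 4 distinct paths that happen to be spread over 3 partitions.
rows_per_partition = [2, 1, 1]

ids = [
    simulated_monotonic_id(partition, row)
    for partition, n_rows in enumerate(rows_per_partition)
    for row in range(n_rows)
]

print(ids)  # [0, 1, 8589934592, 17179869184]
```

Under this layout the IDs are unique and increasing, yet the largest one is far above the number of distinct paths, which matches what the question describes. If consecutive integer codes are needed, one possible alternative (not tested here) is F.dense_rank() over a Window ordered by the string column, bearing in mind that a window without partitioning moves all rows into a single partition.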
