Encode a PySpark column, creating another column of factorized values
I have the following PySpark dataframe:
+----------------------+
| Paths                |
+----------------------+
|[link1, link2, link3] |
|[link1, link2, link4] |
|[link1, link2, link3] |
|[link1, link2, link4] |
...
..
.
+----------------------+
I want to encode the paths into a categorical variable and add this information to the dataframe. The result should be something like this:
+----------------------+----------------------+
| Paths                | encodedPaths         |
+----------------------+----------------------+
|[link1, link2, link3] | 1                    |
|[link1, link2, link4] | 2                    |
|[link1, link2, link3] | 1                    |
|[link1, link2, link4] | 2                    |
...
..
.
+----------------------+----------------------+
Looking around, I found this solution:
from pyspark.sql import functions as F
indexer = pathsDF.select("Paths").distinct().withColumn("encodedPaths", F.monotonically_increasing_id())
pathsDF = pathsDF.join(indexer, "Paths")
It should work, but the number of distinct paths differs between the original and the resulting dataframe. In addition, some values in the encoded column are significantly higher than the number of distinct paths. I did not expect this, since I assumed monotonically_increasing_id would increment linearly.
Do you have other solutions?
Comments (1)
You can use StringIndexer from MLlib (pyspark.ml.feature) after casting the array column to a string: