Euclidean distance or cosine similarity between vector columns
I have a Spark dataframe in the following form:
> df1
+---------------+----------------+
| vector1| vector2|
+---------------+----------------+
|[[0.9,0.5,0.2]]| [[0.1,0.3,0.2]]|
|[[0.8,0.7,0.1]]| [[0.8,0.4,0.2]]|
|[[0.9,0.2,0.8]]| [[0.3,0.1,0.8]]|
+---------------+----------------+
> df1.printSchema()
root
|-- vector1: array (nullable = true)
| |-- element: vector (containsNull = true)
|-- vector2: array (nullable = true)
| |-- element: vector (containsNull = true)
I need to calculate Euclidean distance or cosine similarity between vector1
and vector2
columns.
How can I do this using PySpark?
2 Answers
● When columns are of array type:
Full test:
● If columns are of vector type, I would first convert them to arrays:
Full test:
● Let's try a pandas UDF. It's vectorised and faster.