Calculate the distance between a vector and a different Spark dataframe
I have two Spark dataframes:
> df1
+--------+-----------------------------+
| word | init_vec|
+--------+-----------------------------+
| venus |[-0.235, -0.060, -0.609, ...]|
+--------+-----------------------------+
> df2
+-----------------------------+-----+
| targ_vec | id|
+-----------------------------+-----+
|[-0.272, -0.070, -0.686, ...]| 45ha|
+-----------------------------+-----+
|[-0.234, -0.060, -0.686, ...]| 98pb|
+-----------------------------+-----+
|[-0.562, -0.334, -0.981, ...]| c09j|
+-----------------------------+-----+
I need to find the Euclidean distance between init_vec from df1 and each vector in the targ_vec column of df2, and return the top 3 vectors closest to init_vec.
> desired_output
+--------+-----------------------------+-----+----------+
| word | targ_vec| id| distance|
+--------+-----------------------------+-----+----------+
| venus |[-0.234, -0.060, -0.686, ...]| 98pb|some_value|
+--------+-----------------------------+-----+----------+
| venus |[-0.221, -0.070, -0.613, ...]| tg67|some_value|
+--------+-----------------------------+-----+----------+
| venus |[-0.240, -0.091, -0.676, ...]| jhg6|some_value|
+--------+-----------------------------+-----+----------+
I need to implement this using PySpark.
Comments (1)
After a cross join between df1 and df2 to add df1.init_vec to all the rows of df2, you can compute the distance for each row:
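A minimal sketch of that step, assuming init_vec and targ_vec are array<double> columns; the euclidean_distance UDF name is my own choice, not part of the original answer:

import math

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Euclidean distance between two array<double> columns (illustrative helper)
@F.udf(returnType=DoubleType())
def euclidean_distance(a, b):
    return float(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

# The cross join copies df1.word and df1.init_vec onto every row of df2,
# then the distance is computed for each resulting row
joined = (
    df1.crossJoin(df2)
       .withColumn("distance", euclidean_distance(F.col("init_vec"), F.col("targ_vec")))
)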
Then you can sort the dataframe and keep the 3 rows with the least distance values.
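Continuing from the joined dataframe in the sketch above, ordering by distance and limiting to 3 rows reproduces the desired output layout:

# Sort by distance and keep the 3 closest targ_vec rows for the single word in df1
result = (
    joined.orderBy(F.col("distance").asc())
          .select("word", "targ_vec", "id", "distance")
          .limit(3)
)
result.show(truncate=False)

If df1 held more than one word, a row_number window partitioned by word (pyspark.sql.Window) would keep the 3 closest matches per word instead of a global limit of 3.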