Calculating the distance between a vector and each row of a different Spark dataframe

Posted 2025-02-02 20:15:26


I have two Spark dataframes:

> df1
+--------+-----------------------------+
|   word |                     init_vec|
+--------+-----------------------------+
|  venus |[-0.235, -0.060, -0.609, ...]|
+--------+-----------------------------+


> df2
+-----------------------------+-----+
|                    targ_vec |   id|
+-----------------------------+-----+
|[-0.272, -0.070, -0.686, ...]| 45ha|
+-----------------------------+-----+
|[-0.234, -0.060, -0.686, ...]| 98pb|
+-----------------------------+-----+
|[-0.562, -0.334, -0.981, ...]| c09j|
+-----------------------------+-----+

I need to find the Euclidean distance between init_vec from df1 and each vector in targ_vec of df2, and return the top 3 vectors closest to init_vec.

    > desired_output
    +--------+-----------------------------+-----+----------+
    |   word |                     targ_vec|   id|  distance|
    +--------+-----------------------------+-----+----------+
    |  venus |[-0.234, -0.060, -0.686, ...]| 98pb|some_value|
    +--------+-----------------------------+-----+----------+
    |  venus |[-0.221, -0.070, -0.613, ...]| tg67|some_value|
    +--------+-----------------------------+-----+----------+
    |  venus |[-0.240, -0.091, -0.676, ...]| jhg6|some_value|
    +--------+-----------------------------+-----+----------+

I need to implement this using PySpark.

1 Answer

分开我的手 · 2025-02-09 20:15:26


First do a cross join between df1 and df2 to attach df1.init_vec to every row of df2, then compute the distance with Spark's higher-order array functions:

    from pyspark.sql import functions as f

    joined = df1.crossJoin(df2)  # columns: word, init_vec, targ_vec, id
    joined = joined.withColumn('distance', f.sqrt(f.expr(
        'aggregate(transform(targ_vec, (element, idx) -> power(element - element_at(init_vec, cast(idx + 1 as int)), 2)), cast(0 as double), (acc, value) -> acc + value)'
    )))

Then sort the dataframe by distance and keep the 3 rows with the smallest values.
