PySpark MLlib approximate nearest neighbor search with multiple keys



I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches against an already transformed Spark DataFrame. However, the API of BucketedRandomProjectionLSH seems to accept only one key at a time. I also want to avoid approxSimilarityJoin, because it only lets you set a distance threshold rather than a fixed k, so each key would end up with a variable number of neighbors (it also fails in my case, reporting that some records have no nearest neighbors within the given threshold).
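For reference, here is a minimal sketch of the API shape I am describing; the toy data, column names, and LSH parameters are placeholders for my actual setup:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the already transformed DataFrame (illustrative only).
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 1.0])),
     (1, Vectors.dense([1.0, -1.0])),
     (2, Vectors.dense([-1.0, -1.0]))],
    ["id", "features"],
)

brp = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes",
    bucketLength=2.0, numHashTables=3,
)
model = brp.fit(df)

# approxNearestNeighbors takes a single key vector, not a DataFrame of keys.
key = Vectors.dense([1.0, 0.0])
top10 = model.approxNearestNeighbors(df, key, numNearestNeighbors=10)

# approxSimilarityJoin takes two DataFrames but only a distance threshold, not a k.
joined = model.approxSimilarityJoin(df, df, threshold=1.5, distCol="distCol")
```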

Currently, the best approach I have come up with is to .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but that is terribly inefficient.
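Continuing the sketch above, the driver-side loop looks roughly like this (keys_df is a stand-in for my 100K-key DataFrame; here it is just the toy df):

```python
# Collect every key vector to the driver and run one lookup per key.
keys_df = df

results = []
for row in keys_df.select("id", "features").collect():   # pulls all keys to the driver
    nn = model.approxNearestNeighbors(df, row["features"], numNearestNeighbors=10)
    results.append((row["id"], [r["id"] for r in nn.collect()]))  # one Spark job per key
```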

Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?

Thank you.
