Using PyLucene as a K-NN classifier

Posted on 2024-10-30 08:06:50

I have a dataset composed of millions of examples, where each example contains 128 continuous-valued features and is labeled with a class name. I'm trying to find a large, robust database/index to use as a KNN classifier for high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and even then the data has to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?

I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.

I realize Lucene is designed as a text indexing tool, and not as a general purpose classifier, but is it possible to use it in this way?

Comments (2)

鹿童谣 2024-11-06 08:06:50

Lucene doesn't seem like the right choice given what you've told us. Lucene would give you a way to store the data, but in terms of retrieval, it's not designed to do anything but search over textual strings.

Since K-NN is so simple, you might be better off creating your own data store in a typical RDBMS or something like Berkeley DB. You could create keys/indices based on sub-hypercubes of the various dimensions to speed things up - start at the bucket of the item to be classified and move outward...
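A minimal in-memory sketch of the sub-hypercube idea, in plain Python. Everything here is an assumption made for illustration: the `GridKNN` class, the `index_dims` and `cell_width` parameters, and the use of a dict in place of an RDBMS or Berkeley DB table keyed by cell. It buckets each point by a coarse grid over a few chosen dimensions, then searches rings of cells outward from the query's cell and stops once no unvisited cell can hold a closer point.

```python
import heapq
import itertools
import math
from collections import Counter, defaultdict


class GridKNN:
    """Toy k-NN index that buckets points by a coarse grid over a few dimensions."""

    def __init__(self, index_dims, cell_width):
        self.index_dims = tuple(index_dims)  # feature positions used to build the key
        self.cell_width = float(cell_width)  # edge length of each sub-hypercube
        self.buckets = defaultdict(list)     # cell key -> [(vector, label), ...]
        self.size = 0

    def _cell(self, vector):
        return tuple(math.floor(vector[d] / self.cell_width) for d in self.index_dims)

    def add(self, vector, label):
        self.buckets[self._cell(vector)].append((tuple(vector), label))
        self.size += 1

    def _ring(self, center, radius):
        """Yield keys of cells whose Chebyshev distance from `center` is `radius`."""
        offsets = range(-radius, radius + 1)
        for delta in itertools.product(offsets, repeat=len(center)):
            if max(abs(x) for x in delta) == radius:
                yield tuple(c + x for c, x in zip(center, delta))

    def nearest(self, query, k):
        """Return up to k (distance, label) pairs, closest first."""
        center = self._cell(query)
        best = []   # max-heap of the k closest so far, stored as (-distance, label)
        seen = 0
        radius = 0
        while seen < self.size:
            for key in self._ring(center, radius):
                for vector, label in self.buckets.get(key, ()):
                    seen += 1
                    dist = math.dist(query, vector)
                    if len(best) < k:
                        heapq.heappush(best, (-dist, label))
                    elif dist < -best[0][0]:
                        heapq.heapreplace(best, (-dist, label))
            # Every point in an unvisited cell is at least radius * cell_width away,
            # so the search can stop once the k-th best is within that bound.
            if len(best) == k and -best[0][0] <= radius * self.cell_width:
                break
            radius += 1
        return sorted((-neg_dist, label) for neg_dist, label in best)

    def classify(self, query, k):
        """Majority vote over the k nearest labels."""
        votes = Counter(label for _, label in self.nearest(query, k))
        return votes.most_common(1)[0][0]


index = GridKNN(index_dims=(0, 1), cell_width=1.0)
index.add((0.1, 0.2, 0.0), "a")
index.add((0.3, 0.1, 0.1), "a")
index.add((5.0, 5.0, 5.0), "b")
index.add((5.2, 4.9, 5.1), "b")
index.add((5.1, 5.3, 4.8), "b")
print(index.classify((5.0, 5.1, 5.0), k=3))
```

Note that the number of cells in a ring grows exponentially with the number of indexed dimensions, which is why the sketch keys on only a couple of them rather than all 128. How well this works on real data depends on how selective those dimensions are.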

旧夏天 2024-11-06 08:06:50

This is done in Lucene already with geospatial searches. Of course, the built-in geospatial searches only use two dimensions, so you'll have to modify it a bit. But the basic idea of using numeric range queries will work.
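The range-query idea can be sketched without Lucene at all. The code below is not the Lucene or PyLucene API; `RangeFilterKNN`, `index_dims`, and `half_width` are hypothetical names, and a sorted list per dimension stands in for a numeric field index. It filters candidates with a box of range queries around the query point, ranks the survivors by exact distance, and widens the box when too few points fall inside.

```python
import bisect
import math
from collections import Counter


class RangeFilterKNN:
    """Toy k-NN: per-dimension range filters first, exact distance ranking second."""

    def __init__(self, vectors, labels, index_dims):
        self.vectors = [tuple(v) for v in vectors]
        self.labels = list(labels)
        self.index_dims = tuple(index_dims)
        # One sorted (value, row_id) list per indexed dimension.
        self.sorted_columns = {
            d: sorted((v[d], i) for i, v in enumerate(self.vectors))
            for d in self.index_dims
        }

    def _range(self, dim, low, high):
        """Row ids whose value in `dim` lies in the closed interval [low, high]."""
        column = self.sorted_columns[dim]
        start = bisect.bisect_left(column, (low, -1))
        stop = bisect.bisect_right(column, (high, len(self.vectors)))
        return {row_id for _, row_id in column[start:stop]}

    def nearest(self, query, k, half_width=1.0):
        """Return up to k (distance, label) pairs, closest first."""
        if half_width <= 0:
            raise ValueError("half_width must be positive")
        k = min(k, len(self.vectors))
        while True:
            candidates = set(range(len(self.vectors)))
            for d in self.index_dims:
                candidates &= self._range(d, query[d] - half_width, query[d] + half_width)
            ranked = sorted((math.dist(query, self.vectors[i]), i) for i in candidates)
            # A point outside the box differs by more than half_width in some indexed
            # dimension, so the answer is exact once the k-th hit is within half_width.
            exact = len(ranked) >= k and (k == 0 or ranked[k - 1][0] <= half_width)
            if exact or len(candidates) == len(self.vectors):
                return [(dist, self.labels[i]) for dist, i in ranked[:k]]
            half_width *= 2  # too few close candidates: widen the box and retry

    def classify(self, query, k, half_width=1.0):
        """Majority vote over the k nearest labels."""
        votes = Counter(label for _, label in self.nearest(query, k, half_width))
        return votes.most_common(1)[0][0]


vectors = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.0), (4.0, 4.0, 4.0), (4.1, 3.9, 4.2), (9.0, 9.0, 9.0)]
labels = ["a", "a", "b", "b", "c"]
knn = RangeFilterKNN(vectors, labels, index_dims=(0, 2))
print(knn.classify((4.0, 4.0, 4.1), k=2, half_width=0.5))
```

As the note in the answer says, whether this stays fast in 128 dimensions is an open question: the range filters only help if a small box on the indexed dimensions already excludes most of the data.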

(Note: I'm not aware of anyone doing high-dimensional kNN with Lucene. So I can't comment on how fast it will be.)
