Using PyLucene as a K-NN classifier
I have a dataset of millions of examples, where each example consists of 128 continuous-valued features and is labeled with a class name. I'm trying to find a large, robust database/index to use as a KNN classifier for high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and in any case it requires the whole dataset to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?
I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.
I realize Lucene is designed as a text indexing tool, not as a general-purpose classifier, but is it possible to use it in this way?
Comments (2)
Lucene doesn't seem like the right choice given what you've told us. Lucene would give you a way to store the data, but in terms of retrieval, it's not designed to do anything but search over textual strings.
Since K-NN is so simple, you might be better off creating your own data store in a typical RDBMS or something like Berkeley DB. You could create keys/indices based on sub-hypercubes of the various dimensions to speed things up: start at the bucket of the item to be classified and move outward...
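To make the bucket idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under assumptions, not a tested design: the class name `GridKNN`, the parameters `cell_size`, `key_dims`, and `max_radius`, and the choice to bucket on just a few of the 128 dimensions are all made up for the example, and a dict stands in for the RDBMS or Berkeley DB table keyed by bucket.

```python
import heapq
import itertools
import math
from collections import Counter, defaultdict


class GridKNN:
    """Toy k-NN classifier that buckets points on a grid over a few dimensions.

    Hypothetical sketch: the dict of buckets stands in for a keyed table in
    an RDBMS or Berkeley DB.
    """

    def __init__(self, cell_size, key_dims):
        self.cell_size = cell_size      # edge length of one hypercube cell
        self.key_dims = key_dims        # dimensions used to build the bucket key
        self.buckets = defaultdict(list)

    def _cell(self, vec):
        # Bucket key: integer grid coordinates over the chosen dimensions.
        return tuple(math.floor(vec[d] / self.cell_size) for d in self.key_dims)

    def add(self, vec, label):
        self.buckets[self._cell(vec)].append((vec, label))

    def _ring(self, center, r):
        # All cells whose grid distance (max over dimensions) from center is exactly r.
        for offset in itertools.product(range(-r, r + 1), repeat=len(center)):
            if max(abs(o) for o in offset) == r:
                yield tuple(c + o for c, o in zip(center, offset))

    def classify(self, vec, k, max_radius=8):
        center = self._cell(vec)
        best = []  # heap of (-distance, label); the root is the worst kept neighbour
        for r in range(max_radius + 1):
            for cell in self._ring(center, r):
                for other, label in self.buckets.get(cell, ()):
                    d = math.dist(vec, other)  # full-dimensional Euclidean distance
                    if len(best) < k:
                        heapq.heappush(best, (-d, label))
                    elif d < -best[0][0]:
                        heapq.heapreplace(best, (-d, label))
            # Points in cells not yet visited differ from the query by more than
            # r * cell_size in at least one key dimension, so they cannot beat
            # a k-th neighbour that is already within that distance.
            if len(best) == k and -best[0][0] <= r * self.cell_size:
                break
        votes = Counter(label for _, label in best)
        return votes.most_common(1)[0][0] if votes else None


index = GridKNN(cell_size=1.0, key_dims=(0, 1))
index.add((0.1, 0.2, 0.0), "a")
index.add((0.3, 0.1, 0.1), "a")
index.add((0.2, 0.4, 0.2), "a")
index.add((5.0, 5.0, 5.0), "b")
index.add((5.2, 4.9, 5.1), "b")
print(index.classify((0.2, 0.2, 0.1), k=3))
```

Note that the number of cells in each ring grows exponentially with the number of key dimensions, which is why the sketch keys on a small subset of dimensions and still computes distances over the full vectors.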
This is done in Lucene already with geospatial searches. Of course, the built-in geospatial searches only use two dimensions, so you'll have to modify it a bit. But the basic idea of using numeric range queries will work.
(Note: I'm not aware of anyone doing high-dimensional kNN with Lucene. So I can't comment on how fast it will be.)
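As a rough illustration of the range-query idea, the Python sketch below filters candidates with one numeric range per dimension, then ranks the survivors by exact distance and widens the ranges if too few are found. It deliberately does not call PyLucene: the sorted per-dimension lists are a plain-Python stand-in for indexed numeric fields, and the names `RangeFilterKNN`, `filter_dims`, `width`, and `max_doublings` are hypothetical.

```python
import bisect
import math
from collections import Counter


class RangeFilterKNN:
    """Toy k-NN classifier that prefilters with per-dimension range lookups.

    Hypothetical sketch: each sorted column plays the role of a numeric
    field that an index could answer range queries on.
    """

    def __init__(self, points, labels, filter_dims):
        self.points = points
        self.labels = labels
        self.filter_dims = filter_dims
        # One sorted list of (value, doc_id) per filtered dimension.
        self.sorted_cols = {
            d: sorted((p[d], i) for i, p in enumerate(points)) for d in filter_dims
        }

    def _range(self, dim, lo, hi):
        # Ids of all points with lo <= value <= hi in the given dimension.
        col = self.sorted_cols[dim]
        start = bisect.bisect_left(col, (lo, -1))
        end = bisect.bisect_right(col, (hi, len(self.points)))
        return {doc_id for _, doc_id in col[start:end]}

    def classify(self, query, k, width=0.5, max_doublings=20):
        ranked = []
        for _ in range(max_doublings):
            # Intersect the range hits, like a conjunction of range queries.
            candidates = None
            for d in self.filter_dims:
                hits = self._range(d, query[d] - width, query[d] + width)
                candidates = hits if candidates is None else candidates & hits
            ranked = sorted(
                (math.dist(query, self.points[i]), i) for i in candidates
            )[:k]
            # Anything outside the box is more than `width` away in some
            # filtered dimension, so the result is exact once the k-th
            # neighbour lies within `width`.
            if len(ranked) == k and ranked[-1][0] <= width:
                break
            width *= 2  # too few close candidates: widen every range and retry
        votes = Counter(self.labels[i] for _, i in ranked)
        return votes.most_common(1)[0][0] if votes else None


points = [
    (0.1, 0.2, 0.0), (0.3, 0.1, 0.1), (0.2, 0.4, 0.2),
    (5.0, 5.0, 5.0), (5.2, 4.9, 5.1),
]
labels = ["a", "a", "a", "b", "b"]
knn = RangeFilterKNN(points, labels, filter_dims=(0, 1))
print(knn.classify((0.2, 0.2, 0.1), k=3))
```

How well this scales to 128 dimensions is exactly the open question raised in the note above; the sketch only shows the mechanics.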