Fast in-memory inverted index
I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
I can store all other attributes of the entities in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I don't see how to attach my own custom feature vector with precomputed weights to a document. Any recommendations would be much appreciated!
Thank you.
4 Answers
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically, a zset is a set of members sorted by an associated numeric score.
So you can have one sorted set per feature, where each feature maps to its postings:
feature -> [ {docid, score}, {docid, score}, ... ]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
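A minimal sketch of that layout, assuming the Jedis client (the key names, features, and weights are made up for illustration). With the default SUM aggregate, ZUNIONSTORE with WEIGHTS leaves each docid scored with the sparse dot product between the query's feature weights and the document's:

import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;
import redis.clients.jedis.ZParams;

public class ZsetInvertedIndex {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Posting lists: one sorted set per feature, member = docid, score = weight.
            jedis.zadd("feat:color=red", 0.8, "doc1");
            jedis.zadd("feat:color=red", 0.3, "doc2");
            jedis.zadd("feat:shape=round", 0.5, "doc1");
            jedis.zadd("feat:shape=round", 0.9, "doc3");

            // Query vector {color=red: 0.7, shape=round: 0.3}: ZUNIONSTORE with
            // WEIGHTS and the default SUM aggregate leaves each docid with
            // sum_f(queryWeight(f) * docWeight(f)) -- a sparse dot product.
            ZParams params = new ZParams().weights(0.7, 0.3);
            jedis.zunionstore("tmp:sim", params, "feat:color=red", "feat:shape=round");

            // Top 10 most similar docids, highest dot product first.
            Set<Tuple> top = jedis.zrevrangeWithScores("tmp:sim", 0, 9);
            for (Tuple t : top) {
                System.out.printf("%s -> %.3f%n", t.getElement(), t.getScore());
            }
            jedis.del("tmp:sim");
        }
    }
}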
Have you looked at Terrier? I'm not quite sure whether it has in-memory indexes, but it is far more extensible than Lucene with regard to indexing and scoring.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index, associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It does have an in-memory index implementation.
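As a hedged sketch of that approach: Lucene's analysis-common module ships a DelimitedPayloadTokenFilter that parses tokens of the form term|weight and stores the weight as a float payload. The field name and weights below are illustrative, and the exact classes vary across Lucene versions:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class PayloadIndexing {
    // Analyzer that splits on whitespace and treats everything after '|'
    // in each token as a float payload, e.g. "color=red|0.8".
    static class PayloadAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            WhitespaceTokenizer tok = new WhitespaceTokenizer();
            TokenStream payloads =
                new DelimitedPayloadTokenFilter(tok, '|', new FloatEncoder());
            return new TokenStreamComponents(tok, payloads);
        }
    }

    public static void main(String[] args) throws Exception {
        // In-memory index (ByteBuffersDirectory replaced RAMDirectory in recent versions).
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new PayloadAnalyzer()))) {
            Document doc = new Document();
            // Features as terms, precomputed weights as payloads.
            doc.add(new TextField("features", "color=red|0.8 shape=round|0.5", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}

Reading the payloads back at query time is version-dependent; recent Lucene exposes them through PayloadScoreQuery in the queries module, for example.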
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just look up the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
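For that pair-wise case the computation really is just a sparse dot product plus normalization; a minimal sketch of cosine similarity over feature -> weight maps (all names illustrative):

import java.util.Map;

public final class SparseCosine {
    // Cosine similarity of two sparse vectors stored as feature -> weight maps.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map so the cost is O(min(|a|, |b|)).
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        return dot == 0.0 ? 0.0 : dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}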
If only one entity is given in advance, and you are more interested in a ranking-like scenario, Lucene may be worth a try.
The right place to look would be Lucene's Similarity class: you should be able to adapt it to your needs and set your version as the default.
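A hedged sketch of such an adaptation, assuming a Lucene version where ClassicSimilarity exposes the tf and idf components as overridable methods (the overrides here are purely illustrative):

import org.apache.lucene.search.similarities.ClassicSimilarity;

// Tweak the classic tf.idf scoring: raw term frequency instead of sqrt,
// and no idf damping at all.
public class MySimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return freq; // default is (float) Math.sqrt(freq)
    }

    @Override
    public float idf(long docFreq, long docCount) {
        return 1.0f; // ignore document frequency entirely
    }
}

You would then install it with IndexWriterConfig.setSimilarity(...) at index time and IndexSearcher.setSimilarity(...) at search time.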
I would be careful with expectations for speed gains (w.r.t. iterating through all pairs), however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean ("are all of the AND terms contained? any of the OR terms?"), then scoring what passes. While for tf.idf you lose nothing on the way, for other scoring functions you might.
For more general approaches for efficient approximate nearest neighbor search it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing
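For cosine similarity, the classic family from that page is random-hyperplane hashing: each signature bit is the sign of a projection onto a random hyperplane, so similar vectors agree on most bits with high probability. A minimal sketch, with the dimension and bit count chosen arbitrarily:

import java.util.Random;

// Random-hyperplane LSH for cosine similarity: each bit of the signature is
// the sign of the dot product with one random Gaussian hyperplane.
public class CosineLsh {
    private final double[][] hyperplanes; // numBits x dim

    public CosineLsh(int dim, int numBits, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (int b = 0; b < numBits; b++) {
            for (int d = 0; d < dim; d++) {
                hyperplanes[b][d] = rng.nextGaussian();
            }
        }
    }

    // Pack up to 64 sign bits into one long signature.
    public long signature(double[] v) {
        long sig = 0L;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[b][d] * v[d];
            }
            if (dot >= 0) {
                sig |= 1L << b;
            }
        }
        return sig;
    }

    // Hamming distance between signatures approximates the angle between vectors.
    public static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}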