I want to create a large inverted index of around 10^6 terms. What approach would you suggest? I'm thinking of fast binary key-value stores like Tokyo Cabinet, Voldemort, etc. Edit: I've tried MySQL in the past, storing a table of two integers to represent the inverted index, but even with a database index on the first column, queries were very slow. I think that for this kind of workload a SQL database carries too much overhead: transactions, query parsing, etc. I'm looking for technologies or algorithmic approaches that would scale while keeping good response times and performance. I'm rolling my own solution for research purposes.
The question is somewhat vague, so I think the only answer I can give is: use a "generalized inverted index" (GIN index) in PostgreSQL to create whatever kind of inverted index you want. All the hard work is done for you: it uses the write-ahead log for crash safety, uses B-tree structures internally for performance, and is part of a mature database management system.
If your problem is full-text search, then PostgreSQL's full-text search is already built for you and can use GIN internally.
It's very cool that you're trying to roll your own. Perhaps study up on Lucene's inverted index file format?
http://lucene.apache.org/java/3_1_0/fileformats.html
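To give a feel for what that file format does with postings, here is a minimal Python sketch of one idea it relies on: doc IDs for a term are kept sorted, stored as gaps from the previous ID, and each gap is written as a variable-length integer (7 bits per byte, low-order bits first, high bit meaning "more bytes follow"). The function names are made up for illustration, and the sketch leaves out term frequencies, skip data, and everything else the real format carries, so treat it as a starting point rather than Lucene's actual layout.

```python
from typing import Iterable, List


def encode_vint(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, low-order group first.

    The high bit of each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits plus continuation flag
        n >>= 7
    out.append(n)  # final byte, continuation flag clear
    return bytes(out)


def encode_postings(doc_ids: Iterable[int]) -> bytes:
    """Delta-encode a postings list (hypothetical helper, not a Lucene API)."""
    out = bytearray()
    prev = 0
    for doc_id in sorted(set(doc_ids)):
        out += encode_vint(doc_id - prev)  # store the gap, not the ID itself
        prev = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> List[int]:
    """Inverse of encode_postings: rebuild the sorted doc IDs from the gaps."""
    doc_ids: List[int] = []
    current = 0  # running doc ID
    value = 0    # gap being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this gap
        else:
            current += value
            doc_ids.append(current)
            value = 0
            shift = 0
    return doc_ids


postings = [3, 7, 8, 300, 70000]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Because the gaps between neighbouring doc IDs are usually small, most of them fit in a single byte, which is a large part of why this kind of layout stays compact compared with a two-integer row per posting in a SQL table.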
Yes, definitely consider Lucene for indexing, as it's basically the pre-eminent indexer right now. In fact, I'm currently considering it for indexing my database of images. The "default" language is Java, but it has been ported to other languages, such as CLucene for C++ and PyLucene for Python.
A quick tutorial can be found here.