Sphinx and GUIDs
We have a database that we would like to index with the Sphinx search engine. The problem is that Sphinx requires an integer ID for each document, but our database has GUIDs (actually, random GUID-like strings) as primary keys. I could generate synthetic IDs on the fly as described in this recipe, but that works only for full indexes. What if I want to have incremental indexes or real-time indexes? Are there any best practices for dealing with GUIDs in Sphinx that would work with incremental and real-time indexes? The databases in question would be quite large, so I wouldn't want to reindex them frequently.
Real-time indexes: you can convert your non-numeric GUID to a numeric value (you can use CRC32, or MD5 cast to a number). Test or google to figure out which algorithm gives fewer collisions on your data.
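As a minimal sketch in MySQL, assuming a table docs with a GUID-like primary key guid (both names are placeholders): take the first 16 hex digits of the MD5 (64 bits) and convert them to an unsigned integer.

    -- Hypothetical table "docs" with a GUID-like primary key "guid".
    -- First half of MD5 = 16 hex digits = 64 bits, cast to BIGINT UNSIGNED.
    SELECT
        CAST(CONV(SUBSTRING(MD5(guid), 1, 16), 16, 10) AS UNSIGNED) AS id,
        title,
        body
    FROM docs;

Note that Sphinx treats document ID 0 as invalid, so in the astronomically unlikely case that a hash comes out as zero you would need to special-case it.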
Once you have numeric IDs you can just INSERT your docs into the Sphinx RT index.
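For example, over SphinxQL, assuming an RT index named rt_docs with full-text fields title and body plus a string attribute guid (declared with rt_attr_string) so you can map matches back to the original key; all of these names are assumptions:

    -- Run over the MySQL protocol against searchd (e.g. listen = 9306:mysql41).
    INSERT INTO rt_docs (id, title, body, guid)
    VALUES
        (1209386870437345843,  -- numeric ID derived from the GUID as above
         'Some title',
         'Some body text',
         '6F9619FF-8B86-D011-B42D-00C04FC964FF');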
Incremental indexes: you can convert GUIDs to numbers using the same method described above, but you will have to record somewhere where the main part of your index ends, so you can build the delta from there. You can do this easily if you have an "updated" or "inserted" column in your table; sql_query_range will be something like "select ... where updated > (select last_updated from sphinx_helper)". You can't base the delta on the IDs in this case, as they're not sequential.
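A sketch of such a main+delta setup in sphinx.conf, assuming a docs table with an updated timestamp and a one-row bookkeeping table sphinx_helper(id, last_updated); all names are hypothetical:

    source main
    {
        # ... db connection settings ...
        # Remember the high-water mark before indexing the main part.
        sql_query_pre = REPLACE INTO sphinx_helper (id, last_updated) \
                        SELECT 1, MAX(updated) FROM docs
        sql_query     = SELECT CAST(CONV(SUBSTRING(MD5(guid), 1, 16), 16, 10) AS UNSIGNED) AS id, \
                        title, body FROM docs
    }

    source delta : main
    {
        # Override the pre-query so building the delta doesn't move the high-water mark.
        sql_query_pre =
        sql_query     = SELECT CAST(CONV(SUBSTRING(MD5(guid), 1, 16), 16, 10) AS UNSIGNED) AS id, \
                        title, body FROM docs \
                        WHERE updated > (SELECT last_updated FROM sphinx_helper WHERE id = 1)
    }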
I've tested the half-of-MD5 approach on 11.8M different domains and no collisions occurred. So if you don't have many more documents than that in your dataset, it should be OK to use half of MD5.
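As a rough sanity check (a standard birthday-bound estimate, not part of the original test): with a 64-bit hash and n documents, the expected collision probability is about n^2 / 2^65, which for n = 11.8M works out to roughly 4 x 10^-6, consistent with seeing no collisions at that scale.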