存储全文索引以支持不完整单词查找的最精简方法是什么?
存储支持不完整单词查找的(大)全文索引的最精简方法是什么?例如,colo
的索引查找应返回 Colorado(除其他外)。作为上下文,我正在对大约 60,000 个地理实体(国家、地区/州、都会区和城市)建立索引。
在我的第一次尝试中,我索引了一个单词中的所有子字符串,从第一个字符开始,从两个字符长度一直到整个单词。例如,对于单词“Colorado”,我创建了以下索引条目:
co
col
colo
color
colora
colorad
colorado
但这导致了 160,000 个索引条目。我试图将其减少到更合理的程度,同时保留匹配不完整单词的能力并防止索引条目的数量激增。我应该考虑哪些优化来使索引更小?
What is the leanest way to store a (big) full-text index that supports lookup of incomplete words? For example, an index lookup for colo
should return Colorado (among other things). For context, I am indexing about 60,000 geographical entities (countries, regions/states, metro areas, and cities).
In my first attempt I indexed all substrings in a word starting with the first character from two characters in length up to the full word. For example, for the word "Colorado", I created the following index entries:
co
col
colo
color
colora
colorad
colorado
But that resulted in 160,000 index entries. I'm trying to reduce this down to something more reasonable while retaining the ability to match on incomplete words and keeping the number of index entries from blowing up. What optimizations should I consider to make the index smaller?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我的建议是使用
Trie
的空间紧凑版本,例如 Radix Tree 。 此处在 python 中有一个很好的实现。Web 服务
您可以设置一个单独的 Web 服务器来提供此查找服务,例如,使用 Flask。
示例代码
一些示例代码
python-radix-tree
加载预定义的地名,并如下:
示例代码的结果
My recommendation is to use a space-compact version of
Trie
, e.g., Radix Tree. There is a good implementation here in python.Web service
You can set up a separate web server for providing this lookup service, e.g., using Flask.
Sample code
Some sample codes to
python-radix-tree
andare at below:
Result of sample code
我认为只有当节点有两个子节点时才应该对其进行分支,例如“colorad”上不分支。
我认为您还应该能够将所有内容保存在一个文件中,以避免为存储的每几个字节支付 4KB 的开销,甚至 60,000 个对象也不会很大,平均每行 30 字节为您提供约 1.8 MB :)
I think you should only branch a node if it has two children, e.g. no branching on 'colorad'.
I think you should also be able to keep it all in one file to avoid paying 4KB in overhead for each few bytes you store and even 60,000 objects are not going to be very large, an average of 30 bytes per line gives you ~1.8 MB :)