存储全文索引以支持不完整单词查找的最精简方法是什么?

发布于 2024-12-16 18:34:28 字数 371 浏览 0 评论 0原文

存储支持不完整单词查找的(大)全文索引的最精简方法是什么?例如,colo 的索引查找应返回 Colorado(除其他外)。作为上下文,我正在对大约 60,000 个地理实体(国家、地区/州、都会区和城市)建立索引。

在我的第一次尝试中,我索引了一个单词中的所有子字符串,从第一个字符开始,从两个字符长度一直到整个单词。例如,对于单词“Colorado”,我创建了以下索引条目:

co
col
colo
color
colora
colorad
colorado

但这导致了 160,000 个索引条目。我试图将其减少到更合理的程度,同时保留匹配不完整单词的能力并防止索引条目的数量激增。我应该考虑哪些优化来使索引更小?

What is the leanest way to store a (big) full-text index that supports lookup of incomplete words? For example, an index lookup for colo should return Colorado (among other things). For context, I am indexing about 60,000 geographical entities (countries, regions/states, metro areas, and cities).

In my first attempt I indexed all substrings in a word starting with the first character from two characters in length up to the full word. For example, for the word "Colorado", I created the following index entries:

co
col
colo
color
colora
colorad
colorado

But that resulted in 160,000 index entries. I'm trying to reduce this down to something more reasonable while retaining the ability to match on incomplete words and keeping the number of index entries from blowing up. What optimizations should I consider to make the index smaller?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

甚是思念 2024-12-23 18:34:28

我的建议是使用 Trie 的空间紧凑版本,例如 Radix Tree 此处在 python 中有一个很好的实现。

radix tree

Web 服务

您可以设置一个单独的 Web 服务器来提供此查找服务,例如,使用 Flask

示例代码

一些示例代码

  • 使用 python-radix-tree 加载预定义的地名,并
  • 完成歧义开始点的前缀,并
  • 查找最多 10 条记录的所有前缀匹配。

如下:

from radix_tree import RadixTree

locations = [
    "los angeles",
    "san diego",
    "san francisco",
    "san marino",
    "santa monica"
]

trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

print trie.complete("s")
print trie.search_prefix('san', 10)

示例代码的结果

san
['santa monica', 'san diego', 'san francisco', 'san marino']

My recommendation is to use a space-compact version of Trie, e.g., Radix Tree. There is a good implementation here in python.

radix tree

Web service

You can set up a separate web server for providing this lookup service, e.g., using Flask.

Sample code

Some sample codes to

  • load predefined place names using python-radix-tree and
  • complete a prefix to the point where ambiguity starts and
  • find all prefix matches up to 10 records.

are at below:

from radix_tree import RadixTree

locations = [
    "los angeles",
    "san diego",
    "san francisco",
    "san marino",
    "santa monica"
]

trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

print trie.complete("s")
print trie.search_prefix('san', 10)

Result of sample code

san
['santa monica', 'san diego', 'san francisco', 'san marino']
梦开始←不甜 2024-12-23 18:34:28

我认为只有当节点有两个子节点时才应该对其进行分支,例如“colorad”上不分支。

我认为您还应该能够将所有内容保存在一个文件中,以避免为存储的每几个字节支付 4KB 的开销,甚至 60,000 个对象也不会很大,平均每行 30 字节为您提供约 1.8 MB :)

I think you should only branch a node if it has two children, e.g. no branching on 'colorad'.

I think you should also be able to keep it all in one file to avoid paying 4KB in overhead for each few bytes you store and even 60,000 objects are not going to be very large, an average of 30 bytes per line gives you ~1.8 MB :)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文