存储全文索引以支持不完整单词查找的最精简方法是什么？

发布于 2024-12-16 18:34:28 字数 371 浏览 5 评论 0原文

存储支持不完整单词查找的（大）全文索引的最精简方法是什么？例如，colo 的索引查找应返回 Colorado（除其他外）。作为上下文，我正在对大约 60,000 个地理实体（国家、地区/州、都会区和城市）建立索引。

在我的第一次尝试中，我索引了一个单词中的所有子字符串，从第一个字符开始，从两个字符长度一直到整个单词。例如，对于单词“Colorado”，我创建了以下索引条目：

co
col
colo
color
colora
colorad
colorado

但这导致了 160,000 个索引条目。我试图将其减少到更合理的程度，同时保留匹配不完整单词的能力并防止索引条目的数量激增。我应该考虑哪些优化来使索引更小？

原文

What is the leanest way to store a (big) full-text index that supports lookup of incomplete words? For example, an index lookup for colo should return Colorado (among other things). For context, I am indexing about 60,000 geographical entities (countries, regions/states, metro areas, and cities).

In my first attempt I indexed all substrings in a word starting with the first character from two characters in length up to the full word. For example, for the word "Colorado", I created the following index entries:

co
col
colo
color
colora
colorad
colorado

But that resulted in 160,000 index entries. I'm trying to reduce this down to something more reasonable while retaining the ability to match on incomplete words and keeping the number of index entries from blowing up. What optimizations should I consider to make the index smaller?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

甚是思念 2024-12-23 18:34:28

我的建议是使用 Trie 的空间紧凑版本，例如 Radix Tree 。此处在 python 中有一个很好的实现。

radix tree

Web 服务

您可以设置一个单独的 Web 服务器来提供此查找服务，例如，使用 Flask。

示例代码

一些示例代码

使用 python-radix-tree 加载预定义的地名，并
完成歧义开始点的前缀，并
查找最多 10 条记录的所有前缀匹配。

如下：

from radix_tree import RadixTree

locations = [
    "los angeles",
    "san diego",
    "san francisco",
    "san marino",
    "santa monica"
]

trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

print trie.complete("s")
print trie.search_prefix('san', 10)

示例代码的结果

san
['santa monica', 'san diego', 'san francisco', 'san marino']

My recommendation is to use a space-compact version of Trie, e.g., Radix Tree. There is a good implementation here in python.

radix tree

Web service

You can set up a separate web server for providing this lookup service, e.g., using Flask.

Sample code

Some sample codes to

load predefined place names using python-radix-tree and
complete a prefix to the point where ambiguity starts and
find all prefix matches up to 10 records.

are at below:

from radix_tree import RadixTree

locations = [
    "los angeles",
    "san diego",
    "san francisco",
    "san marino",
    "santa monica"
]

trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

print trie.complete("s")
print trie.search_prefix('san', 10)

Result of sample code

san
['santa monica', 'san diego', 'san francisco', 'san marino']

回复收藏 0 原文

梦开始←不甜 2024-12-23 18:34:28

我认为只有当节点有两个子节点时才应该对其进行分支，例如“colorad”上不分支。

我认为您还应该能够将所有内容保存在一个文件中，以避免为存储的每几个字节支付 4KB 的开销，甚至 60,000 个对象也不会很大，平均每行 30 字节为您提供约 1.8 MB :)

回复收藏 0 原文

~没有更多了~

关于作者

玻璃人

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

存储全文索引以支持不完整单词查找的最精简方法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

Web 服务

示例代码

示例代码的结果

Web service

Sample code

Result of sample code

关于作者

相关话题

热门标签

推荐作者

琉璃梦幻

qq_4zWU6L

话少情深

西西弗的石头怪

彻夜缠绵

千寻…

友情链接

存储全文索引以支持不完整单词查找的最精简方法是什么？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

Web 服务

示例代码

示例代码的结果

Web service

Sample code

Result of sample code

关于作者

相关话题

热门标签

推荐作者

琉璃梦幻

qq_4zWU6L

话少情深

西西弗的石头怪

彻夜缠绵

千寻…

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。