用于只读字符串列表(大约 100,000)的最有效的数据结构,具有快速前缀搜索

发布于 2024-07-26 14:44:29 字数 283 浏览 6 评论 0原文

我正在编写一个应用程序,需要从文件中读取字符串列表,将它们保存在数据结构中,然后通过前缀查找这些字符串。 字符串列表只是给定语言中的单词列表。 例如,如果搜索函数获取“stup”作为参数,则它应该返回[“stupid”,“stupidity”,“stupor”...]。 它应该在 O(log(n)*m) 时间内完成,其中 n 是数据结构的大小,m 是结果的数量,并且应该尽可能快。 内存消耗现在不是一个大问题。 我是用 python 写的,所以如果你能向我指出一个用 python 包装器在 c 中实现的合适的数据结构(最好),那就太好了。

I'm writing an application that needs to read a list of strings from a file, save them in a data structure, and then look up those strings by prefixes. The list of strings is simply a list of words in a given language. For example, if the search function gets "stup" as a parameter, it should return ["stupid", "stupidity", "stupor"...]. It should do so in O(log(n)*m) time, where n is the size of the data structure and m is the number of results and should be as fast as possible. Memory consumption is not a big issue right now. I'm writing this in python, so it would be great if you could point me to a suitable data structure (preferably) implemented in c with python wrappers.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

救赎№ 2024-08-02 14:44:58

字符串数组。

然后通过它进行二分搜索以搜索第一个匹配项
然后一步一步地完成所有后续匹配

(我最初也在这里有链接列表...但是当然这没有随机访问,所以这是“bs”(这可能解释了为什么我被否决)。我的二进制文件搜索算法仍然是最快的方法

string array.

then binary search through it to search the first match
then step one by one through it for all subsequent matches

(i originally had linked list here too... but of course this doesn't have random access so this was 'bs' (which probably explains why I was downvoted). My binary search algorithm still is the fastest way to go though

紫﹏色ふ单纯 2024-08-02 14:44:54

trie(或前缀树) 听起来很适合你。 我相信它可以在 O(m) 中对长度为 m 的前缀字符串进行搜索。

A trie (or prefix tree) sounds right up your alley. It can do the search on a prefix string of length m in O(m) I believe.

三岁铭 2024-08-02 14:44:46

你想要尝试一下。

http://en.wikipedia.org/wiki/Trie

我在 Scrabble 中使用过它们令人难以置信的程序。 它们非常适合您描述的用例(快速前缀查找)。

下面是一些用 Python 构建 trie 的示例代码。 这是我几个月前编写的一个 Boggle 程序。 剩下的就留给读者作为练习。 但是对于前缀检查,您基本上需要一个从根节点(变量 words)开始的方法,跟随前缀的字母到连续的子节点,如果找到这样的路径则返回 True,否则返回 False否则。

class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final
        self.children = {}
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
        else:
            child = Node(letter, index==n-1)
            self.children[letter] = child
        child.add(letters, n, index+1)

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

words = load_dictionary('dictionary.txt')

You want a trie.

http://en.wikipedia.org/wiki/Trie

I've used them in Scrabble and Boggle programs. They're perfect for the use case you described (fast prefix lookup).

Here's some sample code for building up a trie in Python. This is from a Boggle program I whipped together a few months ago. The rest is left as an exercise to the reader. But for prefix checking you basically need a method that starts at the root node (the variable words), follows the letters of the prefix to successive child nodes, and returns True if such a path is found and False otherwise.

class Node(object):
    def __init__(self, letter='', final=False):
        self.letter = letter
        self.final = final
        self.children = {}
    def __contains__(self, letter):
        return letter in self.children
    def get(self, letter):
        return self.children[letter]
    def add(self, letters, n=-1, index=0):
        if n < 0: n = len(letters)
        if index >= n: return
        letter = letters[index]
        if letter in self.children:
            child = self.children[letter]
        else:
            child = Node(letter, index==n-1)
            self.children[letter] = child
        child.add(letters, n, index+1)

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

words = load_dictionary('dictionary.txt')
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文