如何在初始化方面改进我的 Trie 实现？

发布于 2024-12-07 13:56:34 字数 1128 浏览 2 评论 0原文

我正在尝试从一个巨大的单词列表中读入并以一种允许我稍后快速检索的方式存储它们。我首先想到使用 trie，我承认我的实现很幼稚，它基本上是嵌套的哈希表，每个键都是不同的字母。现在，在 trie 中插入一个单词需要很长时间（运行这个程序需要 20 多秒），我想知道是否有人对我可以做些什么来改进我的插入有任何想法？这不是为了家庭作业。

import string
import time

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert_word(self, word):
        current_node = self.root
        for letter in word:
            trie_node = current_node.get_node(letter)
            current_node = trie_node

class TrieNode:

    def __init__(self):
        self.data = {}

    def get_node(self, letter):
        if letter in self.data:
            return self.data[letter]
        else:
            new_trie_node = TrieNode()
            self.data[letter] = new_trie_node
            return new_trie_node

def main():
    start_time = time.time()
    trie = Trie()

    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")

    for word in word_list:
        trie.insert_word(word.lower())

    print time.time() - start_time, "seconds"


if __name__ == "__main__":
    main()

原文

I'm trying to read in from a huge list of words and store them in a way that allows me to make quick retrievals later on. I first thought of using a trie and I'll admit my implementation is naive, it's basically nested hash tables with each key being a different letter. Right now it takes forever to insert a word into the trie (running this program takes 20+ seconds), and I was wondering if anyone had any ideas as to what I could do to improve my insertion? This isn't for homework.

import string
import time

class Trie:

    def __init__(self):
        self.root = TrieNode()

    def insert_word(self, word):
        current_node = self.root
        for letter in word:
            trie_node = current_node.get_node(letter)
            current_node = trie_node

class TrieNode:

    def __init__(self):
        self.data = {}

    def get_node(self, letter):
        if letter in self.data:
            return self.data[letter]
        else:
            new_trie_node = TrieNode()
            self.data[letter] = new_trie_node
            return new_trie_node

def main():
    start_time = time.time()
    trie = Trie()

    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")

    for word in word_list:
        trie.insert_word(word.lower())

    print time.time() - start_time, "seconds"


if __name__ == "__main__":
    main()

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

﹉夏雨初晴づ 2024-12-14 13:56:34

在考虑搜索功能是否正常工作之前，加速 trie 初始化是完全没有意义的。

在 @unutbu 提到的代码中，你为什么认为它会与 < code>{'end':False} 和 pt['end']=True ？

这里有一些测试数据供您使用：

words_to_insert = ['foo', 'foobar']
queries_expecting_true = words_to_insert
queries_expecting_false = "fo foe foob food foobare".split()

这里有另一个想法：您没有表明您除了确定查询词是否存在之外还需要其他任何东西。如果这是正确的，您应该考虑根据内置集对您的 DIY trie 进行基准测试。标准：加载速度（考虑从 pickle 中执行此操作）、查询速度和内存使用情况。

如果您确实想要比 bool 检索更多内容，请用 dict 替换 set 并重新阅读此答案。

如果您确实想在输入字符串中搜索单词，那么您可以考虑 @unutbu 引用的代码，修复了错误并在 find 函数中进行了一些加速（评估 len(input) 仅一次，使用 xrange 而不是 range (Python 2.x))，并删除不必要的 TERMINAL: False 条目：

TERMINAL = None # Marks the end of a word

def build(words, trie=None): # bugs fixed
    if trie is None:
        trie = {}
    for word in words:
        if not word: continue # bug fixed
        pt = trie # bug fixed
        for ch in word:
            pt = pt.setdefault(ch, {})
        pt[TERMINAL] = True
    return trie

def find(input, trie):
    len_input = len(input)
    results = []
    for i in xrange(len_input):
        pt = trie
        for j in xrange(i, len_input + 1):
            if TERMINAL in pt:
                results.append(input[i:j])
            if j >= len_input or input[j] not in pt:
                break
            pt = pt[input[j]]
    return results

或你可以看看Danny Yoo 的快速实现 Aho-Corasick 算法。

It is utterly pointless working on speeding up your trie initialisation before you have considered whether your search facility is working or not.

In the code that @unutbu referred to, why do you imagine it is mucking about with {'end':False} and pt['end']=True ?

Here is some test data for you:

words_to_insert = ['foo', 'foobar']
queries_expecting_true = words_to_insert
queries_expecting_false = "fo foe foob food foobare".split()

And here's another thought: You give no indication that you want anything more than the ability to determine whether a query word is present or not. If that is correct, you should consider benchmarking your DIY trie against a built-in set. Criteria: load speed (consider doing this from a pickle), query speed, and memory usage.

If you do want more retrieved than a bool, then substitute dict for set and re-read this answer.

If you do want to search for words in an input string, then you could consider the code referenced by @unutbu, with bugs fixed and some speedups in the find function (evaluate len(input) only once, use xrange instead of range (Python 2.x)) and the unnecessary TERMINAL: False entries removed:

TERMINAL = None # Marks the end of a word

def build(words, trie=None): # bugs fixed
    if trie is None:
        trie = {}
    for word in words:
        if not word: continue # bug fixed
        pt = trie # bug fixed
        for ch in word:
            pt = pt.setdefault(ch, {})
        pt[TERMINAL] = True
    return trie

def find(input, trie):
    len_input = len(input)
    results = []
    for i in xrange(len_input):
        pt = trie
        for j in xrange(i, len_input + 1):
            if TERMINAL in pt:
                results.append(input[i:j])
            if j >= len_input or input[j] not in pt:
                break
            pt = pt[input[j]]
    return results

or you could look at Danny Yoo's fast implementation of the Aho-Corasick algorithm.

回复收藏 0 原文

丢了幸福的猪 2024-12-14 13:56:34

此处还有 Trie 的替代实现。

比较 Trie.insert_word 与 build：

def build(words,trie={'end':False}):
    '''
    build builds a trie in O(M*L) time, where
        M = len(words)
        L = max(map(len,words))
    '''
    for word in words:
        pt=trie
        for letter in word:
            pt=pt.setdefault(letter, {'end':False})
        pt['end']=True
    return trie

对于 Trie，对于 word 中的每个 letter ，insert_word 调用current_node.get_node(letter)。此方法有一个 if 和 else 块，并且每当到达 else 块时都必须实例化一个新的 TrieNode，然后一个新的键值对被插入到 self.data 字典中。

使用build，trie 本身只是一个字典。对于word 中的每个字母，只需调用一次pt.setdefault(...)。 dict 方法用 C 实现，比用 Python 实现类似代码更快。

timeit 显示大约 2 倍的速度差异（有利于 build）：

def alt_main():
    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")
    return build(word_list)


% python -mtimeit -s'import test' 'test.main()'
10 loops, best of 3: 1.16 sec per loop

% python -mtimeit -s'import test' 'test.alt_main()'
10 loops, best of 3: 571 msec per loop

There is an alternate implementation of Trie here.

Compare Trie.insert_word to build:

def build(words,trie={'end':False}):
    '''
    build builds a trie in O(M*L) time, where
        M = len(words)
        L = max(map(len,words))
    '''
    for word in words:
        pt=trie
        for letter in word:
            pt=pt.setdefault(letter, {'end':False})
        pt['end']=True
    return trie

With Trie, for each letter in word, insert_word calls current_node.get_node(letter). This method has an if and else block, and must instantiate a new TrieNode whenever the else block is reached, and a new key-value pair is then inserted into the self.data dict.

With build, the trie itself is just a dict. For each letter in word, there is simply one call to pt.setdefault(...). dict methods are implemented in C and are faster than implementing similar code in Python.

timeit shows about a 2x speed difference (in favor of build):

def alt_main():
    with open('/usr/share/dict/words', 'r') as dictionary:
        word_list = dictionary.read()
    word_list = word_list.split("\n")
    return build(word_list)


% python -mtimeit -s'import test' 'test.main()'
10 loops, best of 3: 1.16 sec per loop

% python -mtimeit -s'import test' 'test.alt_main()'
10 loops, best of 3: 571 msec per loop

回复收藏 0 原文

~没有更多了~