如何在初始化方面改进我的 Trie 实现?
我正在尝试从一个巨大的单词列表中读入并以一种允许我稍后快速检索的方式存储它们。我首先想到使用 trie,我承认我的实现很幼稚,它基本上是嵌套的哈希表,每个键都是不同的字母。现在,在 trie 中插入一个单词需要很长时间(运行这个程序需要 20 多秒),我想知道是否有人对我可以做些什么来改进我的插入有任何想法?这不是为了家庭作业。
import string
import time
class Trie:
def __init__(self):
self.root = TrieNode()
def insert_word(self, word):
current_node = self.root
for letter in word:
trie_node = current_node.get_node(letter)
current_node = trie_node
class TrieNode:
def __init__(self):
self.data = {}
def get_node(self, letter):
if letter in self.data:
return self.data[letter]
else:
new_trie_node = TrieNode()
self.data[letter] = new_trie_node
return new_trie_node
def main():
start_time = time.time()
trie = Trie()
with open('/usr/share/dict/words', 'r') as dictionary:
word_list = dictionary.read()
word_list = word_list.split("\n")
for word in word_list:
trie.insert_word(word.lower())
print time.time() - start_time, "seconds"
if __name__ == "__main__":
main()
I'm trying to read in from a huge list of words and store them in a way that allows me to make quick retrievals later on. I first thought of using a trie and I'll admit my implementation is naive, it's basically nested hash tables with each key being a different letter. Right now it takes forever to insert a word into the trie (running this program takes 20+ seconds), and I was wondering if anyone had any ideas as to what I could do to improve my insertion? This isn't for homework.
import string
import time
class Trie:
def __init__(self):
self.root = TrieNode()
def insert_word(self, word):
current_node = self.root
for letter in word:
trie_node = current_node.get_node(letter)
current_node = trie_node
class TrieNode:
def __init__(self):
self.data = {}
def get_node(self, letter):
if letter in self.data:
return self.data[letter]
else:
new_trie_node = TrieNode()
self.data[letter] = new_trie_node
return new_trie_node
def main():
start_time = time.time()
trie = Trie()
with open('/usr/share/dict/words', 'r') as dictionary:
word_list = dictionary.read()
word_list = word_list.split("\n")
for word in word_list:
trie.insert_word(word.lower())
print time.time() - start_time, "seconds"
if __name__ == "__main__":
main()
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在考虑搜索功能是否正常工作之前,加速 trie 初始化是完全没有意义的。
在 @unutbu 提到的代码中,你为什么认为它会与 < code>{'end':False} 和
pt['end']=True
?这里有一些测试数据供您使用:
这里有另一个想法:您没有表明您除了确定查询词是否存在之外还需要其他任何东西。如果这是正确的,您应该考虑根据内置
集
对您的 DIY trie 进行基准测试。标准:加载速度(考虑从 pickle 中执行此操作)、查询速度和内存使用情况。如果您确实想要比
bool
检索更多内容,请用dict
替换set
并重新阅读此答案。如果您确实想在输入字符串中搜索单词,那么您可以考虑 @unutbu 引用的代码,修复了错误并在
find
函数中进行了一些加速(评估len(input)
仅一次,使用xrange
而不是range
(Python 2.x)),并删除不必要的TERMINAL: False
条目:或你可以看看Danny Yoo 的快速实现 Aho-Corasick 算法。
It is utterly pointless working on speeding up your trie initialisation before you have considered whether your search facility is working or not.
In the code that @unutbu referred to, why do you imagine it is mucking about with
{'end':False}
andpt['end']=True
?Here is some test data for you:
And here's another thought: You give no indication that you want anything more than the ability to determine whether a query word is present or not. If that is correct, you should consider benchmarking your DIY trie against a built-in
set
. Criteria: load speed (consider doing this from a pickle), query speed, and memory usage.If you do want more retrieved than a
bool
, then substitutedict
forset
and re-read this answer.If you do want to search for words in an input string, then you could consider the code referenced by @unutbu, with bugs fixed and some speedups in the
find
function (evaluatelen(input)
only once, usexrange
instead ofrange
(Python 2.x)) and the unnecessaryTERMINAL: False
entries removed:or you could look at Danny Yoo's fast implementation of the Aho-Corasick algorithm.
此处还有 Trie 的替代实现。
比较
Trie.insert_word
与build
:对于
Trie
,对于word
中的每个letter
,insert_word
调用current_node.get_node(letter)
。此方法有一个if
和else
块,并且每当到达else
块时都必须实例化一个新的TrieNode
,然后一个新的键值对被插入到 self.data 字典中。使用
build
,trie 本身只是一个字典。对于word
中的每个字母
,只需调用一次pt.setdefault(...)
。dict
方法用 C 实现,比用 Python 实现类似代码更快。timeit
显示大约 2 倍的速度差异(有利于build
):There is an alternate implementation of Trie here.
Compare
Trie.insert_word
tobuild
:With
Trie
, for eachletter
inword
,insert_word
callscurrent_node.get_node(letter)
. This method has anif
andelse
block, and must instantiate a newTrieNode
whenever theelse
block is reached, and a new key-value pair is then inserted into theself.data
dict.With
build
, the trie itself is just a dict. For eachletter
inword
, there is simply one call topt.setdefault(...)
.dict
methods are implemented in C and are faster than implementing similar code in Python.timeit
shows about a 2x speed difference (in favor ofbuild
):