访问动态生成的嵌套字典
目标是能够尽快将文档中的单词与一组文档中的单词进行比较(创建术语文档矩阵)。如果可能的话可以使用 Lucene 来完成(并且会很快)吗?
我的想法(如果由我完成)是在每个文档中创建一个术语树,然后将树组合在一起以形成集合。为了创建树,我将使用嵌套字典,每个字典键都是一个字符。术语中的每个位置将是层次结构中的不同级别
位置
1
2
3
4
5
6
例如,使用示例字符串“这是一个测试”,树将如下所示
t
h
i
s
e
s
t
i
s
a
请注意,第一级别中的“t”仅存在一次。第一个字典将包含键 {'t','i','a'}。将存在三个包含键 {'h'}{'e'}{'s'} 的二级字典。
这应该会使查找速度变得非常快。循环中的最大步数是最长单词中的字符数。让我的大脑自我折叠的部分是我如何动态地构建这样的字典,特别是访问正确的级别
到目前为止,我有一些效果
def addTerm(self, term):
current_level = 0;
for character in list(term):
character = character.lower()
if re.match("[a-z]",character):
self.tree[character] = {}
current_level += 1
The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?
My thought (if done by me) is to create a tree of terms in each document and then combine the trees together to make the set. To create the trees, I would use nested dictionaries with each dictionary key being a character. Each position in the term would be a different level in the heirarchy
Positions
1
2
3
4
5
6
For example, using a sample string "This is a test" the tree would look like
t
h
i
s
e
s
t
i
s
a
Notice the 't' in the first level is there only once. The first dictionary would contain the keys {'t','i','a'}. There would be three second level dictionaries containing the keys {'h'}{'e'}{'s'}.
This should make look up extremly fast. The max number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself, is how do I dynamically build a dictionary like this, specifically accessing the correct level
So far I have something to the effect of
def addTerm(self, term):
current_level = 0;
for character in list(term):
character = character.lower()
if re.match("[a-z]",character):
self.tree[character] = {}
current_level += 1
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
我可以看到您当前的实施存在一些问题。如何标记 trie 中的节点是否是单词?更好的实现是将树初始化为类似
tree = [{}, None]
的内容,其中 None 指示当前节点是否是单词的结尾。您的 addTerm 方法可能类似于:
如果您不关心节点上的单词是什么,您可以将
node[1]
设置为 True。搜索单词是否在 trie 中会是这样的
I can see a few problems with your current implementation. How do you mark if a node in the trie is a word? A better implementation would be to initialize tree to something like
tree = [{}, None]
where None indicates if the current node is the end of a word.Your addTerm method could then be something like:
You could set
node[1]
to True if you don't care about what word is at the node.Searching if a word is in the trie would be something like
我最近读到 John Resig 写的一篇关于做类似事情的文章。实现是 Javascript,但是使用这个想法并翻译代码应该很容易做到。至少它会给您带来一些在实施中需要考虑的极端情况问题。
http://ejohn.org/blog/javascript-trie-performance-analysis/
I read an article about doing a very similar thing recently by John Resig. The implementation is Javascript, but using the idea and translating the code should be quite easy to do. At least it will give you some corner-case issues to consider in your implementation.
http://ejohn.org/blog/javascript-trie-performance-analysis/