访问动态生成的嵌套字典

发布于 2024-10-26 08:11:49 字数 757 浏览 2 评论 0原文

目标是能够尽快将文档中的单词与一组文档中的单词进行比较(创建术语文档矩阵)。如果可能的话可以使用 Lucene 来完成(并且会很快)吗?

我的想法(如果由我完成)是在每个文档中创建一个术语树,然后将树组合在一起以形成集合。为了创建树,我将使用嵌套字典,每个字典键都是一个字符。术语中的每个位置将是层次结构中的不同级别

位置

1
 2
  3
   4
    5
     6

例如,使用示例字符串“这是一个测试”,树将如下所示

t
 h
  i
   s
 e
  s
   t
i
 s
a

请注意,第一级别中的“t”仅存在一次。第一个字典将包含键 {'t','i','a'}。将存在三个包含键 {'h'}{'e'}{'s'} 的二级字典。

这应该会使查找速度变得非常快。循环中的最大步数是最长单词中的字符数。让我的大脑自我折叠的部分是我如何动态地构建这样的字典,特别是访问正确的级别

到目前为止,我有一些效果

def addTerm(self, term):
   current_level = 0;
   for character in list(term):
      character = character.lower()
      if re.match("[a-z]",character):
         self.tree[character] = {}
         current_level += 1 

The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?

My thought (if done by me) is to create a tree of terms in each document and then combine the trees together to make the set. To create the trees, I would use nested dictionaries with each dictionary key being a character. Each position in the term would be a different level in the heirarchy

Positions

1
 2
  3
   4
    5
     6

For example, using a sample string "This is a test" the tree would look like

t
 h
  i
   s
 e
  s
   t
i
 s
a

Notice the 't' in the first level is there only once. The first dictionary would contain the keys {'t','i','a'}. There would be three second level dictionaries containing the keys {'h'}{'e'}{'s'}.

This should make look up extremly fast. The max number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself, is how do I dynamically build a dictionary like this, specifically accessing the correct level

So far I have something to the effect of

def addTerm(self, term):
   current_level = 0;
   for character in list(term):
      character = character.lower()
      if re.match("[a-z]",character):
         self.tree[character] = {}
         current_level += 1 

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

熊抱啵儿 2024-11-02 08:11:49

我可以看到您当前的实施存在一些问题。如何标记 trie 中的节点是否是单词?更好的实现是将树初始化为类似 tree = [{}, None] 的内容,其中 None 指示当前节点是否是单词的结尾。

您的 addTerm 方法可能类似于:

def addTerm(self, term):
   node = self.tree
   for c in term:
      c = c.lower()
      if re.match("[a-z]",c):
         node = node[0].setdefault(c,[{},None])
   node[1] = term

如果您不关心节点上的单词是什么,您可以将 node[1] 设置为 True。

搜索单词是否在 trie 中会是这样的

def findTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        if re.match("[a-z]",c):
            if c in node[0]:
                node = node[0][c]
            else:
                return False
    return node[1] != None

I can see a few problems with your current implementation. How do you mark if a node in the trie is a word? A better implementation would be to initialize tree to something like tree = [{}, None] where None indicates if the current node is the end of a word.

Your addTerm method could then be something like:

def addTerm(self, term):
   node = self.tree
   for c in term:
      c = c.lower()
      if re.match("[a-z]",c):
         node = node[0].setdefault(c,[{},None])
   node[1] = term

You could set node[1] to True if you don't care about what word is at the node.

Searching if a word is in the trie would be something like

def findTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        if re.match("[a-z]",c):
            if c in node[0]:
                node = node[0][c]
            else:
                return False
    return node[1] != None
拿命拼未来 2024-11-02 08:11:49

我最近读到 John Resig 写的一篇关于做类似事情的文章。实现是 Javascript,但是使用这个想法并翻译代码应该很容易做到。至少它会给您带来一些在实施中需要考虑的极端情况问题。

http://ejohn.org/blog/javascript-trie-performance-analysis/

I read an article about doing a very similar thing recently by John Resig. The implementation is Javascript, but using the idea and translating the code should be quite easy to do. At least it will give you some corner-case issues to consider in your implementation.

http://ejohn.org/blog/javascript-trie-performance-analysis/

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文