Accessing dynamically generated nested dictionaries

Posted on 2024-10-26 08:11:49

The goal is to compare the words in a document against the words in a set of documents as fast as possible (that is, to build a term-document matrix). Can this be done with Lucene, and would it be fast?

My idea, if I implement it myself, is to build a tree of terms for each document and then combine the trees to form the set. To build the trees, I would use nested dictionaries, with each dictionary key being a single character. Each character position in a term would be a different level in the hierarchy.

Positions

For example, using the sample string "This is a test", the tree would look like:

t
 h
  i
   s
 e
  s
   t
i
 s
a

Notice that the 't' in the first level appears only once. The first dictionary would contain the keys {'t','i','a'}. There would be three second-level dictionaries, one under each of those keys: {'h','e'} under 't', {'s'} under 'i', and an empty dictionary under 'a'.
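
Written out by hand as a Python literal, the tree drawn above would look roughly like this. This is only a sketch to show the shape; the variable name tree is chosen for illustration and nothing here is built dynamically.

```python
# Hand-written nested dictionaries for the words "this", "is", "a", "test".
# Each key is one character; each value is the dictionary for the next level.
tree = {
    "t": {
        "h": {"i": {"s": {}}},   # t-h-i-s
        "e": {"s": {"t": {}}},   # t-e-s-t
    },
    "i": {"s": {}},              # i-s
    "a": {},                     # a
}

print(sorted(tree))       # ['a', 'i', 't']
print(sorted(tree["t"]))  # ['e', 'h']
```

Note that this plain layout has no way to say where a word ends, so "te" and "test" look equally valid as paths through the tree.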

This should make lookup extremely fast: the maximum number of steps in a lookup loop would be the number of characters in the longest word. The part that is making my brain fold in on itself is how to build a dictionary like this dynamically, and specifically how to access the correct level.

So far I have something to the effect of:

def addTerm(self, term):
   current_level = 0;
   for character in list(term):
      character = character.lower()
      if re.match("[a-z]",character):
         self.tree[character] = {}
         current_level += 1

熊抱啵儿 2024-11-02 08:11:49

I can see a few problems with your current implementation. How do you mark whether a node in the trie is the end of a word? A better implementation would be to initialize the tree to something like tree = [{}, None], where the first element holds the child nodes and the second stays None unless the current node is the end of a word.
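
To make that node layout concrete, here is a hand-written sketch (not part of the original answer) of what tree would hold after the words "a" and "at" have been added, assuming every node is a two-item list of [children, word or None].

```python
# Each node is [children, word]: children maps a letter to a child node,
# and word is None unless a term ends at this node.
tree = [
    {
        "a": [                      # node reached by "a"
            {"t": [{}, "at"]},      # child node reached by "at"
            "a",                    # "a" itself is a complete word
        ]
    },
    None,                           # the root is not the end of any word
]

a_node = tree[0]["a"]
print(a_node[1])          # a
print(a_node[0]["t"][1])  # at
```

The second slot is what lets the trie tell a complete word apart from a mere prefix of a longer one.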

Your addTerm method could then be something like:

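# Assumes `import re` at the top of the module
def addTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        # Only letters a-z become levels in the trie; other characters are skipped
        if re.match("[a-z]", c):
            # Descend into the child for c, creating it if it does not exist yet
            node = node[0].setdefault(c, [{}, None])
    # Mark the node reached by the last letter as the end of a word
    node[1] = term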

You could set node[1] to True instead if you don't care which word is stored at the node.

Checking whether a word is in the trie would be something like:

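def findTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        if re.match("[a-z]", c):
            if c in node[0]:
                node = node[0][c]
            else:
                # No child for this letter, so the term was never added
                return False
    # A path may exist only as a prefix of a longer word, so check the marker
    return node[1] is not None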
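
Putting the two methods together, a minimal runnable sketch might look like the following. The class name Trie and the constructor are assumptions added for illustration; only addTerm and findTerm come from the answer above.

```python
import re


class Trie:
    """Hypothetical wrapper class around the [children, word] node layout."""

    def __init__(self):
        # Root node: no children yet, and not the end of any word
        self.tree = [{}, None]

    def addTerm(self, term):
        node = self.tree
        for c in term.lower():
            if re.match("[a-z]", c):
                # Reuse the child for this letter, or create an empty one
                node = node[0].setdefault(c, [{}, None])
        node[1] = term

    def findTerm(self, term):
        node = self.tree
        for c in term.lower():
            if re.match("[a-z]", c):
                if c not in node[0]:
                    return False
                node = node[0][c]
        return node[1] is not None


trie = Trie()
for word in "This is a test".split():
    trie.addTerm(word)

print(trie.findTerm("test"))  # True
print(trie.findTerm("te"))    # False: "te" is only a prefix of "test"
print(trie.findTerm("this"))  # True: lookups are case-insensitive
print(trie.findTerm("that"))  # False
```

The "te" case shows why the end-of-word marker matters: the path exists in the trie, but no term ends there.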
