Python: dictionary of lists of lists

Posted 2024-09-26 02:21:14

def makecounter():
     return collections.defaultdict(int)

class RankedIndex(object):
  def __init__(self):
    self._inverted_index = collections.defaultdict(list)
    self._documents = []
    self._inverted_index = collections.defaultdict(makecounter)


def index_dir(self, base_path):
    num_files_indexed = 0
    allfiles = os.listdir(base_path)
    self._documents = os.listdir(base_path)
    num_files_indexed = len(allfiles)
    docnumber = 0
    self._inverted_index = collections.defaultdict(list)

    docnumlist = []
    for file in allfiles: 
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()

            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:  
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1                                           
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
    print '\n \n'
    print 'Dictionary contents: \n'
    for term in sorted(self._inverted_index):
        print term, '->', self._inverted_index.get(term)
    return num_files_indexed
    return 0

Executing this code raises an IndexError: list index out of range.

The above code generates a dictionary index that stores the 'term' as a key and the document numbers in which the term occurs as a list.
For example, if the term 'cat' occurs in documents 1.txt, 5.txt and 7.txt, the dictionary will have:
cat <- [1,5,7]

Now, I have to modify it to add term frequency, so if the word cat occurs twice in document 1, thrice in document 5 and once in document 7:
expected result:
term <-[[docnumber, term freq], [docnumber,term freq]] <--list of lists in a dict!!!
cat <- [[1,2],[5,3],[7,1]]

I played around with the code, but nothing works. I have no clue how to modify this data structure to achieve the above.

Thanks in advance.

Comments (3)

眼眸里的快感 2024-10-03 02:21:14

First, use a factory. Start with:

def makecounter():
    return collections.defaultdict(int)

and later use

self._inverted_index = collections.defaultdict(makecounter)

and use this as the for term in tokens: loop:

        for term in tokens:  
                self._inverted_index[term][docnumber] +=1

This leaves in each self._inverted_index[term] a dict such as

{1:2,5:3,7:1}

in your example case. Since you want instead in each self._inverted_index[term] a list of lists, then just after the end of the looping add:

self._inverted_index = dict((t, [[d, v[d]] for d in sorted(v)])
                            for t, v in self._inverted_index.items())

Once built (this way or any other; I'm just showing a simple way to construct it), this data structure will of course be as awkward to use as it was needlessly difficult to construct (the dict of dicts is much more useful and easier both to use and to construct), but hey, one man's meat, etc. ;-)
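
For reference, here is a minimal self-contained sketch of the steps in this answer. It is not the asker's actual class: build_index and the in-memory docs list are hypothetical stand-ins for index_dir and the files on disk, tokenization is assumed to be a plain whitespace split, and documents are numbered from 1 in list order.

```python
import collections


def make_counter():
    # Factory for the inner mapping: docnumber -> term frequency
    return collections.defaultdict(int)


def build_index(docs):
    """Build {term: [[docnumber, freq], ...]} from a list of document strings.

    `docs` is a hypothetical stand-in for the files read by index_dir.
    """
    counts = collections.defaultdict(make_counter)
    for docnumber, text in enumerate(docs, 1):
        for term in text.split():  # assumption: whitespace tokenization
            counts[term][docnumber] += 1
    # Turn each inner dict into [docnumber, freq] pairs, sorted by docnumber
    return dict((term, [[d, v[d]] for d in sorted(v)])
                for term, v in counts.items())


index = build_index(["cat sat cat", "dog", "cat dog"])
```

With these three sample documents, index["cat"] holds one [docnumber, freq] pair for document 1 and one for document 3. Keeping the intermediate dict of dicts around may still be worthwhile, since looking up a frequency by document number is direct there.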

梦巷 2024-10-03 02:21:14

Here is a general algorithm you could use, but you will have to adapt some of your code to it.
It produces a dict containing a dictionary of word counts for each file.

filedicts = {}
for file in allfiles:
  # one word-count dict per file
  filedict = filedicts[file] = {}

  # terms: the tokens read from this file
  for term in terms:
    filedict.setdefault(term, 0)
    filedict[term] += 1

最好是你 2024-10-03 02:21:14

Perhaps you could just create a simple class for (docname, frequency).

Then your dict could have lists of this new data type. You can do a list of lists, too, but a separate data type would be cleaner.
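
One possible way to sketch this idea is with collections.namedtuple. The names Posting and postings_for, and the docs mapping of document name to token list, are all hypothetical and only serve the illustration.

```python
import collections

# Hypothetical record type for one (document, frequency) entry
Posting = collections.namedtuple("Posting", ["docname", "frequency"])


def postings_for(term, docs):
    """Return a list of Posting entries for `term`, ordered by document name.

    `docs` is an assumed mapping of docname -> list of tokens.
    """
    result = []
    for docname in sorted(docs):
        freq = docs[docname].count(term)
        if freq:  # skip documents that do not contain the term
            result.append(Posting(docname, freq))
    return result


docs = {
    "1.txt": ["cat", "cat"],
    "5.txt": ["cat", "dog", "cat", "cat"],
    "7.txt": ["cat"],
}
cat_postings = postings_for("cat", docs)
```

Each entry can then be read by name, for example cat_postings[0].frequency, instead of by position as in a list of lists.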
