Python: dictionary of list of lists
def makecounter():
    return collections.defaultdict(int)

class RankedIndex(object):
    def __init__(self):
        self._inverted_index = collections.defaultdict(list)
        self._documents = []
        self._inverted_index = collections.defaultdict(makecounter)

    def index_dir(self, base_path):
        num_files_indexed = 0
        allfiles = os.listdir(base_path)
        self._documents = os.listdir(base_path)
        num_files_indexed = len(allfiles)
        docnumber = 0
        self._inverted_index = collections.defaultdict(list)
        docnumlist = []
        for file in allfiles:
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()
            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
        print '\n \n'
        print 'Dictionary contents: \n'
        for term in sorted(self._inverted_index):
            print term, '->', self._inverted_index.get(term)
        return num_files_indexed
        return 0
I get an IndexError when I run this code: list index out of range.
The code above is meant to build a dictionary index that stores each term as a key and, as its value, the list of document numbers in which the term occurs.
For example, if the term 'cat' occurs in documents 1.txt, 5.txt and 7.txt, the dictionary will have:
cat <- [1,5,7]
Now I have to modify it to add term frequency, so if the word cat occurs twice in document 1, three times in document 5 and once in document 7:
expected result:
term <- [[docnumber, term freq], [docnumber, term freq]] <-- list of lists in a dict!
cat <- [[1,2],[5,3],[7,1]]
I played around with the code, but nothing works. I have no idea how to modify this data structure to achieve the above.
Thanks in advance.
Comments (3)
First, use a factory. Start with a makecounter() function that returns collections.defaultdict(int), and later use collections.defaultdict(makecounter) as self._inverted_index. Then, as the body of the
for term in tokens:
loop, just increment self._inverted_index[term][docnumber] by 1.
This leaves in each self._inverted_index[term] a dict such as {1: 2, 5: 3, 7: 1} in your example case. Since you want instead a list of lists in each self._inverted_index[term], just after the end of the looping convert each of those dicts into a sorted list of [docnumber, count] pairs.
Once made (this way or any other; I'm just showing a simple way to construct it!), this data structure will of course be as awkward to use as you needlessly made it difficult to construct (the dict of dicts is much more useful and easier to use as well as to construct), but, hey, one man's meat, etc. ;-)
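The factory approach described in this answer could look roughly like the sketch below. It is written for Python 3 rather than the question's Python 2, and build_index and its docs argument (one list of tokens per document, numbered from 1) are hypothetical stand-ins for the question's index_dir and tokenize, which are not shown in full.

```python
import collections


def makecounter():
    # Factory: every new term gets its own {docnumber: count} counter.
    return collections.defaultdict(int)


def build_index(docs):
    """docs is a hypothetical list of token lists, one per document."""
    inverted_index = collections.defaultdict(makecounter)
    for docnumber, tokens in enumerate(docs, start=1):
        for term in tokens:
            # Missing terms and missing docnumbers are created on demand.
            inverted_index[term][docnumber] += 1
    # After the loop, turn each inner dict into a sorted list of
    # [docnumber, count] pairs, the shape the question asks for.
    return {
        term: [[d, counts[d]] for d in sorted(counts)]
        for term, counts in inverted_index.items()
    }


docs = [["cat", "cat", "dog"], ["dog"], ["cat"]]
index = build_index(docs)
print(index["cat"])  # [[1, 2], [3, 1]]
```

Because the nested defaultdict creates entries on first access, the loop needs no membership test, which is where the question's code hit the IndexError.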
A general algorithm you could use, though you will have to adapt some of your code to it, is to build a dict containing a dictionary of word counts for each file.
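A minimal sketch of such an algorithm, assuming Python 3, might look like this. The count_words name, the in-memory texts mapping used in place of real files, and the whitespace tokenization are all assumptions made for illustration; the question's own tokenize method is not shown.

```python
import collections


def count_words(texts):
    """texts is a hypothetical {filename: file contents} mapping."""
    counts = {}
    for name, text in texts.items():
        # Assumed tokenization: lowercase, split on whitespace.
        words = text.lower().split()
        counts[name] = dict(collections.Counter(words))
    return counts


texts = {"1.txt": "cat sat cat", "5.txt": "dog cat"}
word_counts = count_words(texts)
print(word_counts["1.txt"]["cat"])  # 2
```

Note that this structure is keyed by file first and word second, so it would need inverting to answer "which documents contain this term?" directly.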
Perhaps you could just create a simple class for (docname, frequency).
Then your dict could have lists of this new data type. You can do a list of lists, too, but a separate data type would be cleaner.
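One way to sketch that idea, assuming Python 3, is shown below. It uses collections.namedtuple as a lightweight substitute for a hand-written class; Posting, build_postings and the docs mapping are hypothetical names, not part of the question's code.

```python
import collections

# Hypothetical record type holding one (docname, frequency) entry.
Posting = collections.namedtuple("Posting", ["docname", "frequency"])


def build_postings(docs):
    """docs is a hypothetical {docname: list of tokens} mapping."""
    index = collections.defaultdict(list)
    for docname in sorted(docs):
        term_counts = collections.Counter(docs[docname])
        for term, freq in sorted(term_counts.items()):
            index[term].append(Posting(docname, freq))
    return dict(index)


docs = {"1.txt": ["cat", "cat"], "5.txt": ["cat", "dog"]}
postings = build_postings(docs)
print(postings["cat"][0].frequency)  # 2
```

Named fields such as posting.frequency read more clearly than positional access like entry[1] on a plain list of lists.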