How to write a very large dictionary to a JSON file?
Please tell me what I am doing wrong. I want to create a dictionary from a lot of data. I have a file of 200,000 articles containing 125,000 unique words; it looks like this:
data = [[['article_1', 'city', 0.43], ['article_1', 'big', 0.38], ['article_1', 'beautiful', 0.25]],
[['article_2', 'sun', 0.65], ['article_2', 'beautiful', 0.41], ['article_2', 'shining', 0.21]],
[['article_3', 'big', 0.72], ['article_3', 'beautiful', 0.50], ['article_3', 'butterfly', 0.25]]]
That is, each inner list is a separate article and consists of entries of the form [article_number, word, word weight (tf-idf)].
I need to get a dictionary of the form:
{'city': [('article_1', 0.43)],
'big': [('article_3', 0.72), ('article_1', 0.38)],
'beautiful': [('article_3', 0.50), ('article_2', 0.41), ('article_1', 0.25)],
'sun': [('article_2', 0.65)],
'shining': [('article_2', 0.21)],
'butterfly': [('article_3', 0.25)]}
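To make the grouping concrete, here is a small sketch run only on the sample above (just to illustrate the target structure; the descending sort by weight is inferred from the example output):

from collections import defaultdict

# Sample input: one inner list per article, entries are [article_id, word, tf-idf weight]
data = [[['article_1', 'city', 0.43], ['article_1', 'big', 0.38], ['article_1', 'beautiful', 0.25]],
        [['article_2', 'sun', 0.65], ['article_2', 'beautiful', 0.41], ['article_2', 'shining', 0.21]],
        [['article_3', 'big', 0.72], ['article_3', 'beautiful', 0.50], ['article_3', 'butterfly', 0.25]]]

index = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        index[word].append((article_id, weight))

# Sort each word's postings by descending weight, matching the example output
for word in index:
    index[word].sort(key=lambda pair: pair[1], reverse=True)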
I'm trying to build the dictionary and write it straight to a JSON file (in Google Colab), but this uses up all the RAM and, as a result, the Google Colab runtime gets disconnected. I use this code:
import json
from collections import defaultdict

with open('drive/MyDrive/dictionary.json', 'w') as f:
    d = defaultdict(list)
    # group (article, weight) pairs by word
    [d[i[1]].append((i[0], i[2])) for j in data for i in j]
    dct = dict(d.items())
    dct = dict(sorted(d.items()))
    # serialize the whole dictionary at once
    f.write(json.dumps(dct) + '\n')
Tell me, please, what I am doing wrong. Or is it better to store such a large amount of data in some other way?
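For reference, a rough sketch of one alternative I am wondering about: writing the index as JSON Lines, one word per line, so the whole dictionary never has to be serialized as a single string. The output file name is just a placeholder, and `data` is the nested list of [article_id, word, weight] triples shown above:

import json
from collections import defaultdict

# Build the word -> [(article_id, weight), ...] mapping as before
index = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        index[word].append((article_id, weight))

# Write one JSON object per line ("JSON Lines"), so json.dumps only ever
# serializes one word's postings at a time instead of the whole dictionary.
with open('drive/MyDrive/dictionary.jsonl', 'w') as f:  # path is a placeholder
    for word in sorted(index):
        postings = sorted(index[word], key=lambda pair: pair[1], reverse=True)
        f.write(json.dumps({word: postings}) + '\n')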