How to write a very large dictionary to a JSON file?


Please tell me what I am doing wrong. I want to create a dictionary from a lot of data. I have a file of 200,000 articles containing 125,000 unique words; it looks like this:

data = [[['article_1', 'city', 0.43], ['article_1', 'big', 0.38], ['article_1', 'beautiful', 0.25]], 
        [['article_2', 'sun', 0.65], ['article_2', 'beautiful', 0.41], ['article_2', 'shining', 0.21]],
        [['article_3', 'big', 0.72], ['article_3', 'beautiful', 0.50], ['article_3', 'butterfly', 0.25]]]

That is, each list is a separate article, which consists of entries of the form [article_number, word, word weight (TF-IDF)].

I need to get a dictionary of the form:

{'city': [('article_1', 0.43)],
 'big': [('article_3', 0.72), ('article_1', 0.38)],
 'beautiful': [('article_3', 0.50), ('article_2', 0.41), ('article_1', 0.25)],
 'sun': [('article_2', 0.65)],
 'shining': [('article_2', 0.21)],
 'butterfly': [('article_3', 0.25)]}

I'm trying to create the dictionary and write it to a JSON file in one go (in Google Colab), but this uses all the RAM and, as a result, the Google Colab runtime shuts down. I use this code:

import json
from collections import defaultdict

# Invert the data: map each word to a list of (article, weight) pairs.
d = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        d[word].append((article_id, weight))
dct = dict(sorted(d.items()))  # sort the words alphabetically

with open('drive/MyDrive/dictionary.json', 'w') as f:
    f.write(json.dumps(dct) + '\n')
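
One thing I've been considering (a minimal sketch, untested at the full 200,000-article scale) is writing the file one key at a time, since json.dumps(dct) builds the entire serialized string in memory on top of the dictionary itself; this way only one word's list is serialized at any moment, though the dictionary still has to fit in RAM:

import json
from collections import defaultdict

d = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        d[word].append((article_id, weight))

# Emit the JSON object one key at a time so the full serialized
# string never exists in memory at once.
with open('drive/MyDrive/dictionary.json', 'w') as f:
    f.write('{')
    for i, word in enumerate(sorted(d)):
        if i:
            f.write(', ')
        f.write(json.dumps(word) + ': ' + json.dumps(d[word]))
    f.write('}\n')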

So, what am I doing wrong? Or would it be better to store such a large amount of data in some other way?
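
On the "some other way" question: one option I've looked at is SQLite via the standard-library sqlite3 module (a sketch; the file name dictionary.db, the table name tfidf, and the index name idx_word are my own placeholders), which keeps the index on disk and lets me look up a single word without loading everything into RAM:

import sqlite3

# Store the inverted index as rows on disk instead of one big JSON blob.
conn = sqlite3.connect('drive/MyDrive/dictionary.db')
conn.execute('CREATE TABLE IF NOT EXISTS tfidf '
             '(word TEXT, article TEXT, weight REAL)')
conn.executemany(
    'INSERT INTO tfidf VALUES (?, ?, ?)',
    ((word, article_id, weight)
     for article in data
     for article_id, word, weight in article))
conn.execute('CREATE INDEX IF NOT EXISTS idx_word ON tfidf (word)')
conn.commit()

# Fetch one word's (article, weight) pairs without touching the rest.
rows = conn.execute('SELECT article, weight FROM tfidf WHERE word = ?',
                    ('beautiful',)).fetchall()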
