How to write a very large dictionary to a JSON file?
Please tell me what I am doing wrong. I want to create a dictionary from a lot of data. I have a file of 200,000 articles containing 125,000 unique words; it looks like this:
data = [[['article_1', 'city', 0.43], ['article_1', 'big', 0.38], ['article_1', 'beautiful', 0.25]],
[['article_2', 'sun', 0.65], ['article_2', 'beautiful', 0.41], ['article_2', 'shining', 0.21]],
[['article_3', 'big', 0.72], ['article_3', 'beautiful', 0.50], ['article_3', 'butterfly', 0.25]]]
That is, each inner list is a separate article and consists of entries of the form [article_number, word, word weight (tf-idf)].
I need to get a dictionary of the form:
{'city': [('article_1', 0.43)],
'big': [('article_3', 0.72), ('article_1', 0.38)],
'beautiful': [('article_3', 0.50), ('article_2', 0.41), ('article_1', 0.25)],
'sun': [('article_2', 0.65)],
'shining': [('article_2', 0.21)],
'butterfly': [('article_3', 0.25)]}
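To make the grouping concrete, here is a small sketch run only on the sample above (just to illustrate the target structure; the descending sort by weight is inferred from the example output):

from collections import defaultdict

# Sample input: one inner list per article, entries are [article_id, word, tf-idf weight]
data = [[['article_1', 'city', 0.43], ['article_1', 'big', 0.38], ['article_1', 'beautiful', 0.25]],
        [['article_2', 'sun', 0.65], ['article_2', 'beautiful', 0.41], ['article_2', 'shining', 0.21]],
        [['article_3', 'big', 0.72], ['article_3', 'beautiful', 0.50], ['article_3', 'butterfly', 0.25]]]

index = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        index[word].append((article_id, weight))

# Sort each word's postings by descending weight, matching the example output
for word in index:
    index[word].sort(key=lambda pair: pair[1], reverse=True)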
I'm trying to build the dictionary and write it straight to a JSON file (in Google Colab), but this uses up all the RAM and, as a result, the Google Colab runtime gets disconnected. I use this code:
import json
from collections import defaultdict

with open('drive/MyDrive/dictionary.json', 'w') as f:
    d = defaultdict(list)
    # group (article, weight) pairs by word
    [d[i[1]].append((i[0], i[2])) for j in data for i in j]
    dct = dict(d.items())
    dct = dict(sorted(d.items()))
    # serialize the whole dictionary at once
    f.write(json.dumps(dct) + '\n')
Tell me, please, what I am doing wrong. Or is it better to store such a large amount of data in some other way?
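For reference, a rough sketch of one alternative I am wondering about: writing the index as JSON Lines, one word per line, so the whole dictionary never has to be serialized as a single string. The output file name is just a placeholder, and `data` is the nested list of [article_id, word, weight] triples shown above:

import json
from collections import defaultdict

# Build the word -> [(article_id, weight), ...] mapping as before
index = defaultdict(list)
for article in data:
    for article_id, word, weight in article:
        index[word].append((article_id, weight))

# Write one JSON object per line ("JSON Lines"), so json.dumps only ever
# serializes one word's postings at a time instead of the whole dictionary.
with open('drive/MyDrive/dictionary.jsonl', 'w') as f:  # path is a placeholder
    for word in sorted(index):
        postings = sorted(index[word], key=lambda pair: pair[1], reverse=True)
        f.write(json.dumps({word: postings}) + '\n')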