KeyError in TF-IDF calculation

Posted on 2025-01-15 14:25:56

I want to calculate the document frequencies of text documents. First I created the term dictionary and calculated the term frequencies. Those steps work fine, but when I call the function below it raises an error:

def computeDF(docList):
    df = {}
    df = dict.fromkeys(docList[0].keys(), 0)
    
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                df[word] += 1

    for word, val in df.items():
        df[word] = float(val)

    return df

I call the function like this:

dictList = []
for i in range(N):
    # creating dictionary for all documents
    tokens = processed_text[i]
    dictionary = dict.fromkeys(tokens,0)

    # calculation of term frequencies for all documents
    for word in tokens:
        dictionary[word] += 1
        tf = termFreq(dictionary, tokens)
        dictList.append(dictionary)

    df = computeDF(dictList)

I called the function with a list of 10 dictionaries, because it works on a list object.

N = 10 (num of documents)
dictList continues like this: dictList

Error:

line 155, in <module> df = computeDF(dictList)

line 134, in computeDF df[word] += 1
KeyError: 'flagstaff'

The function works when I try it in a different Python file with the same object types. I don't understand what the problem is. How can I solve this?

Comments (1)

动听の歌 2025-01-22 14:25:57

Where you have df = dict.fromkeys(docList[0].keys(), 0), you need something like this:

keys = set()
for doc in docList:
    keys = keys.union(set(doc.keys()))
df = dict.fromkeys(keys, 0)

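That way df has a key for every word in all of your documents, not just the first one. The KeyError occurs because 'flagstaff' appears in a later document but not in docList[0]. If you want to collect the keys in one line, you can do it like this: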

keys = set().union(*[set(doc.keys()) for doc in docList])
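
For reference, here is a minimal sketch of how the whole function might look with the key collection folded in. It reuses the names computeDF and docList from the question; the sample documents at the bottom are made up for illustration and are not the asker's data.

```python
def computeDF(docList):
    """Count, for each word, how many documents contain it at least once."""
    # Build the vocabulary from every document, not only the first one.
    keys = set().union(*(doc.keys() for doc in docList))
    # Start from 0.0 so the counts come out as floats, as in the question.
    df = dict.fromkeys(keys, 0.0)

    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                df[word] += 1

    return df


# Hypothetical sample data: 'flagstaff' is missing from the first document,
# which is the situation that triggered the KeyError in the question.
docs = [
    {"grand": 2, "canyon": 1},
    {"flagstaff": 1, "canyon": 3},
]
result = computeDF(docs)
print(result)
```

Because every word is registered before counting starts, df[word] += 1 can no longer hit a missing key, whatever order the documents arrive in.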