NLP - how to get a list of frequently asked questions from a list of questions
Everything is in the title: I have a list of several questions as strings, and the idea is to extract another list of frequently asked questions from that first list.
I don't know if this will make sense, but I'll try to explain the approach I tried.
The approach consists of calculating the cosine similarity of each element of the list with all of the other elements, excluding the element currently being processed so that it isn't compared with itself.
A dictionary is then built whose keys are the indexes of the elements being processed, and whose values are lists of the indexes of every element whose cosine similarity with the key's element is above a threshold.
Once the dictionary has been created, the keys whose value lists are longest are considered the frequent questions; after that you can pick the top 10 or any number you'd like.
Firstly, a downside is that it takes a lot of time to execute, given that I have over 60k questions (about 14 days).
Secondly, I don't know if this is the best way to solve the problem. What do you think?
Finally, if you have a clearer, better idea for solving the problem, I'm all ears; it could also help other people with the same issue.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

list_of_questions = ['How does umap know which high dimensional datapoint belongs to which cluster?',...]
score = dict()
threshold = 0.7

# sw contains the set of English stopwords
sw = set(stopwords.words('english'))

for index, main_question in enumerate(list_of_questions):
    similarities = []
    # tokenize the question being processed and remove stop words
    X_set = {w for w in word_tokenize(main_question) if w not in sw}
    for other_index, question_ in enumerate(list_of_questions):
        if other_index == index:
            continue  # skip comparing the question with itself
        Y_set = {w for w in word_tokenize(question_) if w not in sw}
        if not X_set or not Y_set:
            continue  # an empty keyword set would make the cosine undefined
        # form a set containing the keywords of both strings
        rvector = X_set.union(Y_set)
        # binary bag-of-words vectors over the union
        l1 = [1 if w in X_set else 0 for w in rvector]
        l2 = [1 if w in Y_set else 0 for w in rvector]
        # cosine formula: dot product divided by the product of the norms
        c = sum(a * b for a, b in zip(l1, l2))
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
        if cosine > threshold:
            # store the real index rather than list.index(question_),
            # which returns the first match and breaks on duplicates
            similarities.append(other_index)
            print("Cosine similarity: ", cosine)
    score[index] = similarities
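A large part of the runtime comes from re-tokenizing every question inside the inner loop, and the binary vectors themselves are unnecessary: for binary bag-of-words vectors, the cosine reduces to |X ∩ Y| / sqrt(|X| · |Y|) on the keyword sets directly. Here is a minimal sketch of that idea (the names `binary_cosine` and `frequent_question_scores` are made up for illustration, and a plain whitespace tokenizer stands in for `word_tokenize`); it is still O(n²) in the number of questions, but each comparison is much cheaper:

```python
from math import sqrt

def binary_cosine(x: set, y: set) -> float:
    """Cosine of two binary bag-of-words vectors: |x & y| / sqrt(|x| * |y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

def frequent_question_scores(questions, tokenize, stop_words, threshold=0.7):
    # Tokenize every question exactly once, up front, instead of
    # re-tokenizing the whole list inside the outer loop.
    token_sets = [{w for w in tokenize(q) if w not in stop_words}
                  for q in questions]
    return {
        i: [j for j, y in enumerate(token_sets)
            if j != i and binary_cosine(x, y) > threshold]
        for i, x in enumerate(token_sets)
    }

# Toy usage with a whitespace tokenizer (word_tokenize would work the same):
qs = ["how to sort a list", "how to sort a dict", "what is a tuple"]
scores = frequent_question_scores(qs, str.split, {"a", "to", "is"},
                                  threshold=0.6)
# Rank questions by how many near-duplicates they have, keep the top 10:
top = sorted(scores, key=lambda i: len(scores[i]), reverse=True)[:10]
```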
I would suggest the following:
Regarding the similarity between the questions, there are more efficient ways to capture similarity. Check out the textdistance library (https://pypi.org/project/textdistance/), or you can use spaCy for this purpose: https://betterprogramming.pub/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c