NLP - how to get a list of frequently asked questions from a list of questions
Everything is in the title: I have a list of several questions as strings, and the idea is to extract another list of frequently asked questions from that first list.
I don't know if this will make sense, but I'll try to explain the approach I tried.
The approach consists of calculating the cosine similarity of each element of the list with all of the other elements, excluding the element currently being processed so that it isn't compared with itself.
A dictionary is then built whose keys are the indexes of the elements being processed, and whose values are lists of the indexes of every element whose cosine similarity with the key's element is above a threshold.
Once the dictionary has been created, the keys whose value lists are longest are considered the frequent questions; after that you can pick the top 10 or any number you'd like.
Firstly, a downside is that it takes a lot of time to execute, given that I have over 60k questions (about 14 days).
Secondly, I don't know if this is the best way to solve the problem. What do you think?
Finally, if you have a clearer, better idea for solving the problem, I'm all ears; it could also help other people with the same issue.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

list_of_questions = ['How does umap know which high dimensional datapoint belongs to which cluster?',...]
score = dict()
threshold = 0.7

# sw contains the set of English stopwords
sw = set(stopwords.words('english'))

for index, main_question in enumerate(list_of_questions):
    similarities = []
    # tokenize the question being processed and remove stop words
    X_set = {w for w in word_tokenize(main_question) if w not in sw}
    for other_index, question_ in enumerate(list_of_questions):
        if other_index == index:
            continue  # skip comparing the question with itself
        Y_set = {w for w in word_tokenize(question_) if w not in sw}
        if not X_set or not Y_set:
            continue  # an empty keyword set would make the cosine undefined
        # form a set containing the keywords of both strings
        rvector = X_set.union(Y_set)
        # binary bag-of-words vectors over the union
        l1 = [1 if w in X_set else 0 for w in rvector]
        l2 = [1 if w in Y_set else 0 for w in rvector]
        # cosine formula: dot product divided by the product of the norms
        c = sum(a * b for a, b in zip(l1, l2))
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
        if cosine > threshold:
            # store the real index rather than list.index(question_),
            # which returns the first match and breaks on duplicates
            similarities.append(other_index)
            print("Cosine similarity: ", cosine)
    score[index] = similarities
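A large part of the runtime comes from re-tokenizing every question inside the inner loop, and the binary vectors themselves are unnecessary: for binary bag-of-words vectors, the cosine reduces to |X ∩ Y| / sqrt(|X| · |Y|) on the keyword sets directly. Here is a minimal sketch of that idea (the names `binary_cosine` and `frequent_question_scores` are made up for illustration, and a plain whitespace tokenizer stands in for `word_tokenize`); it is still O(n²) in the number of questions, but each comparison is much cheaper:

```python
from math import sqrt

def binary_cosine(x: set, y: set) -> float:
    """Cosine of two binary bag-of-words vectors: |x & y| / sqrt(|x| * |y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / sqrt(len(x) * len(y))

def frequent_question_scores(questions, tokenize, stop_words, threshold=0.7):
    # Tokenize every question exactly once, up front, instead of
    # re-tokenizing the whole list inside the outer loop.
    token_sets = [{w for w in tokenize(q) if w not in stop_words}
                  for q in questions]
    return {
        i: [j for j, y in enumerate(token_sets)
            if j != i and binary_cosine(x, y) > threshold]
        for i, x in enumerate(token_sets)
    }

# Toy usage with a whitespace tokenizer (word_tokenize would work the same):
qs = ["how to sort a list", "how to sort a dict", "what is a tuple"]
scores = frequent_question_scores(qs, str.split, {"a", "to", "is"},
                                  threshold=0.6)
# Rank questions by how many near-duplicates they have, keep the top 10:
top = sorted(scores, key=lambda i: len(scores[i]), reverse=True)[:10]
```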
I would suggest the following:
Regarding the similarity between the questions, there are more efficient ways to capture similarity. Check out the textdistance library (https://pypi.org/project/textdistance/), or you can use spaCy for this purpose: https://betterprogramming.pub/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c