How can I do semantic string matching with gensim in Python?

Posted 2025-01-20 12:13:32

In Python, how can we determine whether a string has a semantic relation to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

and the strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

Result:

As we know, the first item, 'I have an apple in my basket', is related to our phrase.



Comments (1)

情深缘浅 2025-01-27 12:13:32

You can use the gensim library to implement MatchSemantic and write code like this as a function (see the full code here):

Initialization


  1. Install gensim and numpy:
pip install numpy
pip install gensim

Code


  1. First, import the required modules:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
  2. Use this function to check whether your strings and sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # SoftCosineSimilarity expects more than one document, so pad with an empty one
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix.
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]
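To see what the regex cleanup inside preprocess actually does, here is a minimal standalone sketch (the input string is invented for illustration): HTML image tags and bare URLs are replaced with placeholder tokens, and any other HTML tags are stripped, before tokenization happens.

```python
from re import sub

def clean(doc):
    # Same substitutions used inside preprocess above
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return doc

print(clean('<img src="apple.png"> see https://example.com for fruit'))
```

The placeholders keep image and link positions visible to the model while discarding markup that carries no semantic content.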

Note:
The first time you run the code, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that, everything is cached locally and you can simply run the code.
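The download only happens once, but api.load still re-opens the vectors on every call to MatchSemantic. One possible refinement (a sketch, not part of the original answer; load_vectors_once is a hypothetical helper) is to cache the loaded vectors in memory:

```python
_cache = {}

def load_vectors_once(name="glove-wiki-gigaword-50", loader=None):
    """Load the named word vectors once and reuse them on later calls.

    loader defaults to gensim.downloader.load; it is a parameter mainly
    so the caching behaviour can be exercised without a network download.
    """
    if loader is None:
        import gensim.downloader as api  # deferred import, paid only on first use
        loader = api.load
    if name not in _cache:
        _cache[name] = loader(name)
    return _cache[name]
```

MatchSemantic could then call load_vectors_once() instead of api.load(...) so that repeated queries skip the reload.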

Usage


For example, we want to see whether Fruit and Vegetables matches any of the sentences or items inside documents.

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)

As we know, the first item, 'I have an apple in my basket', has a semantic relation to 'Fruit and Vegetables', so its score will be about 0.189; no relation is found for the second item, so its score will be 0.

Output:

0.189    # I have an apple in my basket
0.000    # I have a car in my house
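SoftCosineSimilarity returns raw scores rather than a yes/no answer, so in practice you would pick a cutoff yourself. A small sketch (filter_matches and the 0.1 threshold are illustrative choices, not from the original answer):

```python
def filter_matches(documents, scores, threshold=0.1):
    """Keep only the documents whose soft-cosine score exceeds the threshold."""
    return [doc for doc, score in zip(documents, scores) if score > threshold]

documents = ['I have an apple in my basket', 'I have a car in my house']
scores = [0.189, 0.000]  # the scores shown in the output above
print(filter_matches(documents, scores))  # ['I have an apple in my basket']
```

A sensible threshold depends on the embedding model and on how long your documents are, so it is worth calibrating it on a few labelled examples.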
