How can I do semantic string matching with gensim in Python?

Posted 2025-01-20 12:13:32

In Python, how can we determine whether a string has a semantic relation to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

and the strings we want to check for a semantic relation are:

'I have an apple in my basket', 'I have a car in my house'

Result:

As we know, the first item, 'I have an apple in my basket', is related to our phrase.



Comments (1)

情深缘浅 2025-01-27 12:13:32

You can use the gensim library to implement MatchSemantic and write code like this as a function (see the full code here):

Initialization


  1. Install gensim and numpy:
pip install numpy
pip install gensim

Code


  1. First, import the required modules:
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
  2. Use this function to check whether your strings and sentences match the phrase you want:
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    # SoftCosineSimilarity expects more than one document, so pad with an empty one
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix.
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    return index[query_tf]
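To see what the regex cleanup inside preprocess actually does, here is a minimal standalone sketch (the input string is invented for illustration): HTML image tags and bare URLs are replaced with placeholder tokens, and any other HTML tags are stripped, before tokenization happens.

```python
from re import sub

def clean(doc):
    # Same substitutions used inside preprocess above
    doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
    doc = sub(r'<[^<>]+(>|$)', " ", doc)
    doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
    return doc

print(clean('<img src="apple.png"> see https://example.com for fruit'))
```

The placeholders keep image and link positions visible to the model while discarding markup that carries no semantic content.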

Note:
The first time you run the code, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that, everything is cached locally and you can simply run the code.
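The download only happens once, but api.load still re-opens the vectors on every call to MatchSemantic. One possible refinement (a sketch, not part of the original answer; load_vectors_once is a hypothetical helper) is to cache the loaded vectors in memory:

```python
_cache = {}

def load_vectors_once(name="glove-wiki-gigaword-50", loader=None):
    """Load the named word vectors once and reuse them on later calls.

    loader defaults to gensim.downloader.load; it is a parameter mainly
    so the caching behaviour can be exercised without a network download.
    """
    if loader is None:
        import gensim.downloader as api  # deferred import, paid only on first use
        loader = api.load
    if name not in _cache:
        _cache[name] = loader(name)
    return _cache[name]
```

MatchSemantic could then call load_vectors_once() instead of api.load(...) so that repeated queries skip the reload.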

Usage


For example, we want to see whether Fruit and Vegetables matches any of the sentences or items inside documents.

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)

As we know, the first item, 'I have an apple in my basket', has a semantic relation to 'Fruit and Vegetables', so its score will be about 0.189; no relation is found for the second item, so its score will be 0.

Output:

0.189    # I have an apple in my basket
0.000    # I have a car in my house
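SoftCosineSimilarity returns raw scores rather than a yes/no answer, so in practice you would pick a cutoff yourself. A small sketch (filter_matches and the 0.1 threshold are illustrative choices, not from the original answer):

```python
def filter_matches(documents, scores, threshold=0.1):
    """Keep only the documents whose soft-cosine score exceeds the threshold."""
    return [doc for doc, score in zip(documents, scores) if score > threshold]

documents = ['I have an apple in my basket', 'I have a car in my house']
scores = [0.189, 0.000]  # the scores shown in the output above
print(filter_matches(documents, scores))  # ['I have an apple in my basket']
```

A sensible threshold depends on the embedding model and on how long your documents are, so it is worth calibrating it on a few labelled examples.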
