
How can I detect whether two news articles cover the same topic? (Semantic similarity in Python)

Posted 2024-08-27 22:47:07 · 177 words · 7 views · 0 comments

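I'm trying to scrape the headlines and body text of articles from a few specific sites, similar to what Google does for Google News.

The problem is that different sites may publish articles on the same topic with slightly different wording.

Can anyone tell me what I need to know to write a comparison algorithm that automatically detects similar articles? Alternatively, is there a library that compares texts and returns some kind of similarity rating? The solution needs to be in Python.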


狂之美人 2024-09-03 22:47:07

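I think the easiest way is to use a sentence-similarity model from the Hugging Face Hub through the sentence-transformers library, for example sentence-transformers/all-mpnet-base-v1.

First, install the library:

pip install sentence-transformers

Then the code is quite simple, as you can see on the linked model page: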

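from sentence_transformers import SentenceTransformer
import numpy as np

# The two texts to compare (for example, two article titles or bodies)
sentences = ["Text number 1", "Text number 2"]
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')
# Normalize to unit length so the dot product equals the cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = np.dot(embeddings[0], embeddings[1])
print(similarity)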

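The dot product is the similarity score between the two texts. For unit-length embeddings it equals the cosine similarity, which ranges from -1 to 1: a value near 1 means the texts are semantically very close, and a value near -1 means they point in opposite directions (for more details, see the reference linked in the original answer).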
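To make the score concrete, here is a minimal sketch of how cosine similarity could be computed by hand and used to flag pairs of articles that likely share a topic. The helper names (`cosine_similarity`, `similar_pairs`), the toy 3-dimensional vectors, and the 0.8 threshold are all assumptions for illustration and are not part of the original answer; in practice the vectors would come from `model.encode(...)` and the threshold would need tuning on real articles.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def similar_pairs(embeddings, threshold=0.8):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    pairs = []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy vectors standing in for real article embeddings (hypothetical data)
toy_embeddings = [
    [1.0, 0.0, 0.0],  # article 0
    [0.9, 0.1, 0.0],  # article 1: nearly the same direction as article 0
    [0.0, 1.0, 0.0],  # article 2: unrelated direction
]

matches = similar_pairs(toy_embeddings, threshold=0.8)
for i, j, score in matches:
    print(f"articles {i} and {j} look similar (score {score:.3f})")
```

Comparing every pair is quadratic in the number of articles, which should be fine for a handful of sites but may need a smarter search once the collection grows.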

