Finding related texts (correlation between two texts)
I'm trying to find similar articles in a database via correlation.
So I split the text into an array of words, then delete frequently used words (articles, pronouns and so on), then compare the two texts with a Pearson coefficient function. For some texts it works, but for others it doesn't do so well (larger texts get higher coefficients).
Can somebody advise a good method to find related texts?
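For reference, a minimal sketch of the pipeline described above (the stopword list, helper names, and sample sentences are made up purely for illustration):

```python
# A rough sketch of the approach described in the question: split into words,
# drop a few common words, then correlate the two term-count vectors.
# The stopword list and sample sentences are illustrative assumptions.
import re
import numpy as np

STOPWORDS = {"the", "a", "an", "it", "he", "she", "they", "of", "and", "to", "in", "on"}

def term_counts(text):
    """Count non-stopword word occurrences in a text."""
    counts = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    return counts

def pearson_similarity(text_a, text_b):
    """Pearson correlation between the two texts' term-count vectors."""
    counts_a, counts_b = term_counts(text_a), term_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a.get(w, 0) for w in vocab]
    vec_b = [counts_b.get(w, 0) for w in vocab]
    return float(np.corrcoef(vec_a, vec_b)[0, 1])

print(pearson_similarity("The cat sat on the mat.", "A cat lay on the mat."))
```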
Comments (2)
Some of the problems you mention boil down to normalizing over document length and overall word frequency. Try tf-idf.
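A rough sketch of how that could look with scikit-learn's TfidfVectorizer (the sample documents are placeholders); tf-idf down-weights words that occur in many documents, and the cosine similarity it is usually paired with is length-normalised:

```python
# Sketch: tf-idf vectors plus cosine similarity with scikit-learn.
# The documents below are placeholders; English stopwords are removed
# by the vectoriser itself instead of a hand-rolled list.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",
    "A cat lay on a mat.",
    "Stock prices fell sharply on Monday.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

# Pairwise cosine similarities between all documents (values in [0, 1]).
print(cosine_similarity(tfidf))
```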
First and foremost, you need to specify what precisely you mean by similarity and when two documents are (more or less) similar.
If the similarity you are looking for is literal, then I would vectorise the documents using term frequencies and use the cosine similarity to compare them to each other, given that texts are inherently directional data. tf-idf and log-entropy weighting schemes may be tested depending on your use case. The edit distance is inefficient with long texts.
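To illustrate why cosine similarity suits such directional data, here is a small sketch with hypothetical term-count vectors: scaling a vector (a longer but otherwise identical text) leaves the cosine unchanged, unlike a raw correlation on padded counts.

```python
# Sketch: cosine similarity on term-frequency vectors is length-invariant,
# which is why it suits texts treated as directional data.
# The vectors are hypothetical term counts over a shared vocabulary.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

short_doc = np.array([2.0, 1.0, 0.0, 3.0])  # term counts of a short text
long_doc = 5 * short_doc                    # same text repeated five times
other_doc = np.array([0.0, 4.0, 1.0, 0.0])

print(cosine(short_doc, long_doc))   # 1.0: same direction, different length
print(cosine(short_doc, other_doc))  # much lower: different term profile
```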
If you care more about semantics, word embeddings are your ally.
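A quick sketch of that route using gensim's pretrained vectors (the model name and word lists are assumptions for illustration; n_similarity compares the mean vectors of the two word sets by cosine):

```python
# Sketch: semantic similarity via pretrained word embeddings (gensim).
# "glove-wiki-gigaword-50" is an assumed choice of pretrained model;
# any KeyedVectors model works the same way.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # returns KeyedVectors

doc_a = ["cat", "sat", "mat"]
doc_b = ["kitten", "lay", "rug"]

# Cosine similarity between the mean vectors of the two word sets.
print(vectors.n_similarity(doc_a, doc_b))
```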