What are some good ways to find the "relatedness" of two bodies of text?

Posted on 2024-08-03 15:54:36

Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences. The largest snippet is about 2k on disk. I want to compare every snippet with every other one and calculate a relatedness factor, so that I can show users related information.

What are some good ways to do this? Are there any known algorithms that work well, any GPL'd solutions, etc.?

I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.

I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.

何以笙箫默 2024-08-10 15:54:36

These articles on semantic relatedness and semantic similarity may be helpful, as may this SO question about Latent Semantic Analysis.

You could also look into Soundex for words that "sound alike" phonetically.
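To give an idea of what Soundex does, here is a minimal sketch of the classic American Soundex code in Python. The function name `soundex` is my own, and this is only one common variant of the rules (H and W are skipped, vowels separate repeated codes), so treat it as an illustration and not as a reference implementation. Note that it compares single words, so on its own it says little about whole snippets.

```python
def soundex(word: str) -> str:
    """Return a 4-character Soundex code (one common variant of the rules)."""
    # Map consonants to their Soundex digit; vowels, H, W and Y have no digit.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))
```

Two words "sound alike" under this scheme when their codes are equal.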

孤者何惧 2024-08-10 15:54:36

I've never used it, but you might want to look into Levenshtein distance.
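For reference, a small sketch of Levenshtein distance in Python using the usual dynamic-programming table, kept to two rows. The `similarity` helper that scales the distance into the range 0..1 is my own assumption about how one might turn it into a relatedness factor. Keep in mind that this works on characters, so it measures spelling closeness and not meaning.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
```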

少女七分熟 2024-08-10 15:54:36

Jeff talked about something like this on the podcast (podcast 32), as a way to find the Related questions listed on the right side here.

One big tip was to remove all common words, like "the" "and" "this" etc. This will leave you with more meaningful words to compare.
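A rough sketch of that tip in Python: strip common words, then compare what is left. The stop-word list below is a tiny made-up sample, and scoring the remaining words with Jaccard overlap is my own assumption, not something taken from the podcast.

```python
import re

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "this", "that", "of",
                        "in", "on", "to", "is", "it", "for"})


def content_words(text: str) -> set:
    """Lower-case the text, split it into words and drop the common ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(text_a: str, text_b: str) -> float:
    """Shared content words divided by all content words (0.0 to 1.0)."""
    a, b = content_words(text_a), content_words(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


print(content_words("The cat and this dog"))
print(jaccard("red car", "blue car"))
```

With only a few thousand snippets, scoring every pair this way ahead of time should be cheap enough.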

And here is a similar question: Is there an algorithm that tells the semantic similarity of two phrases?

鯉魚旗 2024-08-10 15:54:36

This is quite doable for reasonably large texts, but harder for smaller ones.

I did it once like this, and it worked pretty well:

  • Filter out all "general" words (like a, an, the, in, etc.). This removes about 10-30% of the words.
  • Count the frequencies of the remaining words and store the top x most frequent ones. These are your topics.
  • As an extra step, you can create groups of 2/3/4 consecutive words and compare them with the groups in other texts. I used this as a measure of plagiarism.
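The steps above could be sketched in Python roughly like this. The word list, the function names and the use of set overlap as the comparison are my assumptions for illustration; the original implementation is not shown here.

```python
import re
from collections import Counter

# Illustrative list of "general" words; a real list would be much longer.
GENERAL_WORDS = frozenset({"a", "an", "the", "in", "on", "of", "and", "to", "is"})


def tokens(text: str) -> list:
    """Split lower-cased text into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def topics(text: str, x: int = 5) -> set:
    """The x most frequent words after filtering out the general ones."""
    counts = Counter(w for w in tokens(text) if w not in GENERAL_WORDS)
    return {word for word, _ in counts.most_common(x)}


def word_ngrams(text: str, n: int = 3) -> set:
    """All groups of n consecutive words, as tuples."""
    words = tokens(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: set, b: set) -> float:
    """Fraction of items the two sets share (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "one two three four"
doc_b = "one two three five"
print(overlap(word_ngrams(doc_a, 2), word_ngrams(doc_b, 2)))
```

Comparing `topics(...)` sets gives a coarse "same subject" signal, while a high n-gram overlap points to copied passages.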

违心° 2024-08-10 15:54:36

See Manning and Raghavan's course notes on MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
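As a rough idea of how MinHashing works, here is a minimal Python sketch. It is not taken from the course notes or the C# version: the word-level shingles, the salted SHA-1 hash family and all function names are my assumptions. Each text is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the shingle sets.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Sets of k consecutive words; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, item: str) -> int:
    """One member of a hash family, selected by salting SHA-1 with a seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_hashes: int = 64) -> list:
    """For each hash function, keep the smallest hash value over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the sleepy cat"))
print(estimated_jaccard(a, b))
```

For a few thousand snippets, exact set overlap is affordable too; signatures mainly pay off when the collection gets much larger.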

遗弃M 2024-08-10 15:54:36

This book may be relevant.

Edit: here is a related SO question

却一份温柔 2024-08-10 15:54:36

Phonetic algorithms

The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library in SQL Server. This library lets you find the relative similarity between strings and includes numerous algorithms.

I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name

A few algorithms based on Levenshtein Distance are also available in the SimMetrics library and would probably be useful in your application.
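If you just want to see what Jaro-Winkler computes without installing anything, here is a plain Python sketch of the textbook definition. It is not the SimMetrics API; the function names, the prefix weight `p=0.1` and the 4-character prefix cap are the commonly quoted defaults, which I am assuming here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is why it tends to suit names, where the first few letters are usually right; for multi-sentence snippets a word-based measure is probably a better fit.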
