Text similarity algorithm

Posted on 2024-08-22 23:10:04

I have two subtitle files.
I need a function that tells whether they represent the same text or similar text.

Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% of the contents will be the same. The function must return TRUE (the files represent the same text).
And sometimes there are misspellings, like 1 instead of l (the digit one in place of the letter L), as here:
She 1eft the baggage.
Of course, this means the function must still return TRUE.

My comments:
The function should return the percentage of similarity between the texts - AGREE

"all the people were happy" and "all the people were not happy" - here the difference would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.

Do consider whether you want to apply Levenshtein to a whole file or just a search string - I'm not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It will be a very long string, though.

Comments (6)

枕头说它不想醒 2024-08-29 23:10:05

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance

Any result other than zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer.
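A minimal sketch of how that integer distance could be turned into the percentage the question asks for. The helper name `similarity_percent` and the choice to normalize by the longer string's length are assumptions made for illustration, not part of the answer above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: map the distance to 0..100 (100 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

With this normalization, "She 1eft the baggage." and "She left the baggage." differ by a single substitution, so they score well above 90. Running time grows with the product of the two lengths, which is the concern the question raises about very long strings.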

关于从前 2024-08-29 23:10:05

For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.

You might want to look at several implementations that are described here: Cosine Similarity
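As a rough illustration of the idea, here is a small sketch using only the Python standard library. The tokenization (lowercased `\w+` words) and the function names are assumptions chosen for the example; the implementations linked above may differ.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Counter returns 0 for terms that are missing from tf_b.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Note that this measure ignores word order, and a misspelled word such as "1eft" counts as a different term from "left". On long files a few such words barely change the score, but on short phrases they matter.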

不奢求什么 2024-08-29 23:10:05

You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to give good results for your input.
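One possible starting point in that spirit is Python's standard `difflib` module, used here as a stand-in for the diff tool. This is only a sketch: the function names and the 0.8 threshold (taken from the "80%" in the question) are assumptions.

```python
import difflib


def line_similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 ratio of matching lines, in the spirit of a diff."""
    # Ignore blank lines and surrounding whitespace before comparing.
    lines_a = [line.strip() for line in text_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in text_b.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()


def same_text(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: treat the files as the same text above the threshold."""
    return line_similarity(text_a, text_b) >= threshold
```

Comparing whole lines tolerates extra comment lines that appear in only one file, but a line containing a typo counts as a complete mismatch. Handling that would need a fuzzy comparison per line, which is the kind of improvement the answer has in mind.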

暗恋未遂 2024-08-29 23:10:05

Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.

EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep

羅雙樹 2024-08-29 23:10:05

There are many alternatives to the Levenshtein distance. For example the Jaro-Winkler distance.

The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and many other factors...

Here you can find a helpful implementation of several algorithms within one library
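For reference, a compact sketch of the Jaro-Winkler measure in plain Python. It follows the common textbook formulation with a prefix scale of 0.1 and a prefix capped at four characters; it is not taken from the library mentioned above, and the function names are chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range 0..1 (1 = identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)
```

Jaro-Winkler was designed for short strings such as names or single words, so for subtitle files it would more plausibly be applied word by word or line by line than to the whole file at once.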

玻璃人 2024-08-29 23:10:05

If you are still looking for a solution, go with S-Bert (Sentence Bert), which is a lightweight algorithm that internally uses cosine similarity.
