Text similarity algorithm
I have two subtitle files.
I need a function that tells whether they represent the same text, or similar text.
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% of the contents will be the same. The function must return TRUE (the files represent the same text).
And sometimes there are misspellings, like 1 instead of l (the digit one instead of a lowercase L), as here:
She 1eft the baggage.
Of course, this means the function must still return TRUE.
My comments:
The function should return the percentage of similarity between the texts - AGREE
"all the people were happy" and "all the people were not happy" - here that would be treated as a misspelling, so the two would be considered the same text. To be exact, the percentage the function returns will be lower, but still high enough to say the phrases are similar.
Do consider whether you want to apply Levenshtein to a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.
Comments (6)
Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Any result other than zero means the texts are not "identical". "Similar" is a measure of how far apart or how close they are. The result is an integer.
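A minimal Python sketch of the idea, in case it helps. The `levenshtein` function is the standard dynamic-programming edit distance; `similarity_percent` is a hypothetical helper (not part of the answer above) that turns the integer distance into a percentage, assuming "distance divided by the longer length" is an acceptable normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: 100 * (1 - distance / length of the longer text)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # assumption: two empty texts count as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    print(levenshtein("She left the baggage.", "She 1eft the baggage."))
    print(similarity_percent("She left the baggage.", "She 1eft the baggage."))
```

The running time grows with the product of the two lengths, which is why running it on whole files as one very long string can get slow.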
For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
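A rough sketch of cosine similarity over term-frequency vectors, using only the Python standard library. The tokenization rule (lowercase runs of letters, digits, and apostrophes) and the function names are assumptions made for illustration, not something taken from the linked implementations.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    # Assumption: a "term" is a lowercase run of letters, digits or apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a = term_frequencies(text_a)
    tf_b = term_frequencies(text_b)
    # Counter returns 0 for terms it has not seen, so missing terms add nothing.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a text with no terms matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("all the people were happy",
                            "all the people were not happy"))
```

Note that this ignores word order, and a typo such as "1eft" counts as a different term from "left", so it tolerates extra comment lines better than it tolerates misspellings.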
You're expecting too much here; it looks like you will have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improving it to give good results for your input.
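As one possible starting point in that direction, here is a sketch that uses Python's standard `difflib` module as a stand-in for diff (a swapped-in tool, not diff itself). It compares the files line by line; `line_similarity` is a hypothetical name, and dropping blank lines is an assumption.

```python
import difflib


def line_similarity(text_a: str, text_b: str) -> float:
    """Share of matching lines, from 0.0 (nothing in common) to 1.0 (same lines)."""
    # Assumption: blank lines and surrounding whitespace carry no meaning.
    lines_a = [line.strip() for line in text_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in text_b.splitlines() if line.strip()]
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()  # 2 * matched lines / total lines in both inputs


if __name__ == "__main__":
    first = "one\ntwo\nthree\nfour"
    second = "one\n[music playing]\ntwo\nthree\nfour"
    print(line_similarity(first, second))
```

Lines have to match exactly here, so a misspelled line counts as a different line; that is the kind of thing you would need to improve for your input.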
Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.
EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep
There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.
The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and more.
Here you can find a helpful implementation of several algorithms within one library
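For reference, a small self-contained sketch of Jaro-Winkler in Python. It is written from the usual textbook definition rather than taken from the library mentioned above; the prefix limit of 4 characters and the scaling factor 0.1 are the commonly used defaults and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity with a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


if __name__ == "__main__":
    print(jaro_winkler("MARTHA", "MARHTA"))
```

Jaro-Winkler is generally aimed at short strings such as names, so for subtitles it would probably make more sense per word or per line than on a whole file.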
If you are still looking for a solution, then go with S-BERT (Sentence-BERT), a lightweight approach that internally uses cosine similarity.