Looking for a better JavaScript text matching scoring system
I've been using String Score for a lot of projects. It's great for sorting lists, like names, countries, etc.
Right now, I'm working on a project where I want to match a term against a bigger set of text, not just a few words. Like, a paragraph.
Given the following two strings:
string1 = "I want to eat.";
string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";
I'd like the term eat to return string2 as higher than string1. However, string1 scores higher:
string1.score('eat');
> 0.5261904761904762
string2.score('eat');
> 0.4477777777777778
Maybe I'm wrong in thinking string2 should score higher, and I'd love to hear arguments for that logic, if that is your logic. Otherwise, any ideas on a more contextual JavaScript matching algorithm?
Comments (1)
If score is not taking repetitions into account, then only one occurrence of "eat" in string2 adds to the score, so the other occurrences of "eat" are treated as unmatched garbage that counts against the total score. Many string similarity metrics behave this way, e.g. with edit distance, the more non-matching characters there are, the lower the score, and repetitions are treated as non-matching.
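To illustrate the edit distance point, here is a minimal sketch using Levenshtein distance. This is not what String Score does internally (its algorithm isn't confirmed here). The `levenshtein` and `similarity` helpers are hypothetical names, and normalising by the longer length is just one common convention.

```javascript
// Classic dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: map the distance to a 0..1 similarity,
// assuming normalisation by the longer string's length.
function similarity(term, text) {
  const longest = Math.max(term.length, text.length);
  return longest === 0 ? 1 : 1 - levenshtein(term, text) / longest;
}

const string1 = "I want to eat.";
const string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";

// Every character that is not part of the single matched "eat" costs one edit,
// so the longer string comes out as less similar despite repeating the term.
console.log(similarity("eat", string1) > similarity("eat", string2)); // true
```

Under this kind of metric, the extra occurrences of "eat" in string2 can't help, since only one alignment of the term with the text is ever scored.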
It's not clear to me from reading the source what algo it is using, but the score variables don't seem to take into account multiple repetitions.
If you want multiple occurrences to count, then it sounds like what you want is not a string-similarity algo, but a fuzzy match algo so you can find the number of matches.
Maybe yeti witch will work for you.
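If counting occurrences is all you need, a rough sketch could look like the following. It does plain case-insensitive substring counting, which is a simpler stand-in for a real fuzzy matcher. `countOccurrences` and `escapeRegExp` are hypothetical helper names, not part of String Score or any other library.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count case-insensitive, non-overlapping
// occurrences of `term` inside `text`.
function countOccurrences(text, term) {
  if (term === "") return 0; // an empty term would match everywhere
  const matches = text.match(new RegExp(escapeRegExp(term), "gi"));
  return matches ? matches.length : 0;
}

const string1 = "I want to eat.";
const string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";

console.log(countOccurrences(string1, "eat")); // 1
console.log(countOccurrences(string2, "eat")); // 4 ("eating" counts as a substring hit)
```

Sorting by this count ranks string2 above string1. Whether "eating" should count, and whether to normalise by text length, depends on what you want "contextual" to mean.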