Looking for a better JavaScript scoring system for matching text

Posted 2024-11-29 15:56:01 · 700 characters · 5 views · 0 comments

I've been using String Score for a lot of projects. It's great for sorting lists, like names, countries, etc.

Right now, I'm working on a project where I want to match a term against a bigger set of text, not just a few words. Like, a paragraph.

Given the following two strings:

string1 = "I want to eat.";
string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";

I'd like a search for the term eat to score string2 higher than string1. However, string1 scores higher:

string1.score('eat');
> 0.5261904761904762

string2.score('eat');
> 0.4477777777777778

Maybe I'm wrong in thinking string2 should score higher, and if you think so, I'd love to hear the argument for it. Otherwise, any ideas on a more contextual JavaScript matching algorithm?


幻想少年梦 2024-12-06 15:56:01

If score does not take repetitions into account, then only one occurrence of "eat" in string2 adds to the score, and the other occurrences of "eat" are treated as unmatched garbage that counts against the total score.

Many string similarity metrics behave this way. With edit distance, for example, every non-matching character increases the distance (and so lowers the similarity), and repeated occurrences are treated as non-matching.
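To make the edit-distance point concrete, here is a minimal sketch of the textbook Levenshtein distance. It is not the algorithm String Score uses, and the `similarity` normalisation is a hypothetical helper added only for illustration. With a short term and a long text, nearly every extra character counts as an edit, so the longer string ends up looking less similar even though it contains the term more often.

```javascript
// Textbook dynamic-programming Levenshtein distance (two-row variant).
function levenshtein(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical normalisation: 1 means identical, values near 0 mean very different.
function similarity(term, text) {
  const longest = Math.max(term.length, text.length);
  return longest === 0 ? 1 : 1 - levenshtein(term, text) / longest;
}

const string1 = "I want to eat.";
const string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";

// The distance grows with every character that is not part of the single aligned "eat".
console.log(levenshtein("eat", string1), similarity("eat", string1));
console.log(levenshtein("eat", string2), similarity("eat", string2));
```

Under this measure string1 comes out as more similar to "eat" than string2, which mirrors the ordering seen in the question.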

It's not clear to me from reading the source which algorithm it uses, but the score variables

var total_character_score = 0,
  start_of_string_bonus,
  abbreviation_score,
  fuzzies=1,
  final_score;

don't seem to take into account multiple repetitions.

If you want multiple occurrences to count, then it sounds like what you want is not a string-similarity algorithm but a fuzzy-matching algorithm, so that you can find the number of matches.
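As a rough sketch of the "count the matches" idea, the snippet below counts exact, case-insensitive occurrences of the term and ranks texts by that count. It is only an illustration: `countOccurrences` and `rankByOccurrences` are hypothetical helper names, the matching is plain substring matching rather than true fuzzy matching, and nothing here comes from String Score itself.

```javascript
// Escape regex metacharacters so the term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count non-overlapping, case-insensitive occurrences of term in text.
function countOccurrences(text, term) {
  if (term === "") return 0;
  const matches = text.match(new RegExp(escapeRegExp(term), "gi"));
  return matches ? matches.length : 0;
}

// Hypothetical ranking helper: texts with more occurrences come first.
function rankByOccurrences(texts, term) {
  return texts
    .map((text) => ({ text, count: countOccurrences(text, term) }))
    .sort((a, b) => b.count - a.count);
}

const string1 = "I want to eat.";
const string2 = "I want to eat. Let's go eat. All this talk about eating is making me hungry. Ready to eat?";

console.log(countOccurrences(string1, "eat")); // 1
console.log(countOccurrences(string2, "eat")); // 4 ("eating" counts as a substring match)
console.log(rankByOccurrences([string1, string2], "eat")[0].text === string2); // true
```

Because this is substring matching, "eating" is counted as a hit. Whether that is desirable depends on the use case; matching on word boundaries or dividing the count by text length are possible variations.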

Maybe yeti witch will work for you.
