Using levenshtein to match a target string + extra text

Posted on 2024-12-03 14:14:50

I'm working on a website conversion project, and I need to match inexact strings. I'm looking at using levenshtein(), but I don't know what parameters I should set for my task.

Say I have a target string elephant. The match I would want to pull is, for example, elephant mouse:

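<?php

// The string we are trying to find.
$target = "elephant";

// Candidate strings; 'elephant mouse' is the match we want.
$data = array(
  'elephant mouse',
  'rhinoceros',
  'alligator',
  'hippopotamus',
  'rat',
);

// Print the default (unweighted) distance to each candidate.
foreach ( $data as $datum ) {
  echo "$target >> $datum == " . levenshtein($target, $datum) . "\n";
}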

And I get this result:

elephant >> elephant mouse == 6
elephant >> rhinoceros == 10
elephant >> alligator == 7
elephant >> hippopotamus == 10
elephant >> rat == 7

So while rhinoceros and hippopotamus score 10, I can't really tell elephant mouse apart from rat and alligator, which are neck-and-neck at 6 and 7. This is made-up data, but in my actual data set, words that merely happen to be closer in length get a much lower score than strings that are target + extra.

How should I configure the options of levenshtein()? I can set new integer values for the cost of insertion, replacement, and deletion. What weighting will give me what I want?

(If you can think of a better title please edit my post).

寂寞清仓 2024-12-10 14:14:51

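You should probably try to match individual words with levenshtein() rather than whole phrases, since you obviously want to count it as a good match if a phrase contains something similar to the word being searched for. In other words, split each string in $data into separate words, run levenshtein($target, $word) against each word, and then pick the smallest number. (If $target can also consist of multiple words, you will need to split that as well.)

I highly doubt that you can achieve the desired effect by tweaking the insertion/deletion/replacement costs, since Levenshtein doesn't consider individual words, only the whole string. You could try making insertions very cheap, but that would also give a very good score to something like "qwErtyLasdEdgfhdPasdxcHdfjAlkjNlkhTkjh", since it contains all the right letters in the right order.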
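The word-by-word approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name minWordDistance() is hypothetical, words are assumed to be separated by whitespace, and the default (unweighted) costs are used.

```php
<?php
// Hypothetical helper: smallest Levenshtein distance between $target
// and any single whitespace-separated word of $phrase.
function minWordDistance(string $target, string $phrase): int
{
    $words = preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        // Empty phrase (or split failure): compare against the empty string.
        return levenshtein($target, '');
    }
    $distances = array_map(
        fn(string $word): int => levenshtein($target, $word),
        $words
    );
    return min($distances);
}

$target = 'elephant';
$data = ['elephant mouse', 'rhinoceros', 'alligator', 'hippopotamus', 'rat'];
foreach ($data as $datum) {
    echo "$target >> $datum == " . minWordDistance($target, $datum) . "\n";
}
```

With this scoring, a phrase that contains the target as one of its words gets a distance of 0, so it no longer competes with unrelated single words of similar length.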

偷得浮生 2024-12-10 14:14:50

The weighting levenshtein($target, $datum, 1, 10, 10) gives me

elephant >> elephant mouse == 6
elephant >> rhinoceros == 65
elephant >> alligator == 52
elephant >> hippopotamus == 64
elephant >> rat == 60

Which works very well :) Insertion is cheap, while both replacement and deletion are expensive. This means that target + extra gets a low score, whereas strings of equal or shorter length but with different characters get a high score.
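For reference, here is a minimal sketch of how that weighting might be used to rank candidates. The function name rankCandidates() is hypothetical; the only library call is PHP's built-in levenshtein(), whose optional arguments are, in order, the insertion cost, the replacement cost, and the deletion cost.

```php
<?php
// Hypothetical helper: score every candidate with cheap insertions and
// expensive replacements/deletions, then sort the best match first.
function rankCandidates(string $target, array $candidates): array
{
    $scores = [];
    foreach ($candidates as $candidate) {
        // Costs after the two strings: insertion = 1, replacement = 10, deletion = 10.
        $scores[$candidate] = levenshtein($target, $candidate, 1, 10, 10);
    }
    asort($scores); // ascending by score, keys (candidates) preserved
    return $scores;
}

$ranked = rankCandidates('elephant', ['rat', 'alligator', 'elephant mouse']);
foreach ($ranked as $candidate => $score) {
    echo "$candidate == $score\n";
}
```

Note that the weighted distance is not symmetric: it measures the cost of turning the first argument into the second, so the target has to be passed first for insertions to be the cheap operation.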
