Using levenshtein to match target string + extra text
I'm working on a website conversion project, and I need to match inexact strings. I'm looking at using levenshtein(), but I don't know what parameters I should set for my task.
Say I have a target string "elephant". The match I would want to pull is "elephant mouse", for example:
<?php
$target = "elephant";
$data = array(
    'elephant mouse',
    'rhinoceros',
    'alligator',
    'hippopotamus',
    'rat',
);
// Print the default (unit-cost) Levenshtein distance from the target to each candidate.
foreach ( $data as $datum ) {
    echo "$target >> $datum == " . levenshtein($target, $datum) . "\n";
}
And I get the result
elephant >> elephant mouse == 6
elephant >> rhinoceros == 10
elephant >> alligator == 7
elephant >> hippopotamus == 10
elephant >> rat == 6
So while rhino and hippo are at 10, in my actual data set I couldn't really tell the difference between "elephant mouse", "rat" and "alligator", which are neck-and-neck at 6 and 7. This is bogus data, but in my data set, words that are merely closer in length to the target get a much lower score than words that are target + extra.
How should I configure the options of levenshtein()? I can set new integer values for the cost of insertion, replacement, and deletion. What weighting will give me what I want?
(If you can think of a better title, please edit my post.)
Comments (2)
You should probably try to match individual words with levenshtein() rather than entire phrases, since you apparently want to consider a phrase a good match if it contains something that resembles the word being searched for. In other words, split each $datum string into individual words, run levenshtein($target, $word) for each word, and pick the lowest number. (If $target can also consist of multiple words, you need to split that one too.)
I strongly doubt that you can achieve the desired effect by tweaking the insertion/deletion/replacement costs, because Levenshtein distance doesn't consider individual words, only the string as a whole. You could try to make insertion very cheap, but that would also give a good score to e.g. "qwErtyLasdEdgfhdPasdxcHdfjAlkjNlkhTkjh", since it contains all the right letters in the right order.
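The word-splitting idea above could be sketched roughly like this. The helper name minWordDistance() is made up for illustration, splitting on whitespace is an assumption about what counts as a word, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the original post): the smallest Levenshtein
// distance between $target and any whitespace-separated word of $phrase.
function minWordDistance(string $target, string $phrase): int
{
    // Assumption: words are separated by whitespace only.
    $words = preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        // No words at all: fall back to comparing against the empty string.
        return levenshtein($target, '');
    }
    $distances = array_map(
        fn (string $word): int => levenshtein($target, $word),
        $words
    );
    return min($distances);
}

$target = "elephant";
$data = array('elephant mouse', 'rhinoceros', 'alligator', 'hippopotamus', 'rat');
foreach ($data as $datum) {
    echo "$target >> $datum == " . minWordDistance($target, $datum) . "\n";
}
```

With this, "elephant mouse" scores 0 because one of its words equals the target exactly, while the single-word entries keep their ordinary distances.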
The weighting
levenshtein($target, $datum, 1, 10, 10)
works very well for me :) Insertion has a low cost, while both replacement and deletion are expensive. This means that target + extra has a low score, whereas strings of equal or shorter length, but with different characters, have a high cost.
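For reference, a small sketch of that weighting applied to the sample data from the question. The wrapper name weightedDistance() is hypothetical; the three extra arguments to levenshtein() are, in order, the insertion, replacement, and deletion costs. The lowercased noise string from the other comment is included as an assumption-laden extra, to show the caveat about cheap insertions.

```php
<?php
// Hypothetical wrapper: insertion costs 1, replacement and deletion cost 10 each.
function weightedDistance(string $target, string $datum): int
{
    return levenshtein($target, $datum, 1, 10, 10);
}

$target = "elephant";
$data = array(
    'elephant mouse',
    'rhinoceros',
    'alligator',
    'hippopotamus',
    'rat',
    // Noise string from the other comment, lowercased so its letters can match the target.
    strtolower('qwErtyLasdEdgfhdPasdxcHdfjAlkjNlkhTkjh'),
);
foreach ($data as $datum) {
    echo "$target >> $datum == " . weightedDistance($target, $datum) . "\n";
}
```

"elephant mouse" costs 6 here (six insertions), and any candidate that does not contain the letters of the target in order has to pay for at least one replacement or deletion, so it scores 10 or more. The noise string needs only 30 insertions and scores 30, which is lower than "rat" (at least five deletions, so 50 or more), so this weighting probably works best when such strings are unlikely in the data.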