PHP/MySQL array search algorithm

Posted 2024-07-11 21:03:47

I'd like to use PHP to search an array (or better yet, a column of a MySQL table) for a particular string. My goal is for it to return the string it finds along with the number of matching characters (in the right order), or some other measure of how reasonable each result is, so that I can decide whether to display the top result by default or offer the user the top few.
I know I can do something like

$citysearch = mysql_query("  SELECT city FROM $table WHERE city LIKE '$city' ");

but I can't figure out a way to determine how accurate it is.

The goal would be:

a) find "Milwaukee" if the search term were "milwakee" or something similar.

b) if the search term were "west", return things like "West Bend" and "Westmont".

Anyone know a good way to do this?

谁对谁错谁最难过 2024-07-18 21:03:48

More searching led me to the Levenshtein distance and then to similar_text, which proved to be the best way to do this.

similar_text("input string", "match against this", $pct_accuracy);

compares the strings and stores the similarity, as a percentage, in the third argument. The Levenshtein distance counts how many single-character deletions, insertions, or replacements it takes to get from one string to the other, and lets you weight each operation differently (e.g. you can make replacing a character cost more than deleting one). It's apparently faster but less accurate than similar_text. Other posts I've read elsewhere mention that for strings of fewer than 10000 characters there's no practical difference in speed.
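As a rough illustration of the two functions just described, here is a minimal sketch that runs both on the question's "milwakee" example. The variable names are only for illustration, and the weights passed to levenshtein() are an arbitrary choice, not a recommendation.

```php
<?php
// Compare a misspelled search term against a candidate city name.
$input = 'milwakee';
$candidate = strtolower('Milwaukee');

// similar_text() returns the number of matching characters and writes
// the similarity percentage into the third (by-reference) argument.
$matching = similar_text($input, $candidate, $percent);

// levenshtein() returns the minimum number of single-character edits.
$distance = levenshtein($input, $candidate);

// Optional weights, in this order: insertion, replacement, deletion.
// Here a replacement costs twice as much as an insertion or deletion.
$weighted = levenshtein($input, $candidate, 1, 2, 1);

printf(
    "matching=%d percent=%.1f distance=%d weighted=%d\n",
    $matching,
    $percent,
    $distance,
    $weighted
);
```

Note that the two scores point in opposite directions: a higher similar_text() percentage means a better match, while a lower Levenshtein distance does.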

I ended up using a modified version of something I found to make it work. This ends up saving the top 3 results (except in the case of an exact match).

$input = strtolower($_POST["searchcity"]); // lowercase, to match $city below
// start every slot empty so the first rows don't read undefined variables
$closest = $runner1 = $runner2 = null;
$closestid = $runner1id = $runner2id = null;
$accuracy = 0;
$runner1acc = 0;
$runner2acc = 0;
// $allcities holds the result of a query returning (id, city) rows; the
// mysql_* functions were removed in PHP 7, so use mysqli or PDO there
while ($cityarr = mysql_fetch_row($allcities)) {
  $cityname = $cityarr[1];
  $cityid = $cityarr[0];
  $city = strtolower($cityname);
  similar_text($input, $city, $tempacc); // $tempacc receives the match percentage

  // check for an exact match
  if ($tempacc == 100) {

    // closest word is this one (exact match)
    $closest = $cityname;
    $closestid = $cityid;
    $accuracy = 100;

    break;
  }

  if ($tempacc >= $accuracy) { // at least as accurate as current leader
    // shift the leader and 2nd place down one slot each
    $runner2 = $runner1;
    $runner2id = $runner1id;
    $runner2acc = $runner1acc;
    $runner1 = $closest;
    $runner1id = $closestid;
    $runner1acc = $accuracy;
    $closest  = $cityname;
    $closestid = $cityid;
    $accuracy = $tempacc;
  } elseif ($tempacc >= $runner1acc) { // new 2nd place
    $runner2 = $runner1;
    $runner2id = $runner1id;
    $runner2acc = $runner1acc;
    $runner1 = $cityname;
    $runner1id = $cityid;
    $runner1acc = $tempacc;
  } elseif ($tempacc >= $runner2acc) { // new 3rd place
    $runner2 = $cityname;
    $runner2id = $cityid;
    $runner2acc = $tempacc;
  }
}

// escape the user's input before echoing it back into the page
echo "Input word: " . htmlspecialchars($input) . "\n<BR>";
if ($accuracy == 100) {
  echo "Exact match found: $closestid $closest\n";
} elseif ($accuracy > 70) { // for high accuracies, assumes that it's correct
  echo "We think you meant $closestid $closest ($accuracy)\n";
} else {
  echo "Did you mean:<BR>";
  echo "$closestid $closest? ($accuracy)<BR>\n";
  echo "$runner1id $runner1 ($runner1acc)<BR>\n";
  echo "$runner2id $runner2 ($runner2acc)<BR>\n";
}
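The same top-few bookkeeping can also be done by scoring every candidate and sorting, which avoids the hand-shifted runner-up variables. This is only a sketch of that alternative, not the code I actually used: rankCities() and its parameters are hypothetical names, and the city ids in the example are made up. It needs PHP 7.4 or later for the arrow function.

```php
<?php
/**
 * Rank candidate names by similar_text() percentage, best first.
 *
 * @param string $input  raw search term
 * @param array  $cities map of city id => city name
 * @param int    $limit  how many results to keep
 * @return array list of ['id' => ..., 'name' => ..., 'acc' => float]
 */
function rankCities(string $input, array $cities, int $limit = 3): array
{
    $needle = strtolower(trim($input));
    $scored = [];
    foreach ($cities as $id => $name) {
        // $percent is filled in by reference with the similarity percentage.
        similar_text($needle, strtolower($name), $percent);
        $scored[] = ['id' => $id, 'name' => $name, 'acc' => $percent];
    }
    // Sort by accuracy, highest first, then keep the top $limit entries.
    usort($scored, fn($a, $b) => $b['acc'] <=> $a['acc']);
    return array_slice($scored, 0, $limit);
}

// Illustrative data only; the ids are invented.
$cities = [1 => 'Milwaukee', 2 => 'West Bend', 3 => 'Westmont'];
$top = rankCities('milwakee', $cities, 2);
foreach ($top as $row) {
    printf("%d %s (%.1f)\n", $row['id'], $row['name'], $row['acc']);
}
```

The caller can then apply the same thresholds as above to `$top[0]['acc']` to decide between showing one result and offering a short list.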

小伙你站住 2024-07-18 21:03:48

This can be very complicated, and I am not personally aware of any good 3rd party libraries although I'm sure they exist. Others may be able to suggest some canned solutions, though.

I have written something similar from scratch a few times in the past. If you go down that route, it is probably not something you'd want to do in PHP by itself as every query would involve getting all of the records and performing your calculations on them. It will almost certainly involve creating a set of index tables that meet your specifications.

For instance, you would have to come up with rules for how you imagine that "Milwaukee" could end up spelled "milwakee." My solution to this was to do vowel compression and duplication compression (not sure if these are actually search terms). So, milwaukee would be indexed as:

milwaukee
m_lw__k__
m_lw_k_
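A minimal sketch of how those three forms could be generated in PHP. indexForms() is a hypothetical helper name, and I am assuming here that "duplication compression" only collapses runs of the vowel placeholder, which is enough to reproduce the forms listed above.

```php
<?php
/**
 * Build the index forms for a name: the lowercased name, a form with each
 * vowel replaced by "_", and a form with runs of "_" collapsed to one.
 */
function indexForms(string $name): array
{
    $plain = strtolower(trim($name));
    // Vowel compression: every vowel becomes a placeholder.
    $vowelCompressed = preg_replace('/[aeiou]/', '_', $plain);
    // Duplication compression (assumed): collapse repeated placeholders.
    $duplicateCompressed = preg_replace('/_+/', '_', $vowelCompressed);
    return [$plain, $vowelCompressed, $duplicateCompressed];
}

print_r(indexForms('Milwaukee'));
print_r(indexForms('milwakee'));
```

The same function would be run once per city when building the index table and once per search term at query time.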

When the search query came in for "milwaukee", I would run the same process on the text input, and then run a search on the index table for:

SELECT cityId,
       COUNT(*) AS matches
  FROM myCityIndexTable
 WHERE term IN ('milwaukee', 'm_lw__k__', 'm_lw_k_')
 GROUP BY cityId
 ORDER BY matches DESC

When the search query came in for "milwakee", I would run the same process on the text input, and then run a search on the index table for:

SELECT cityId,
       COUNT(*) AS matches
  FROM myCityIndexTable
 WHERE term IN ('milwakee', 'm_lw_k__', 'm_lw_k_')
 GROUP BY cityId
 ORDER BY matches DESC

In the case of Milwaukee (spelled correctly), it would return "3" for the count.

In the case of Milwakee (spelled incorrectly), it would return "1" for the count: only the fully compressed form m_lw_k_ is in the index, because neither the raw term nor m_lw_k__ (which has a single vowel in the middle) matches a stored form.

If you sort the results by the count, the cities sharing the most forms with the query come first, and the count itself tells you how close the query was: a correctly spelled "Milwaukee" scores higher than "Milwakee."
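To make the counting step concrete without a database, here is a small in-memory sketch. countMatches() is a hypothetical name, the array stands in for myCityIndexTable, the city ids are invented, and the query terms are the forms the compression produces for each input.

```php
<?php
/**
 * Count, per city, how many index terms match any of the query's forms,
 * and return the counts sorted highest first. This mimics a grouped
 * COUNT(*) over the index table using a plain array.
 */
function countMatches(array $indexRows, array $queryTerms): array
{
    $counts = [];
    foreach ($indexRows as [$cityId, $term]) {
        if (in_array($term, $queryTerms, true)) {
            $counts[$cityId] = ($counts[$cityId] ?? 0) + 1;
        }
    }
    arsort($counts); // highest count first; keys (city ids) are preserved
    return $counts;
}

// Hypothetical index rows: [cityId, term].
$indexRows = [
    [1, 'milwaukee'], [1, 'm_lw__k__'], [1, 'm_lw_k_'],
    [2, 'westmont'],  [2, 'w_stm_nt'],
];

$exact = countMatches($indexRows, ['milwaukee', 'm_lw__k__', 'm_lw_k_']);
$typo  = countMatches($indexRows, ['milwakee', 'm_lw_k__', 'm_lw_k_']);
var_dump($exact, $typo);
```

In a real system the lookup would stay in SQL so the index on the term column can be used; the array version is only meant to show what the count measures.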

If you want to build this system in a generic way (as hinted by your use of $table in the query) then you'd probably need another mapping table somewhere in there to map your terms to the appropriate table.

I'm not suggesting this is the best (or even a good) way to go about this, just something I've done in the past that might prove useful to you if you plan to try and do this without a third party solution.
