Sorting a PHP array by relevance

Posted on 2025-01-29 17:01:15 · 713 words

I have a PHP text array, which holds values like "Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook, etc...."

I need to run through each array item and show the matching results in order of relevance.
For example, if the user searches for the term "Blue", then the 3rd item "Blue", which is a perfect match, should be listed at the top, followed by the 2nd item "Blue Pen", then the 1st item "Blue Pencil", and finally "Blue Notebook". All non-Blue items are discarded.

I tried using the sort and rsort functions on PHP arrays (both before and after pulling out the matching Blue items), but they simply sort in alphabetical and reverse-alphabetical order. There is no relevance matching in there.
For example, sort($array) returns the following:

Blue
Blue Notebook
Blue Pen
Blue Pencil

which is not the expected "relevant" order.

Also, the levenshtein function does not fit: it only works on strings of at most 255 characters (a limit that applies to PHP versions before 8.0), and my strings can be longer.

To draw a parallel, MySQL has the MATCH ... AGAINST clause, which does this job:

SELECT * , MATCH (col1, col2) AGAINST ('some words' IN NATURAL LANGUAGE MODE)

I am looking for something similar in PHP. Any pointers, or a user-defined function I could write, would be appreciated.

Comments (3)

孤寂小茶 2025-02-05 17:01:15

Fulltext search is a complex subject.

A combination of array_filter and usort with levenshtein will result in the answer you want for this particular query, but you will find that it quickly falls apart for other queries:


$data = explode(', ', 'Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook');
$query = 'Blue';

// Keep only the items that contain the query
// (case-sensitive substring match; str_contains() requires PHP 8.0+)
$data = array_filter($data, fn ($s) => str_contains($s, $query));

// Sort by the Levenshtein distance from the $query
usort($data, fn($a, $b) => levenshtein($query, $a) - levenshtein($query, $b));

var_dump($data);

// Will print:
// array(4) {
//   [0]=>
//   string(4) "Blue"
//   [1]=>
//   string(8) "Blue Pen"
//   [2]=>
//   string(11) "Blue Pencil"
//   [3]=>
//   string(13) "Blue Notebook"
// }

Think about:

  • What happens if a user uses different capitalization? (The case-sensitive str_contains match won't work.)
  • What if a user is looking for "a blue notebook"? (You'd need some kind of string tokenization.)
  • Do you want to remove or ignore certain words, such as "the" or "a"?
  • What happens if you have thousands of strings to look through? This solution recomputes the Levenshtein distance on every comparison, so it won't be very performant.
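
A few of the points above (capitalization, tokenization, ignored words) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a replacement for a real search engine: the helper names tokenize and rankByRelevance, the stop-word list, and the scoring rule (more matched query words first, then fewer extra words, then shorter strings) are all made up for this example. It also assumes ASCII text, since strtolower and \W are not Unicode-aware here.

```php
<?php
// Hypothetical helper: split a string into lowercase words and drop stop words.
// The stop-word list is an assumption chosen for this example.
function tokenize(string $text, array $stopWords = ['a', 'an', 'the']): array
{
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_diff($words, $stopWords)));
}

// Hypothetical helper: return the items that share at least one word with the
// query, ordered by an assumed relevance rule.
function rankByRelevance(array $items, string $query): array
{
    $queryWords = tokenize($query);
    $scored = [];

    // Score each item exactly once, instead of inside the sort comparator.
    foreach ($items as $item) {
        $itemWords = tokenize($item);
        $matched = count(array_intersect($itemWords, $queryWords));
        if ($matched === 0) {
            continue; // discard items that share no word with the query
        }
        $scored[] = [
            'item'    => $item,
            'matched' => $matched,                     // higher is better
            'extra'   => count($itemWords) - $matched, // lower is better
            'length'  => strlen($item),                // tie-breaker, lower is better
        ];
    }

    // Arrays of equal size compare element by element, so this sorts by
    // matched (descending), then extra (ascending), then length (ascending).
    usort($scored, fn ($x, $y) =>
        [$y['matched'], $x['extra'], $x['length']]
            <=> [$x['matched'], $y['extra'], $y['length']]
    );

    return array_column($scored, 'item');
}

$items = explode(', ', 'Blue Pencil, Blue Pen, Blue, Red Pencil, Red Ink, Red Pen, Blue Notebook');

print_r(rankByRelevance($items, 'a blue notebook'));
// Order: Blue Notebook, Blue, Blue Pen, Blue Pencil
```

Scoring every item once up front keeps the comparator cheap, which helps somewhat with larger lists, but every search still scans the whole array.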

You may eventually find that you end up reaching for a true search engine, such as Apache Lucene or Elasticsearch.

书间行客 2025-02-05 17:01:15

$input = [
"Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook"
];

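// Keep only the items that start with "blue", ignoring case.
// preg_grep() filters but does not reorder, and it preserves the original keys.
$result = preg_grep("/^blue/i", $input);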
print_r($result);
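
The snippet above only filters, so the matches come back in their original order. A minimal sketch of one way to order them as well, assuming that among prefix matches a shorter string counts as a closer match (that rule and the $query variable are assumptions for illustration, not part of the answer above):

```php
<?php
$input = [
    "Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook"
];
$query = 'blue'; // assumed to come from the user

// preg_quote() escapes any regex metacharacters in the user's input.
$pattern = '/^' . preg_quote($query, '/') . '/i';
$result = preg_grep($pattern, $input);

// Assumption: the shorter the matching string, the closer it is to the query.
// usort() also re-indexes the array from 0.
usort($result, fn ($a, $b) => strlen($a) <=> strlen($b));

print_r($result);
// Order: Blue, Blue Pen, Blue Pencil, Blue Notebook
```

Sorting by length happens to reproduce the order asked for in the question, but it is a crude stand-in for relevance and will not generalize to multi-word queries.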

朕就是辣么酷 2025-02-05 17:01:15

You need the asort function:

$data = array("Blue Pencil", "Blue Pen", "Blue", "Red Pencil", "Red Ink", "Red Pen", "Blue Notebook");
asort($data);
print_r($data);

Output

Array ( [2] => Blue [6] => Blue Notebook [1] => Blue Pen [0] => Blue Pencil [4] => Red Ink [5] => Red Pen [3] => Red Pencil )