Keyword highlighting is highlighting the highlights in PHP preg_replace()

Posted on 2025-01-01 08:12:13

I have a small search engine doing its thing, and want to highlight the results. I thought I had it all worked out till a set of keywords I used today blew it out of the water.

The issue is that preg_replace() is looping through the replacements, and later replacements are replacing the text I inserted into previous ones. Confused? Here is my pseudo function:

public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    foreach ($keywords as $kw) {
        $find[] = '/' . str_replace("/", "\/", $kw) . '/iu';
        $replace[] = $begin . "\$0" . $end;
    }
    return preg_replace($find, $replace, $data);
}

OK, so it works when searching for "fred" and "dagg", but sadly, when searching for "class", "lass" and "as", it hits a real problem when highlighting "Joseph's Class Group":

Joseph's <span class="keywordHighlight">Cl</span><span <span c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span>="keywordHighlight">c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span></span>="keywordHighlight">ass</span> Group

How would I get the later replacements to work only on the non-HTML parts, while still allowing the whole match to be tagged? For example, if I were searching for "cla" and "lass", I would want "class" to be highlighted in full, because both search terms are in it even though they overlap. The markup inserted for the first match also contains the word "class" (in the class attribute), and that should not be highlighted.

Sigh.

I would rather use a PHP solution than a jQuery (or any client-side) one.

Note: I have tried to sort the keywords by length, doing the long ones first, but that means the cross-over searches do not highlight, meaning with "cla" and "lass" only part of the word "class" would highlight, and it still murdered the replacement tags :(

EDIT: I have messed about, starting with pencil & paper, and wild ramblings, and come up with some very unglamorous code to solve this issue. It's not great, so suggestions to trim/speed this up would still be greatly appreciated :)

public function highlightKeywords ($data, $keywords = array()) {
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    $hits = array();
    foreach ($keywords as $kw) {
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $hits[] = array($pos, $pos + strlen($kw));
            $offset = $pos + 1;
        }
    }
    if ($hits) {
        usort($hits, function($a, $b) {
            if ($a[0] == $b[0]) {
                return 0;
            }
            return ($a[0] < $b[0]) ? -1 : 1;
        });
        $thisthat = array(0 => $begin, 1 => $end);
        for ($i = 0; $i < count($hits); $i++) {
            foreach ($thisthat as $key => $val) {
                $pos = $hits[$i][$key];
                $data = substr($data, 0, $pos) . $val . substr($data, $pos);
                for ($j = 0; $j < count($hits); $j++) {
                    if ($hits[$j][0] >= $pos) {
                        $hits[$j][0] += strlen($val);
                    }
                    if ($hits[$j][1] >= $pos) {
                        $hits[$j][1] += strlen($val);
                    }
                }
            }
        }
    }
    return $data;
}


Comments (3)

吹泡泡o 2025-01-08 08:12:13

I've used the following to address this problem:

<?php

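$protected_matches = array();

// First pass: stash each match and return a NUL-wrapped numeric placeholder.
function protect($matches) {
    global $protected_matches;
    // array_push() returns the new element count, i.e. a 1-based index
    return "\0" . array_push($protected_matches, $matches[0]) . "\0";
}

// Second pass: swap each placeholder for the stored match wrapped in a span.
function restore($matches) {
    global $protected_matches;
    return '<span class="keywordHighlight">' .
              $protected_matches[$matches[1] - 1] . '</span>';
}

// $patterns and $target_string are defined elsewhere
$result = preg_replace_callback('/\x00(\d+)\x00/', 'restore',
    preg_replace_callback($patterns, 'protect', $target_string));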

The first preg_replace_callback pulls out all matches and replaces them with nul-byte-wrapped placeholders; the second pass replaces them with the span tags.

Edit: Forgot to mention that $patterns was sorted by string length, longest to shortest.
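
For reference, $patterns might be built along the lines below. This is only a sketch: the helper name buildPatterns is hypothetical (it is not part of the answer above), and it assumes the keywords are plain strings that should be matched literally and case-insensitively.

```php
<?php

// Hypothetical helper: turn raw keywords into delimiter-safe,
// case-insensitive patterns, ordered longest keyword first.
function buildPatterns(array $keywords) {
    // Drop empty strings, which would otherwise match at every position.
    $keywords = array_filter($keywords, function ($kw) {
        return $kw !== '';
    });

    // Longest first, so "class" is protected before "lass" or "as".
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $patterns = array();
    foreach ($keywords as $kw) {
        // preg_quote() escapes regex metacharacters and the "/" delimiter.
        $patterns[] = '/' . preg_quote($kw, '/') . '/iu';
    }
    return $patterns;
}

$patterns = buildPatterns(array('as', 'class', 'lass'));
// $patterns is now array('/class/iu', '/lass/iu', '/as/iu')
```

Using preg_quote() rather than escaping only the slash means a keyword such as "c++" or "a.b" is treated as literal text instead of as a regular expression.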

Edit: another solution:

<?php
        function highlightKeywords($data, $keywords = array(),
            $prefix = '<span class="hilite">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        $start = array();
        $end   = array();

        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $start[] = $pos;
                $end[]   = $offset = $pos + $length;
            }
        }

        if (!count($start)) return $data;

        sort($start);
        sort($end);

        // Merge and sort start/end using negative values to identify endpoints
        $zipper = array();
        $i = 0;
        $n = count($end);

        while ($i < $n)
            $zipper[] = count($start) && $start[0] <= $end[$i]
                ? array_shift($start)
                : -$end[$i++];

        // EXAMPLE:
        // [ 9, 10, -14, -14, 81, 82, 86, -86, -86, -90, 99, -103 ]
        // take 9, discard 10, take -14, take -14, create pair,
        // take 81, discard 82, discard 86, take -86, take -86, take -90, create pair
        // take 99, take -103, create pair
        // result: [9,14], [81,90], [99,103]

        // Generate non-overlapping start/end pairs
        $spans = array();
        $a = array_shift($zipper);
        $z = null;
        // compare against null so that an offset of 0 does not end the loop early
        while (($x = array_shift($zipper)) !== null) {
            if ($x < 0)
                $z = $x;
            else if ($z) {
                $spans[] = array($a, -$z);
                $a = $x;
                $z = null;
            }
        }
        $spans[] = array($a, -$z);

        // Insert the prefix/suffix in the start/end locations
        $n = count($spans);
        while ($n--)
            $data = substr($data, 0, $spans[$n][0])
            . $prefix
            . substr($data, $spans[$n][0], $spans[$n][1] - $spans[$n][0])
            . $suffix
            . substr($data, $spans[$n][1]);

        return $data;
    }

挽梦忆笙歌 2025-01-08 08:12:13

I had to revisit this subject myself today and wrote a better version of the above, which I'll include here. It's the same idea, only easier to read, and it should perform better since it collects the pieces in an array instead of repeatedly concatenating strings.

<?php

function highlight_range_sort($a, $b) {
    $A = abs($a);
    $B = abs($b);
    if ($A == $B)
        // same offset: a start (positive) sorts before an end (negative)
        return $a < $b ? 1 : ($a > $b ? -1 : 0);
    else
        return $A < $B ? -1 : 1;
}

function highlightKeywords($data, $keywords = array(),
       $prefix = '<span class="highlight">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        // this will contain offset ranges to be highlighted
        // positive offset indicates start
        // negative offset indicates end
        $ranges = array();

        // find start/end offsets for each keyword
        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $ranges[] = $pos;
                $ranges[] = -($offset = $pos + $length);
            }
        }

        if (!count($ranges))
            return $data;

        // sort offsets by abs(); on ties, starts (positive) come before ends
        usort($ranges, 'highlight_range_sort');

        // combine overlapping ranges by keeping lesser
        // positive and negative numbers
        $i = 0;
        while ($i < count($ranges) - 1) {
            if ($ranges[$i] < 0) {
                if ($ranges[$i + 1] < 0)
                    array_splice($ranges, $i, 1);
                else
                    $i++;
            } else if ($ranges[$i + 1] < 0)
                $i++;
            else
                array_splice($ranges, $i + 1, 1);
        }

        // create substrings
        $ranges[] = strlen($data);
        $substrings = array(substr($data, 0, $ranges[0]));
        for ($i = 0, $n = count($ranges) - 1; $i < $n; $i += 2) {
            // prefix + highlighted_text + suffix + regular_text
            $substrings[] = $prefix;
            $substrings[] = substr($data, $ranges[$i], -$ranges[$i + 1] - $ranges[$i]);
            $substrings[] = $suffix;
            $substrings[] = substr($data, -$ranges[$i + 1], $ranges[$i + 2] + $ranges[$i + 1]);
        }

        // join and return substrings
        return implode('', $substrings);
}

// Example usage:
echo highlightKeywords("This is a test.\n", array("is"), '(', ')');
echo highlightKeywords("Classes are as hard as they say.\n", array("as", "class"), '(', ')');
// Output:
// Th(is) (is) a test.
// (Class)es are (as) hard (as) they say.
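
One case worth checking with this kind of pairing is a chain of overlaps, e.g. the keywords "abc", "cde" and "efg" against "abcdefg", where the middle range bridges the outer two. By my reading, the function above returns (abc)d(efg) for that input. A possible alternative, sketched below, keeps every hit as a [start, end) pair, sorts the pairs by start, and extends the previous range whenever the next one begins at or before its end. The names mergeRanges and highlightMerged are made up for illustration; offsets are byte offsets and matching relies on stripos(), so the same single-byte assumptions apply as above.

```php
<?php

// Hypothetical helper: merge overlapping or touching [start, end) ranges.
function mergeRanges(array $ranges) {
    usort($ranges, function ($a, $b) {
        return $a[0] - $b[0];
    });
    $merged = array();
    foreach ($ranges as $r) {
        $last = count($merged) - 1;
        if ($last >= 0 && $r[0] <= $merged[$last][1]) {
            // Overlaps (or touches) the previous range: extend it.
            $merged[$last][1] = max($merged[$last][1], $r[1]);
        } else {
            $merged[] = $r;
        }
    }
    return $merged;
}

// Hypothetical variant of highlightKeywords() built on mergeRanges().
function highlightMerged($data, array $keywords,
        $prefix = '<span class="highlight">', $suffix = '</span>') {
    $ranges = array();
    foreach ($keywords as $kw) {
        if ($kw === '') {
            continue; // skip empty keywords, they cannot be highlighted
        }
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $ranges[] = array($pos, $pos + strlen($kw));
            $offset = $pos + 1; // also find hits that overlap themselves
        }
    }

    // Copy plain text and highlighted text alternately, left to right.
    $out = '';
    $cursor = 0;
    foreach (mergeRanges($ranges) as $r) {
        $out .= substr($data, $cursor, $r[0] - $cursor);
        $out .= $prefix . substr($data, $r[0], $r[1] - $r[0]) . $suffix;
        $cursor = $r[1];
    }
    return $out . substr($data, $cursor);
}

echo highlightMerged("abcdefg\n", array('abc', 'cde', 'efg'), '(', ')');
echo highlightMerged("Joseph's Class Group\n", array('cla', 'lass'), '(', ')');
// Output:
// (abcdefg)
// Joseph's (Class) Group
```

Because the ranges are merged before any text is inserted, the inserted markup is never searched, which is the property the original question was after.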

相思碎 2025-01-08 08:12:13

OP - something that's not clear in the question is whether $data can contain HTML from the get-go. Can you clarify this?

If $data can contain HTML itself, you are getting into the realm of attempting to parse a non-regular language with a regular-language parser, and that's not going to work out well.

In such a case, I would suggest loading the $data HTML into a PHP DOMDocument, getting hold of all of the textNodes and running one of the other perfectly good answers on the contents of each text block in turn.
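
The DOMDocument route could look roughly like the sketch below. It is an illustration under several assumptions: $html is a UTF-8 HTML fragment, highlightHtml is a hypothetical name, and the matching is a single case-insensitive regex rather than one of the functions from the other answers, so overlapping keywords are not merged here. Text inside script or style elements would be processed too, so real use would need to skip those.

```php
<?php

// Hypothetical helper: highlight keywords in text nodes only,
// leaving tag names and attribute values untouched.
function highlightHtml($html, array $keywords, $class = 'keywordHighlight') {
    $keywords = array_filter($keywords, function ($kw) {
        return $kw !== '';
    });
    if (!$keywords) {
        return $html;
    }
    // Prefer longer keywords when several could match at one position.
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $alts = array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords);
    $regex = '/(' . implode('|', $alts) . ')/iu';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    // The XML declaration is a common way to hint UTF-8 to loadHTML();
    // the wrapper div gives us one node to read the result back from.
    $doc->loadHTML('<?xml encoding="utf-8" ?><div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="hl-root"]')->item(0);

    // Snapshot the text nodes first, since the tree changes below.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $node) {
        // With one capture group, odd indexes hold the matched keywords.
        $parts = preg_split($regex, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue;
        }
        $frag = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', $class);
                $span->appendChild($doc->createTextNode($part));
                $frag->appendChild($span);
            } else {
                $frag->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo highlightHtml('<p class="class">My class</p>', array('class'));
```

In this example the class attribute on the paragraph is left alone and only the word in the text content is wrapped, because attribute values never appear as text nodes.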
