Inserting multiple links into text, ignoring matches that happen to be inserted

Posted 2024-09-19 17:50:20

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instance of each glossary term with a tooltip link.

I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.

I guess the bottom line is, I need to ignore text if it:

  • Appears within the < and > of any HTML tag, or
  • Appears within the text of an <a></a> tag.

Here's what I have so far. I was hoping someone out there would have a clever solution.

function insertGlossaryLinks($html)
{
    // Get glossary terms from database, once per request
    static $terms;
    if (is_null($terms)) {
        $query = Doctrine_Query::create()
            ->select('gt.title, gt.alternate_spellings, gt.description')
            ->from('GlossaryTerm gt');
        $glossaryTerms = $query->rows();

        // Create whole list in $terms, including alternate spellings
        $terms = array();
        foreach ($glossaryTerms as $glossaryTerm) {

            // Initialize with title
            $term = array(
                'wordsHtml' => array(
                    h(trim($glossaryTerm['title']))
                    ),
                'descriptionHtml' => h($glossaryTerm['description'])
                );

            // Add alternate spellings
            foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
                $alternateSpelling = h(trim($alternateSpelling));
                if (empty($alternateSpelling)) {
                    continue;
                }
                $term['wordsHtml'][] = $alternateSpelling;
            }

            $terms[] = $term;
        }
    }

    // Do replacements on this HTML
    $newHtml = $html;
    foreach ($terms as $term) {
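        // Closure instead of create_function(), which was removed in PHP 8
        $callback = function ($m) use ($term) {
            return '<a href="javascript:void(0);" class="glossary-term" title="'.$term['descriptionHtml'].'"><span>'.$m[0].'</span></a>';
        };
        // Escape regex metacharacters, including the "/" delimiter
        $term['wordsHtmlPreg'] = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $term['wordsHtml']);
        $pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';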
        $newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
    }

    return $newHtml;
}
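
To make the failure concrete, here is a minimal reproduction. The two terms and their descriptions are made up for illustration (the real ones come from the GlossaryTerm table), and plain preg_replace stands in for the callback. The second term matches inside the title attribute that the first iteration just inserted, so the remaining body text never gets its link and the markup breaks.

```php
<?php
// Hypothetical terms for illustration only
$terms = array(
    array('word' => 'API', 'description' => 'An interface that a cache can sit behind'),
    array('word' => 'cache', 'description' => 'A store for reusable results'),
);

$html = '<p>The API uses a cache.</p>';

foreach ($terms as $term) {
    // $0 is the matched text; the description lands in the title attribute
    $link = '<a class="glossary-term" title="' . $term['description'] . '"><span>$0</span></a>';
    $pattern = '/\b' . preg_quote($term['word'], '/') . '\b/i';
    // Limit of 1: only the first match in the whole string is replaced
    $html = preg_replace($pattern, $link, $html, 1);
}

// The first "cache" the regex sees is the one inside the first link's title,
// so a second <a> ends up nested inside that attribute.
echo $html, "\n";
```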

Comments (2)

夏了南城 2024-09-26 17:50:20

Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
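
As a rough sketch of the parser route, the example below uses PHP's bundled DOMDocument and DOMXPath rather than the Simple HTML DOM library linked above, so it is a swapped-in technique and not that library's API. The function name, the `glossary-root` wrapper id, and the shape of `$terms` (plain, unescaped `words` and `description`) are assumptions made for illustration. Only text nodes are searched, so tag names and attributes can never match, and the `ancestor::a` test skips text that is already inside a link, including links added for earlier terms.

```php
<?php
/**
 * Sketch: link the first occurrence of each term, touching text nodes only.
 * $terms is a hypothetical list of ['words' => string[], 'description' => string]
 * holding plain text; the DOM serializer does the escaping on output.
 */
function insertGlossaryLinksDom($html, array $terms)
{
    $doc = new DOMDocument();
    // Wrap the fragment so it can be found again and so the charset is declared
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body><div id="glossary-root">' . $html . '</div></body></html>';
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($wrapped);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);

    foreach ($terms as $term) {
        $quoted = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $term['words']);
        $pattern = '/\b(' . implode('|', $quoted) . ')\b/iu';

        // Re-query for every term so links inserted earlier are excluded
        $textNodes = $xpath->query(
            './/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]',
            $root
        );
        foreach ($textNodes as $node) {
            $text = $node->nodeValue;
            if (!preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
                continue;
            }
            list($match, $offset) = $m[0]; // byte offset into $text

            $link = $doc->createElement('a');
            $link->setAttribute('href', 'javascript:void(0);');
            $link->setAttribute('class', 'glossary-term');
            $link->setAttribute('title', $term['description']);
            $span = $doc->createElement('span');
            $span->appendChild($doc->createTextNode($match));
            $link->appendChild($span);

            // Replace the text node with: text before, the link, text after
            $parent = $node->parentNode;
            $parent->insertBefore($doc->createTextNode(substr($text, 0, $offset)), $node);
            $parent->insertBefore($link, $node);
            $parent->insertBefore($doc->createTextNode(substr($text, $offset + strlen($match))), $node);
            $parent->removeChild($node);
            break; // first occurrence only
        }
    }

    // Serialize only the children of the wrapper, not the scaffolding
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

One trade-off to be aware of: the parser normalizes the markup it is given, so the output may differ cosmetically from the input even where no term was linked.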

你的呼吸 2024-09-26 17:50:20

I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.

It's working great!
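
The answer does not include its code, so the following is only one possible reading of the placeholder idea, not the poster's implementation. The function name and token format are assumptions; `wordsHtml` and `descriptionHtml` follow the question's already-escaped term structure. Before each term is processed, complete `<a>...</a>` elements and any remaining tags are swapped for tokens built solely from control characters, so no word-based pattern can match inside them. The sketch assumes the input HTML does not itself contain the bytes \x01 to \x04.

```php
<?php
/**
 * Sketch of the placeholder approach.
 * $terms: list of ['wordsHtml' => string[], 'descriptionHtml' => string],
 * already HTML-escaped as in the question.
 */
function insertGlossaryLinksWithPlaceholders($html, array $terms)
{
    $stash = array();

    // Swap a piece of markup for an opaque token and remember the original
    $protect = function ($m) use (&$stash) {
        // Index encoded with \x03/\x04 only, so the token contains no word characters
        $token = "\x01" . strtr(decbin(count($stash)), '01', "\x03\x04") . "\x02";
        $stash[$token] = $m[0];
        return $token;
    };
    // Whole <a>...</a> elements first, then any other tag
    $protectPattern = '~<a\b[^>]*>.*?</a>|<[^>]+>~is';

    foreach ($terms as $term) {
        // Hide existing markup, including links added for earlier terms
        $html = preg_replace_callback($protectPattern, $protect, $html);

        $quoted = array_map(function ($word) {
            return preg_quote($word, '/');
        }, $term['wordsHtml']);
        $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

        $html = preg_replace_callback($pattern, function ($m) use ($term) {
            return '<a href="javascript:void(0);" class="glossary-term" title="'
                . $term['descriptionHtml'] . '"><span>' . $m[0] . '</span></a>';
        }, $html, 1);
    }

    // Restore newest first, so a token caught inside later-stashed markup
    // is exposed before its own turn comes
    foreach (array_reverse($stash, true) as $token => $markup) {
        $html = str_replace($token, $markup, $html);
    }
    return $html;
}
```

This keeps the regex-based flow from the question, so it inherits the usual limits of matching HTML with patterns (for example, a `>` inside an attribute value would confuse the tag pattern).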
