PHP preg_replace - don't match within h1 tags
I am using preg_replace to add a link to keywords if they are found within a long HTML string. I don't want to add a link if the keyword is found within h1 tags or strong tags.
The regex below nearly works and basically says (I think): if the keyword is not immediately wrapped by either an h1 tag or a strong tag, then replace it with the keyword that was matched, as a bolded link to Google.
$result = preg_replace('%(?!<h1>)(?!<strong>)\b(bobs widgets)\b(?!<\/strong>)(?!<\/h1>)%i','<a href="http://www.google.com"><strong>$1</strong></a>', $result, -1);
(The reason I don't want to match inside strong tags is that I am looping through a lot of keywords, so I don't want to link an already linked keyword on subsequent passes.)
The above works fine and won't match:
<h1>bobs widgets</h1>
It will however match the keyword in the following text, because the h1 tag isn't immediately either side of the keyword:
<h1>Here are bobs widgets for sale</h1>
I need to make the spaces either side optional and have tried adding \s* but that doesn't get me anywhere. I'd be very grateful for a push in the right direction here.
Comments (2)
Regular expressions are the wrong tool for this job. This has been discussed many times on Stack Overflow (such as the most famous thread on the site).
What you need is an HTML parser, such as the Simple HTML DOM Parser. Do yourself a favour and use something like this from the start. Imagine what's going to happen when you run into an <h1> where someone has added an attribute, or perhaps someone has improperly closed the tags, so you have a mixed up order on a </strong> and a </h1>. Getting things like that to work with a regular expression is not worth the trouble, and sometimes isn't even possible.
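As a rough illustration of the parser route, here is a minimal sketch that uses PHP's bundled DOMDocument and DOMXPath classes rather than the Simple HTML DOM Parser named above. The function name `linkKeyword`, the wrapper `div` with id `root`, and the sample markup are all assumptions made for the example; it also skips text already inside an `a` tag, which the question did not ask for but which seems in the same spirit.

```php
<?php
// Hypothetical helper (not from the original thread): link a keyword wherever
// it appears in a text node that is NOT inside <h1>, <strong>, or <a>.
function linkKeyword(string $html, string $keyword, string $url): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    // The XML declaration is a common trick to make libxml read the input as
    // UTF-8; the wrapper div lets us extract just our fragment afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//text()[not(ancestor::h1 or ancestor::strong or ancestor::a)]'
    );
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    foreach (iterator_to_array($nodes) as $node) {
        // Odd-indexed parts are the captured keyword, even-indexed parts
        // are the surrounding text.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $strong = $doc->createElement('strong');
                $strong->appendChild($doc->createTextNode($part));
                $a->appendChild($strong);
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialise only the children of the wrapper div.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = '<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>';
echo linkKeyword($html, 'bobs widgets', 'http://www.google.com');
```

Because the inserted link wraps the keyword in `strong`, running the function again for the same keyword should leave the earlier link alone, which is the behaviour the question wanted for repeated passes.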
... just remember that eventually this approach will lead to sadness, and you'll need to start looking for a better approach. One way is to use 'tidy' to fix up your HTML into parseable XML; PHP then offers a few XML manipulation APIs to work with the data.
Here's an answer anyway.
You can add some wildcards instead of the word boundaries. Something like this should do the trick:
Then, add some more replacement markers to keep the remainder of your text in the output:
Now hit save and hide behind the sofa ;)
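The code snippets from this answer did not survive in this copy, and the following is not a reconstruction of them. It is a sketch of a different regex technique, PCRE's `(*SKIP)(*FAIL)` backtracking verbs, which consume a whole `h1` or `strong` element and discard it before trying the keyword. The function name `linkKeywordRegex` and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper (not the answerer's code): skip whole <h1>/<strong>
// elements, then link the keyword wherever else it appears.
function linkKeywordRegex(string $html, string $keyword, string $url): string
{
    $pattern = '%<(h1|strong)\b[^>]*>.*?</\1>(*SKIP)(*FAIL)'  // consume and reject protected elements
             . '|\b(' . preg_quote($keyword, '%') . ')\b%is'; // otherwise match the keyword (group 2)

    // preg_replace_callback avoids having to escape "$" in the URL.
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            return '<a href="' . htmlspecialchars($url) . '"><strong>' . $m[2] . '</strong></a>';
        },
        $html
    );
}

$html = '<h1>Here are bobs widgets for sale</h1><p>Buy bobs widgets today</p>';
echo linkKeywordRegex($html, 'bobs widgets', 'http://www.google.com');
```

This handles attributes on the opening tag, but it still shares the weaknesses described above: nested or wrongly closed tags can confuse it, and a keyword that appears inside an attribute value (an `alt` or `href`, say) would be linked as well.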