RegEx expert needed
I'm trying to write a script that parses a block of HTML and matches words against a given glossary of terms. If it finds a match, it wraps the term in <a class="tooltip"></a> and provides a definition.
It's working okay -- except for two major shortcomings:
- It matches text that is in attributes
- It matches text that is already in an <a> tag, creating a nested link.
Is there any way to have my regular expression match only words that are not in attributes and not in <a> tags?
Here's the code I'm using, in case it's relevant:
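foreach (Glossary::map() as $term => $def) {
    // Escape the term so any regex metacharacters in it are matched literally.
    $search[] = '/\b(' . preg_quote($term, '/') . ')\b/i';
    // Definitions are keyed by the upper-cased term for case-insensitive lookup.
    self::$lookup[strtoupper($term)] = $def;
}
return preg_replace_callback($search, array($this, 'replace'), $this->content);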
"Don't do that with a regex."
Use an HTML parser, then apply a regex to the contents of HTML elements as it identifies them. That will allow you to easily operate on lots of different variants of HTML structure, valid and otherwise, without a lot of cruft and hard-to-maintain regular expressions.
Robust and Mature HTML Parser for PHP
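To make the parser-first suggestion concrete, here is a minimal sketch using PHP's bundled DOM extension. It is not the asker's code and not the only way to do this. The function name linkGlossaryTerms, the plain $glossary array (standing in for Glossary::map()), the glossary-root wrapper id, and putting the definition in a title attribute are all assumptions made for illustration. It also assumes ASCII terms and leaves character-encoding handling out.

```php
<?php
// Hypothetical helper: wraps glossary terms found in text nodes only.
// Attribute values are never text nodes, and text inside <a> is skipped.
function linkGlossaryTerms(string $html, array $glossary): string
{
    if (count($glossary) === 0) {
        return $html;
    }

    // Key definitions by upper-cased term, as the question's $lookup does.
    $lookup = array();
    foreach ($glossary as $term => $def) {
        $lookup[strtoupper((string) $term)] = $def;
    }

    // One alternation of escaped terms, longest first so longer terms win.
    $terms = array_keys($glossary);
    usort($terms, function ($a, $b) {
        return strlen((string) $b) - strlen((string) $a);
    });
    $escaped = array_map(function ($t) {
        return preg_quote((string) $t, '/');
    }, $terms);
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';

    // Parse the fragment inside a wrapper div so it can be extracted again.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML('<div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);
    if ($root === null) {
        return $html;
    }

    // Select text nodes that are not inside <a>, <script>, or <style>.
    $textNodes = $xpath->query(
        './/text()[not(ancestor::a or ancestor::script or ancestor::style)]',
        $root
    );

    foreach (iterator_to_array($textNodes) as $node) {
        // With one capture group, odd indexes hold the matched terms.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no glossary term in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('class', 'tooltip');
                $link->setAttribute('title', $lookup[strtoupper($part)]);
                $link->appendChild($doc->createTextNode($part));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the wrapper's children, dropping the html/body shell.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Called on a paragraph that has the term in a title attribute, in plain text, and inside an existing link, this should wrap only the plain-text occurrence. The regex still does the word matching; the parser just decides which text it is allowed to see.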
Personally, I prefer this answer.
HTML parsing is an interesting research topic. What do you mean by HTML? There are standards (quite a few), and then there are web pages. Most researchers do not use regular expressions to parse HTML.