Regex to replace registered trademark symbols (®)
I need some help with regex:
I have some HTML output, and I need to wrap every registered trademark symbol (®) in <sup></sup>.
I cannot insert the <sup> tag inside title and alt attributes, and obviously I don't need to wrap marks that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = `<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>`
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
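To see what the regex above actually matches on the example string, here is a quick JavaScript check (a sketch only; the question itself appears to target PHP, and the variable names are made up for illustration). It shows that the already superscripted ® comes back as its own text run, so this regex alone cannot tell it apart from an unwrapped one.

```javascript
// The example string from the question.
const original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// The question's regex: text that starts after '>' (or at the start of the
// string) and ends before '<' (or at the end of the string).
const textOutsideTags = /(?<=^|>)[^><]+?(?=<|$)/g;

const runs = original.match(textOutsideTags);
console.log(runs); // [ 'asd® asdasd. asd', '®', 'asd ' ]
```

The alt attribute is correctly skipped, but the middle run is the ® that already sits inside <sup>, which is the case the answers below have to handle.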
thanks a lot for your time!!!
Comments (4)
Well, here is a simple way, if you accept the following limitation:
Regs that have already been processed have the </sup> immediately after the ®.
The logic behind it is:
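One way to express that logic, as a hedged JavaScript sketch rather than the answerer's own code (the function name wrapUnprocessedMarks is hypothetical): match either a whole tag, which is copied through untouched, or a ® that is not immediately followed by </sup>, which gets wrapped. It assumes no attribute value contains a literal > character.

```javascript
// Hypothetical helper, not from the original answer.
function wrapUnprocessedMarks(html) {
  // Alternative 1: a complete tag, so ® inside alt/title is never touched.
  // Alternative 2: a ® that is NOT directly followed by </sup>.
  const tagOrMark = /<[^>]*>|®(?!<\/sup>)/g;
  return html.replace(tagOrMark, (m) => (m === '®' ? '<sup>®</sup>' : m));
}

const input = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
console.log(wrapUnprocessedMarks(input));
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```

Because the result always places </sup> right after every ®, running the function a second time leaves the string unchanged.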
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
Regex is not enough for what you want. First you must write code to identify whether content is an attribute value or a text node of an element. Then you must go through all the text content and apply some replace method. I am not sure what that is in PHP, but in JavaScript it would look something like:
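As a stand-in for the two steps described above, here is a dependency-free JavaScript sketch (not the answerer's code; wrapMarksInText and its variables are hypothetical names). It scans the markup character by character, tracks whether it is inside a tag or a quoted attribute value, counts open <sup> elements, and wraps ® only in plain text outside any <sup>. It does not handle comments, scripts, or CDATA, so a real HTML parser remains the safer choice.

```javascript
// Hypothetical helper: classify each character as markup or text,
// then wrap ® only when it is text outside every <sup> element.
function wrapMarksInText(html) {
  let out = '';
  let inTag = false; // true between '<' and its closing '>'
  let quote = null;  // the quote character currently open inside a tag
  let tag = '';      // the tag text collected so far
  let supDepth = 0;  // number of <sup> elements currently open

  for (const ch of html) {
    if (inTag) {
      out += ch;
      if (quote) {
        if (ch === quote) quote = null;       // attribute value ends
      } else if (ch === '"' || ch === "'") {
        quote = ch;                           // attribute value starts
      } else if (ch === '>') {
        const whole = tag + ch;
        if (/^<sup[\s>]/i.test(whole)) supDepth++;
        else if (/^<\/sup\s*>/i.test(whole)) supDepth = Math.max(0, supDepth - 1);
        inTag = false;
        tag = '';
        continue;
      }
      tag += ch;
    } else if (ch === '<') {
      inTag = true;
      tag = ch;
      out += ch;
    } else if (ch === '®' && supDepth === 0) {
      out += '<sup>®</sup>';                  // text node outside <sup>
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(wrapMarksInText('<p title="a > b ®">Acme® and <sup>® brand</sup></p>'));
// <p title="a > b ®">Acme<sup>®</sup> and <sup>® brand</sup></p>
```

Unlike a pure regex, tracking the quote state lets this version cope with a > inside an attribute value, and counting <sup> depth covers a ® that is superscripted together with other text.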
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>; this leaves tokens that are either tags or text that does not contain an already superscripted ®. Then, for each text token, ® can be replaced with <sup>®</sup>.
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse the way you'd like (again, regular expressions are not good for HTML parsing ;) )
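The tokenizing idea above can be sketched in JavaScript as follows. This is an analogue, not the answerer's PHP: it relies on String.prototype.split keeping delimiters when the pattern has a capturing group, much as preg_split does with the PREG_SPLIT_DELIM_CAPTURE flag. The function name wrapMarksByTokens is hypothetical.

```javascript
// Hypothetical helper illustrating the split-into-tokens approach.
function wrapMarksByTokens(html) {
  // The already wrapped form is listed first so it is captured as a single
  // token before the generic tag pattern can match its opening <sup>.
  const delimiters = /(<sup>®<\/sup>|<[^>]*>)/;

  return html
    .split(delimiters)
    // With one capturing group, odd indexes hold the delimiters (tags or
    // wrapped marks) and even indexes hold the plain text between them.
    .map((token, i) => (i % 2 === 1 ? token : token.replace(/®/g, '<sup>®</sup>')))
    .join('');
}

const html = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
console.log(wrapMarksByTokens(html));
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```

The same caveat applies as in the note above: a > inside an attribute value, or a <sup> that wraps more than the bare ®, will not be tokenized the way this sketch expects.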