Regex to replace the registered trademark (®) symbol

Posted on 2024-08-03 20:11:54

I need some help with a regex.

I have some HTML output and need to wrap every registered trademark symbol (®) in <sup></sup>.

I cannot insert the <sup> tag inside title and alt attributes, and obviously I don't need to wrap symbols that are already superscripted.

The following regex matches text that is not part of an HTML tag:

(?<=^|>)[^><]+?(?=<|$)
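
For reference, here is a minimal sketch of what this pattern captures when run through PHP's preg_match_all. The variable names ($html, $matches) are only illustrative. It suggests that the pattern does skip attribute values, but that it also picks up the ® that is already inside <sup>, so on its own it cannot tell the two cases apart:

```php
<?php
// Sample markup with a bare ®, an already wrapped ®, and a ® inside an attribute.
$html = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// Collect every run of text that sits between tags.
preg_match_all('/(?<=^|>)[^><]+?(?=<|$)/', $html, $matches);

// $matches[0] holds three text runs:
//   "asd® asdasd. asd", "®" (the one inside <sup>) and "asd " (with a trailing space)
print_r($matches[0]);
```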

An example of what I'm looking for:

$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

The filtered string should be:

<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>

Thanks a lot for your time!

Comments (4)

忆悲凉 2024-08-10 20:11:54

Well, here is a simple way, if you can accept the following limitation:

Any ® that has already been processed is followed directly by </sup> (optionally after whitespace).

echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);

The logic behind it is:

  1. we replace only those ® that are not followed by </sup>, and...
  2. that are not followed by a > symbol without an opening < symbol before it (which would mean the ® is inside a tag)
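
As a quick check, here is a minimal sketch that runs the same pattern on the sample string from the question. The variable names ($s, $result) are only illustrative:

```php
<?php
// Sample input from the question.
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// A ® is left alone when it is followed by </sup> (already wrapped), or when
// a ">" shows up before the next "<" (the ® sits inside a tag, e.g. in alt="").
$result = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $result, PHP_EOL;
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```

One more limitation follows from rule 2: a ® in plain text that happens to be followed by a literal > character (for example "a® > b") would be skipped as well.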

梦罢 2024-08-10 20:11:54

I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).

You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
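
Since the other answers in this thread use PHP, here is a rough sketch of the parser route using PHP's bundled DOM extension (DOMDocument and DOMXPath). The helper name wrapRegisteredMarks is hypothetical and not from the thread. The sketch assumes the input is a UTF-8 fragment and that the dom extension is installed:

```php
<?php
// Hypothetical helper (not from the thread): wrap every ® that lives in a
// text node, unless that text node is already a direct child of <sup>.
function wrapRegisteredMarks(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // do not emit warnings for sloppy markup

    // Assumption: the fragment is UTF-8. The meta tag tells libxml so.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Only text nodes are selected, so title/alt attribute values are never touched.
    $nodes = $xpath->query('//text()[contains(., "®") and not(parent::sup)]');

    foreach (iterator_to_array($nodes) as $text) {
        // Rebuild the text node as: text, <sup>®</sup>, text, ...
        $fragment = $doc->createDocumentFragment();
        foreach (explode('®', $text->nodeValue) as $i => $part) {
            if ($i > 0) {
                $fragment->appendChild($doc->createElement('sup', '®'));
            }
            if ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapRegisteredMarks('<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>'), PHP_EOL;
```

Note that the serializer may normalize the markup on the way out (for instance, how it writes the img tag or the ® character), so the result is not guaranteed to be byte-for-byte identical to the input outside the wrapped symbols.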

白云不回头 2024-08-10 20:11:54

Regex alone is not enough for what you want. First you must write code to identify whether a piece of content is an attribute value or a text node of an element. Then you must go through all the text-node content and apply some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:

content[i] = content[i].replace(/®/g, "<sup>®</sup>");

新人笑 2024-08-10 20:11:54

I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.

I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup> -- this leaves tokens that are either tags (including an already superscripted ®) or plain text. Then, in each text token, ® can be replaced with <sup>®</sup>:

$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
    [0] => <div>
    [1] => asd® asdasd. asd
    [2] => <sup>®</sup>
    [3] => asd
    [4] => <img alt="qwe®qwe" />
    [5] => </div>
)
*/

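foreach ($tokens as &$token)
{
    if ($token[0] == "<") continue; // Skip tokens that are tags (including <sup>®</sup>)
    // str_replace() swaps every ® in this text token for the wrapped version
    $token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the foreach

$result = join("", $tokens); // reassemble the string
// $result => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"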

Note that this is a naive approach, and if the input isn't formatted as expected it might not be tokenized the way you'd like (again, regular expressions are not good for HTML parsing ;) )
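
To make that caveat concrete, here is a small sketch of one input that trips up the tokenizer: an attribute value containing a > character. The $tricky string is made up for illustration. The lazy <.*?> stops at the first >, so the rest of the tag is treated as plain text and the ® inside alt gets wrapped:

```php
<?php
// Same splitting pattern as above, applied to an attribute value that contains ">".
$regex  = '/(<sup>®<\/sup>|<.*?>)/i';
$tricky = '<img alt="a > b®" />';

$tokens = preg_split($regex, $tricky, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// The tag is cut in two: '<img alt="a >' (seen as a tag) and ' b®" />' (seen as text).

foreach ($tokens as $i => $token) {
    if ($token[0] !== '<') {
        $tokens[$i] = str_replace('®', '<sup>®</sup>', $token);
    }
}

$result = implode('', $tokens);
echo $result, PHP_EOL;
// <img alt="a > b<sup>®</sup>" />   (the attribute value is now corrupted)
```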
