Find a lowercase letter immediately followed by an uppercase letter

Posted on 2024-12-26 02:08:00

My text is as below:

<font size=+2 color=#F07500><b> [ba]</font></b>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>
<ul><li><font color =#F07500> Word word word.<br></font></li></ul>
<ul><li><font color =#0B610B> Word word word wordWord.<br></font></li></ul>
<ul><li><font color =#0B610B> WordWord.<br></font></li></ul>
<br><font color =#E41B17><b>UPPERCASE LETTERS</b></font> 
<ul><li><font color =#0B610B> Word word wordWord word.<br></font><br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>
<font color =#0B610B> Word word word.<br></font></li></ul><dd><font color =#F07500>     »» Word wordWord word.<br></font>

In each of the <font color =#0B610B>...</font> elements there is a lowercase letter immediately followed by an uppercase letter. For example:

<font color =#0B610B> Word word wordWord word.<br></font>

I want to correct this error by splitting them as follows (i.e., adding a colon and a space between them):

<font color =#0B610B> Word word word: Word word.<br></font>

So far, I have been using:

(<font color =#0B610B\b[^>]*>)(.*?</font>)

to select each <font color =#0B610B>...</font> element, and it works fine, finding them one at a time.

But when I use:

(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)

it does find the error, but it selects everything between <font color =#0B610B> and a later </font> on the same line, regardless of other font-color tags in between, and so it replaces unwanted instances.

I want it to find and replace the error only within each <font color =#0B610B>...</font> pair, not grab everything that starts with <font color =#0B610B> and ends in </font>.

Are there any regular expressions to solve this problem? Many thanks in advance.

述情 2025-01-02 02:08:00

In general, regex is not a good idea for parsing HTML (if it's a once-off you might be OK).

I think this might be the reason your regex is not working.
Can you give an example of a case in which your regex fails?

One case I can think of is when there is no ([a-z][A-Z]) match within a matching <font> pair, but there is one in a neighbouring <font> pair. For example:

<font color=#0B610B>word word</font><font color=#000000>word wordWord</font>

In this case, the only valid match is <font color=#0B610B>word word</font><font color=#000000>word word followed by the rest of the string, Word</font>, and so this is what the regex matches (since if it can match, it will!)
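
The over-match can be reproduced with a short Python sketch. Python's `re` module is an assumption here (the question does not say which tool runs the regex), and the one-line sample is hypothetical, adapted from the question's text:

```python
import re

# The pattern from the question, unchanged.
naive = re.compile(r'(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)')

# Hypothetical sample: the green element is clean, its orange neighbour is not.
line = ('<font color =#0B610B> Word word word.<br></font>'
        '<font color =#F07500> Word wordWord word.<br></font>')

m = naive.search(line)

# The lazy .*? keeps expanding until it finds [a-z][A-Z], so group 2
# runs past the first </font> and into the neighbouring element.
spans_two_elements = '</font>' in m.group(2)

print(m.group(0))
print(spans_two_elements)
```

Lazy quantifiers only prefer the shortest match from a given starting point; they do not stop the engine from crossing a closing tag when that is the only way to satisfy the rest of the pattern.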

I can think of a crude workaround, but I wouldn't recommend it unless this task is a one-off, because using regex for HTML is always prone to such errors! This regex is also pretty inefficient. Try (untested):

(<font color =#0B610B\b[^>]*>)(([^<]|<(?!/font))*?[a-z])([A-Z].*?</font>)

It says, "look for the <font> tag, followed by either an angle bracket < not followed by /font, OR anything else, and again followed by the [a-z][A-Z]".
So it tries to make sure that the match doesn't go over a </font> boundary.
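
As a rough check of this workaround, here is a minimal Python sketch that applies the pattern with `re.sub`. Python itself, the helper name `fix_line`, and the replacement string are assumptions for illustration; the two sample lines are adapted from the question's text. Because of the nested group, the text after the split ends up in group 4 rather than group 3:

```python
import re

# Pattern copied from the answer above (which marks it as untested).
# The nested group makes the text after the split group 4, not group 3.
pattern = re.compile(
    r'(<font color =#0B610B\b[^>]*>)(([^<]|<(?!/font))*?[a-z])([A-Z].*?</font>)'
)

def fix_line(line: str) -> str:
    """Insert ': ' at the first lowercase/uppercase boundary in each green element."""
    return pattern.sub(r'\g<1>\g<2>: \g<4>', line)

# Green element contains the error; the red neighbour must stay untouched.
with_error = ('<ul><li><font color =#0B610B> Word word wordWord word.<br></font>'
              '<br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>')

# Green element is clean; only the orange neighbour contains the error.
neighbour_only = ('<font color =#0B610B> Word word word.<br></font></li></ul>'
                  '<dd><font color =#F07500> Word wordWord word.<br></font>')

print(fix_line(with_error))
print(fix_line(neighbour_only))
```

Each match consumes the whole element, so a single pass fixes at most one boundary per element; an element with several such errors would need repeated passes.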
