Find a lowercase letter immediately followed by an uppercase letter
My text is as below:
<font size=+2 color=#F07500><b> [ba]</font></b>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font></li></ul>
<ul><li><font color =#F07500> Word word word.<br></font></li></ul>
<ul><li><font color =#0B610B> Word word word wordWord.<br></font></li></ul>
<ul><li><font color =#0B610B> WordWord.<br></font></li></ul>
<br><font color =#E41B17><b>UPPERCASE LETTERS</b></font>
<ul><li><font color =#0B610B> Word word wordWord word.<br></font><br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>
<font color =#0B610B> Word word word.<br></font></li></ul><dd><font color =#F07500> »» Word wordWord word.<br></font>
There is a lowercase letter immediately followed by an uppercase letter in each of the <font color =#0B610B>...</font> elements. For example:
<font color =#0B610B> Word word wordWord word.<br></font>
I want to correct this error by splitting them as follows (i.e., by adding a colon and a space between them):
<font color =#0B610B> Word word word: Word word.<br></font>
So far, I have been using:
(<font color =#0B610B\b[^>]*>)(.*?</font>)
to select each instance of <font color =#0B610B>...</font>, and it works fine, finding the instances one at a time.
But when I use:
(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)
it does find matches, but it selects everything between <font color =#0B610B> and </font> on a line, regardless of any other font-color tags in between, and so replaces other, unwanted instances.
I want it to find and replace the error within each specific pair of tags, <font color =#0B610B>...</font>, rather than grabbing everything that starts with <font color =#0B610B> and ends in </font>.
Are there any regular expressions to solve this problem? Many thanks in advance.
Comments (1)
In general, regex is not a good idea for parsing HTML (if it's a once-off you might be OK).
I think this might be the reason your regex is not working.
Can you give an example of a case in which your regex fails?
One case I can think of is when there is no match ([a-z][A-Z]) within a matching <font color=#0B610B></font> pair, but there is one in a neighbouring <font></font>. For example:
<font color=#0B610B>word word</font><font color=#000000>word wordWord</font>
In this case, the only valid match is
<font color=#0B610B>word word</font><font color=#000000>word word
and the rest of the string
Word</font>
and so this is what the regex matches (since if it can match, it will!).
I can think of a crude workaround, but I wouldn't recommend it unless this task is a once-off, because using regex for HTML is always prone to such errors! This regex is also pretty inefficient. Try (untested):
It says, "look for the <font colour=xxxx> tag, followed by either an angle bracket < not followed by /font, OR anything else, and again followed by the [a-z][A-Z]". So it tries to make sure that the match doesn't go over a </font> boundary.
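The regex that the answer proposed did not survive in this copy, so what follows is only a sketch of one pattern that fits the description above, not the answer's original. It assumes Python's re module (the question does not say which tool is in use) and the exact tag spelling from the question, `<font color =#0B610B>`. The names NAIVE, TEMPERED, ELEMENT, fix_first and fix_all are made up for illustration. The second function, fix_all, is a different technique (match one whole element, then fix its body in a callback) and is not something the answer suggests.

```python
import re

# Pattern from the question: ".*?" is free to run past "</font>".
NAIVE = re.compile(r'(<font color =#0B610B\b[^>]*>)(.*?[a-z])([A-Z].*?</font>)')

# Assumed reconstruction of the described workaround: between the opening tag
# and the [a-z][A-Z] pair, accept any character that is not "<", or a "<"
# that does not start "</font" (case-sensitive).
TEMPERED = re.compile(
    r'(<font color =#0B610B\b[^>]*>)((?:[^<]|<(?!/font))*?[a-z])([A-Z])'
)

# Alternative (not from the answer): capture one whole element, then fix its body.
ELEMENT = re.compile(r'(<font color =#0B610B\b[^>]*>)(.*?)(</font>)')


def fix_first(line):
    """Insert ': ' at the first lower/upper pair inside each target element."""
    return TEMPERED.sub(r'\1\2: \3', line)


def fix_all(line):
    """Insert ': ' at every lower/upper pair inside each target element.

    Caveat: this also rewrites pairs inside nested tags or attribute values.
    """
    def repl(m):
        body = re.sub(r'([a-z])([A-Z])', r'\1: \2', m.group(2))
        return m.group(1) + body + m.group(3)
    return ELEMENT.sub(repl, line)


# Two lines taken from the question's sample text.
clean_green = ('<font color =#0B610B> Word word word.<br></font></li></ul>'
               '<dd><font color =#F07500> »» Word wordWord word.<br></font>')
bad_green = ('<ul><li><font color =#0B610B> Word word wordWord word.<br></font>'
             '<br><font color =#E41B17><b>PhD and dataBase</b></font> </li></ul>')

if __name__ == "__main__":
    print(NAIVE.sub(r'\1\2: \3', clean_green))  # edits the #F07500 element
    print(fix_first(clean_green))               # left unchanged
    print(fix_first(bad_green))                 # only the green element changes
```

fix_first changes at most one pair per element, which matches the samples in the question; if an element can contain several, fix_all (or repeating the replacement until nothing changes) may be the better fit. In an editor's find-and-replace dialog only the TEMPERED pattern would carry over, and only if that editor supports negative lookahead.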