Capturing only part of a string, without the formatting
I'm trying to capture only the digits between the <em> & </em> tags, without the <b> & </b> tags, using basic regex. I've tried to think of ways, maybe lookarounds, but I'm just not that skilled...yet. Here's an example of the raw HTML:
<em>4<b>4</b>9/<b>5</b>-<b>7</b>0</em>
Here is what I'd like the result to be:
449570
The problem is that sometimes these strings have the formatting HTML, and sometimes not. Sometimes there are extra - and / symbols, sometimes not. I'm using <em>.*<\/em> which is about as simple as it gets!
Thanks for your help :)
As has been said before, regex is probably not the easiest solution for this. But if you really want to use it, you're probably best off doing it in two passes:
The first sed operation removes all HTML tags. The second removes all non-numeric characters.
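The two passes described above can also be sketched in JavaScript rather than sed. This is only an illustration under assumptions: the input is the question's example string, and the variable names are made up for the sketch.

```javascript
// Example input taken from the question.
const html = "<em>4<b>4</b>9/<b>5</b>-<b>7</b>0</em>";

// Pass 1: remove anything that looks like an HTML tag.
const withoutTags = html.replace(/<[^>]*>/g, "");

// Pass 2: remove every character that is not a digit.
const digits = withoutTags.replace(/\D/g, "");

console.log(digits); // 449570
```

For this particular input the second pass alone would give the same result, but the first pass matters as soon as a tag name itself contains a digit (for example h1).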
First: As always, you probably shouldn't be using regex on html. There will always be edge cases it doesn't catch.
This is even more true if you're using a pure regex of some sort, and since you haven't specified anything else, I'll assume that is what you are using. So really, don't use regex.
That said, I would do this as two regexes - capture the string, then sub out any tags you don't want from the captured string (remember to match them using non-greedy matches!)
E.g. if you're in JavaScript, try this:
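A minimal sketch of the two-regex approach described above, not necessarily the exact snippet this answer had in mind. It assumes the question's example string as input, and the variable names are illustrative.

```javascript
// Example input taken from the question.
const html = "<em>4<b>4</b>9/<b>5</b>-<b>7</b>0</em>";

// Regex 1: capture everything between <em> and </em>, non-greedily.
const match = html.match(/<em>(.*?)<\/em>/);

// Regex 2: sub out tags (matched non-greedily) and any other non-digit
// character, such as - and /.
const digits = match ? match[1].replace(/<.*?>|\D/g, "") : "";

console.log(digits);
```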
This prints 449570.
EDIT :
<em>(?:(?:<b>)?[0-9]*(?:</b>)?)*</em>
EDIT 2 :
<em>(?:\D*(\d+)\D*)*?</em>
to handle non-digit characters in the mix; in fact, it looks even simpler than the first :).
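One thing to keep in mind when trying the EDIT 2 pattern in JavaScript: a capture group inside a repetition keeps only its last iteration, so the pattern matches the whole element but group 1 holds just the final run of digits. A minimal sketch, assuming the question's example input and illustrative variable names, that uses the pattern to find the element and a second replace to collect all the digits:

```javascript
// Example input taken from the question.
const html = "<em>4<b>4</b>9/<b>5</b>-<b>7</b>0</em>";

// EDIT 2 pattern, with the slash escaped for a JavaScript regex literal.
const pattern = /<em>(?:\D*(\d+)\D*)*?<\/em>/;
const match = pattern.exec(html);

// match[0] is the whole <em>...</em> element; match[1] is only the last digit run.
const lastRun = match ? match[1] : null;

// Dropping every non-digit from the full match yields all the digits.
const digits = match ? match[0].replace(/\D/g, "") : "";

console.log(lastRun); // 0
console.log(digits);  // 449570
```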