Regex to strip tags, keeping CDATA
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
Hi all,
I know how everyone loves a regex question, so here is mine. I have an XML tree within which some nodes contain CDATA. How do I return just a string containing the data?
Let's see an example:
<xml>
<node>I'm plain text.</node>
<node><![CDATA[I'm text in cdata... and may contain html, <strong>yikes!</strong>]]></node>
</xml>
Would return
I'm plain text. I'm text in cdata... and may contain html, yikes!
I've read about not parsing an irregular language with a regular one, but I'm sure this is doable. What do you reckon, guys?
Thanks,
Kevin
EDIT: This was a problem that needed a quick and dirty solution to deal with a few lines of XML. I was surprised at the initial flat refusal, but from further reading (in particular from links provided later on) I see that experienced programmers know it's something that should be avoided wherever possible. Live and learn. Thanks.
Comments (2)
Don't use regex, use an XML/HTML parser.
This issue has been beaten to death.
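For what it's worth, here is a minimal sketch of what the parser route could look like in Python, using only the standard library. It assumes the document is well-formed XML with flat `<node>` elements as in the question's sample; the helper names `strip_markup` and `extract_text` are made up for illustration. The XML parser unwraps the CDATA section, and an HTML parser then drops any tags found inside the text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects character data and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(fragment):
    """Return the text of an HTML fragment with its tags removed (hypothetical helper)."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def extract_text(xml_source):
    """Join the text of every <node>, CDATA included, separated by single spaces."""
    root = ET.fromstring(xml_source)
    pieces = []
    # Assumption: elements are named "node" and are not nested, as in the sample.
    for node in root.iter("node"):
        raw = "".join(node.itertext())  # CDATA arrives here as ordinary text
        text = strip_markup(raw).strip()
        if text:
            pieces.append(text)
    return " ".join(pieces)


SAMPLE = """<xml>
<node>I'm plain text.</node>
<node><![CDATA[I'm text in cdata... and may contain html, <strong>yikes!</strong>]]></node>
</xml>"""

print(extract_text(SAMPLE))
```

On the sample from the question this should print the string the asker wanted. Note that the HTML pass also runs over plain-text nodes, so text that happens to look like a tag would be dropped there too; skip that pass for nodes you know contain no markup.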
Look at boilerpipe for an example of how hard it is to solve this problem.