How to remove tags from HTML in a CDATA element
I have HTML in a CDATA element (the HTML is too malformed to be parsed) and I would like to remove the <a href>
tags but keep the text inside them.
I've looked into regex but still haven't found a good way to do that.
Any advice is welcome!
You could remove anything from a string that looks like an HTML link via regex. Results depend heavily on your input, but replacing
</?a\b[^>]*>
with the empty string could get you pretty far. That pattern matches both opening and closing <a> tags, so the text between them is left in place. In any case, handling HTML with regular expressions is crappy and ad hoc. If your input data set is limited and well known, and all you need is some throw-away, one-time conversion code, then crappy and ad hoc may be enough and you could get away with it.
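As a minimal sketch of that replacement in Python (the language choice, the function name `strip_anchor_tags`, and the sample strings are assumptions for illustration, not part of the original answer), the pattern can be applied with `re.sub`. The case-insensitive flag is an addition so that `<A HREF=...>` is caught too.

```python
import re

# The pattern from the answer: an optional "/", the tag name "a", a word
# boundary (so <abbr> or <article> are left alone), then everything up to ">".
ANCHOR_TAG = re.compile(r"</?a\b[^>]*>", re.IGNORECASE)


def strip_anchor_tags(html):
    """Remove opening and closing <a> tags, keeping the text between them."""
    return ANCHOR_TAG.sub("", html)


if __name__ == "__main__":
    sample = 'Read <a href="http://example.com">this page</a> now.'
    print(strip_anchor_tags(sample))
```

This will probably misfire on input where an attribute value itself contains `>`, which is one of the reasons the approach only suits limited, well-known input.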
If you are developing code that is intended to be long-lived, you should definitely look into one of the available HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.
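To illustrate the structured approach without a third-party install, here is a rough sketch that swaps in Python's standard-library `html.parser` in place of BeautifulSoup. The class `AnchorStripper` and the function `strip_anchor_tags_parsed` are hypothetical names. The sketch re-emits everything except `<a>` start and end tags. It does not repair broken markup the way a full parser library would, and it drops doctype declarations and processing instructions, which are unlikely to appear in a fragment.

```python
from html.parser import HTMLParser


class AnchorStripper(HTMLParser):
    """Re-emit the input, except that <a> start and end tags are dropped."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            # get_starttag_text() returns the tag exactly as it appeared.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            # Note: the parser lowercases tag names, so end tags come out lowercase.
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def strip_anchor_tags_parsed(html):
    """Return html with <a> tags removed and their inner text kept."""
    parser = AnchorStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_anchor_tags_parsed('<p>See <a href="x.html">the docs</a> &amp; more</p>'))
```

Because the tokenizer tracks quoted attribute values, a `>` inside an attribute should not confuse it the way it can confuse a bare regex. With BeautifulSoup installed, its `unwrap()` method on each `<a>` tag would be the more usual way to do the same thing.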