How to remove tags from a string in Python using regular expressions? (NOT in HTML)
I need to remove tags from a string in Python.
<FNT name="Century Schoolbook" size="22">Title</FNT>
What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this specifically for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags from two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.
Comments (6)
This should work:
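The snippet that originally followed this sentence did not survive on this page. As a minimal sketch, assuming the pattern was along the lines of "a `<`, then one or more characters that are neither `<` nor `>`, then a `>`", it might look like this (`TAG_RE`, `strip_tags`, and `sample` are illustrative names, not from the original answer):

```python
import re

# Assumed pattern: '<', one or more characters that are not '<' or '>', then '>'.
TAG_RE = re.compile(r"<[^<>]+>")


def strip_tags(text):
    """Return text with every <...> tag removed."""
    return TAG_RE.sub("", text)


sample = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(strip_tags(sample))  # prints: Title
```

Because the pattern requires a closing `>`, a lone `<` in ordinary text is left alone.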
To everyone saying that regexes are not the correct tool for the job:
The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one. I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
Please avoid using regex. Even though a regex will work on your simple string, you'll run into problems in the future if you get a more complex one.
You can use BeautifulSoup's get_text() feature.
Searching this regex and replacing it with an empty string should work.
Example (from python shell):
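The regex and shell session from this answer were not preserved here. The following is only a sketch of the search-and-replace idea, assuming a pattern restricted to the FNT tag from the question; the pattern and the names `FNT_RE`, `title`, and `cleaned` are hypothetical, not the answerer's own:

```python
import re

# Hypothetical pattern: matches an opening or closing FNT tag, with any attributes.
FNT_RE = re.compile(r"</?FNT\b[^>]*>")

title = '<FNT name="Century Schoolbook" size="22">Title</FNT>'

# Replace every match with the empty string.
cleaned = FNT_RE.sub("", title)
print(cleaned)  # prints: Title
```

Limiting the pattern to one tag name leaves any other markup in the string untouched, which may or may not be what you want for the two title elements.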
If it's only for parsing and retrieving the value, you might take a look at BeautifulStoneSoup.
If the source text is well-formed XML, you can use the stdlib module ElementTree:
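The code from this answer is missing on this page. A minimal sketch of the ElementTree approach, assuming the string is a single well-formed element like the one in the question (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

source = '<FNT name="Century Schoolbook" size="22">Title</FNT>'

# Parse the string into an Element; this raises ParseError if it is not well-formed XML.
element = ET.fromstring(source)

# itertext() yields the text of the element and of any nested elements, in order.
text = "".join(element.itertext())
print(text)  # prints: Title
```

Joining `itertext()` rather than reading `element.text` also picks up text inside nested tags, should the layout element contain any.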
If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.