How do I remove tags from a Python string using regular expressions? (NOT in HTML)

Posted on 2024-09-17 17:53:19

I need to remove tags from a string in Python.

<FNT name="Century Schoolbook" size="22">Title</FNT>

What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and those haven't worked for me in Python. I'm using this specifically for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags from two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.

赠我空喜 2024-09-24 17:53:19

This should work:

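import re
# Remove every <...> span; re.sub returns a new string, so keep the result.
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
cleaned = re.sub(r'<[^>]*>', '', mystring)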

To everyone saying that regexes are not the correct tool for the job:

The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.

I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
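
As a rough sketch of how the pattern above might be packaged for reuse on the two title elements, it can be compiled once and wrapped in a small helper. The name `strip_tags` and the sample string are assumptions made for illustration; nothing here is part of ArcMap's own API.

```python
import re

# Matches "<", then any run of characters other than ">", then ">".
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Return text with every <...> span removed (hypothetical helper)."""
    return TAG_RE.sub('', text)

title = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(strip_tags(title))  # Title
```

Because the pattern has no notion of nesting, it simply deletes each opening and closing tag it finds, which is all the question asks for.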

帅哥哥的热头脑 2024-09-24 17:53:19

Please avoid using regex. Even though a regex will work on your simple string, you may run into problems later if you get a more complex one.

You can use BeautifulSoup's get_text() method.

from bs4 import BeautifulSoup

text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning

print(soup.get_text())

烟─花易冷 2024-09-24 17:53:19

Searching for this regex and replacing each match with an empty string should work.

/<[A-Za-z\/][^>]*>/

Example (from python shell):

>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z/][^>]*>', '', my_string))
Title
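
To illustrate why this pattern requires a letter or a slash right after the opening bracket, here is a small comparison against the looser `<[^>]*>` pattern. The sample string and the names `LOOSE` and `STRICT` are made up for the example; the point is only that a stray `<` in ordinary text is left alone by the stricter form.

```python
import re

LOOSE = re.compile(r'<[^>]*>')            # any <...> span
STRICT = re.compile(r'<[A-Za-z/][^>]*>')  # "<" must be followed by a letter or "/"

# Hypothetical input mixing a comparison with a real tag.
sample = 'if a < b and c > d: <FNT size="22">Title</FNT>'

loose_result = LOOSE.sub('', sample)    # also swallows "< b and c >"
strict_result = STRICT.sub('', sample)  # removes only the FNT tags

print(loose_result)
print(strict_result)
```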

云之铃。 2024-09-24 17:53:19

If it's just for parsing and retrieving the value, you could take a look at BeautifulStoneSoup.

一笑百媚生 2024-09-24 17:53:19

If the source text is well-formed XML, you can use the stdlib module ElementTree:

import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # 'Title'

If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
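
A minimal sketch of how the ElementTree approach might be extended, assuming the text could contain nested tags or occasionally be malformed: `itertext()` collects text from child elements too, and `ET.ParseError` signals input that is not well-formed. The helper name `text_of` and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def text_of(markup):
    """Return all text inside a well-formed fragment, or None if it does not parse."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None  # not well-formed XML; the caller needs another approach
    # itertext() walks the element and its children in document order.
    return ''.join(root.itertext())

simple = text_of('<FNT name="Century Schoolbook" size="22">Title</FNT>')
nested = text_of('<FNT size="22">Big <BOL>bold</BOL> title</FNT>')
broken = text_of('<FNT size="22">Unclosed')

print(simple)
print(nested)
print(broken)
```

Returning None rather than raising is just one design choice; it lets the caller decide whether to fall back to BeautifulSoup or a regex for the malformed case.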
