
Stripping (XML?) tags from a document using Python

Posted on 2025-01-04 20:20:43 · 160 characters · 6 views · 0 comments

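My file contains scientist names in the following format: <scientist_names> <scientist>abc</scientist> </scientist_names>. I want to use Python to strip the scientist names out of this format. How should I do it? I would like to use regular expressions but don't know how to use them... please help.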


自由如风 2025-01-11 20:20:43

Don't use regular expressions! (All the reasons are well explained [here].)

Use an XML/HTML parser; take a look at Beautiful Soup.


南渊 2025-01-11 20:20:43

This is XML, so you should use an XML parser such as lxml rather than regular expressions (XML is not a regular language).

Here is an example:

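from lxml import etree

# Sample document in the format from the question.
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

# Parse the string and print the text of every <scientist> element.
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)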
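
If lxml is not installed, roughly the same thing can be done with the standard library. The following is only a sketch: the sample string (with a second, made-up name "def") is an assumption about what the file looks like, and `names` and `plain` are illustrative variable names.

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the question; the second name is made up.
text = (
    "<scientist_names> <scientist>abc</scientist> "
    "<scientist>def</scientist> </scientist_names>"
)

root = ET.fromstring(text)

# Collect the text of every <scientist> element, at any depth.
names = [el.text for el in root.iter("scientist")]

# Alternatively, drop every tag and keep only the document's text,
# collapsing the whitespace that sat between elements.
plain = " ".join("".join(root.itertext()).split())

print(names)
print(plain)
```

`root.iter()` walks the whole tree, so it plays the same role as the `//scientist` XPath above.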


深者入戏 2025-01-11 20:20:43

As already mentioned, this appears to be XML. In that case you should parse the document with an XML parser; I recommend lxml ( http://lxml.de ).

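Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing. SAX parsing just involves registering handlers that are called when the parser encounters a particular tag. It works well as long as the meaning of a tag does not depend on its context, and it pays off most when you have more than one type of tag to process (which may not be the case here).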
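
To illustrate the SAX style, here is a minimal sketch using the standard library's `xml.sax` rather than lxml. The handler class name `ScientistHandler` and the sample string are assumptions made for the example.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text found inside every <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means we are outside <scientist>

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


# Sample input in the format from the question.
text = "<scientist_names> <scientist>abc</scientist> </scientist_names>"

handler = ScientistHandler()
xml.sax.parseString(text.encode("utf-8"), handler)
print(handler.names)
```

Handling another tag type would only mean adding another branch to the handler methods; the document is never held in memory as a tree.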

If your input document may not be well-formed, you may want to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML


温馨耳语 2025-01-11 20:20:43

Here is a simple example that should handle the XML tags for you:

# Import the standard-library module for HTTP requests (urllib2 in Python 2):
import urllib.request

# Import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# Both imports are part of the Python standard library.

# Download the file if it is not on the same machine; otherwise just open a local path:
response = urllib.request.urlopen('http://www.somedomain.com/somexmlfile.xml')
# Read the response body:
data = response.read()
# Close the connection because we don't need it anymore:
response.close()
# Parse the XML you downloaded:
dom = parseString(data)
# Retrieve the first XML tag (<tag>data</tag>) that the parser finds with the given tag name,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# Strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# Print the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# Print just the data:
print(xmlData)

If anything is unclear, just let me know.
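
A possible variant of the above, sketched under a few assumptions: the XML is already in a string (no download), there may be several `<scientist>` elements, and some may carry attributes. Instead of string replacement it reads the text nodes directly. The helper name `element_text`, the sample data and the `id` attribute are made up for illustration.

```python
from xml.dom.minidom import parseString

# Sample input; the second element and its attribute are hypothetical.
data = (
    "<scientist_names> <scientist>abc</scientist> "
    "<scientist id='2'>def</scientist> </scientist_names>"
)
dom = parseString(data)


def element_text(element):
    """Concatenate the text nodes that are direct children of an element."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


# Extract the text of every <scientist> element, not just the first one.
names = [element_text(el) for el in dom.getElementsByTagName("scientist")]
print(names)
```

Reading the text nodes keeps working when a tag has attributes, where replacing the literal string `<scientist>` would leave the opening tag in place.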
