Strip (XML?) tags from a document using Python

Posted on 2025-01-04 20:20:43

I have a file that contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>

I want to use Python to strip the tags and pull the scientists' names out of this format. How should I do it?
I would like to use regular expressions but don't know how to use them. Please help.

Comments (4)

自由如风 2025-01-11 20:20:43

Do not use regular expressions! (All the reasons are well explained [here].)

Use an XML/HTML parser; take a look at BeautifulSoup.
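
For input this small, the parser route can be very short. Here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup (a substitution made only so the example has no third-party dependency). The sample string mirrors the question; the second name "xyz" and the helper name scientist_names are hypothetical, added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input modeled on the question; "xyz" is a hypothetical second entry.
text = """<scientist_names>
<scientist>abc</scientist>
<scientist>xyz</scientist>
</scientist_names>"""


def scientist_names(xml_text):
    """Return the text of every <scientist> element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("scientist")]


names = scientist_names(text)
print(names)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring(xml_text), assuming the file is well-formed XML.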

南渊 2025-01-11 20:20:43

This is XML, and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).

Here is an example:

from lxml import etree

text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""

# Parse the string into an element tree and select every <scientist> element.
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)

深者入戏 2025-01-11 20:20:43

As noted, this appears to be XML. In that case, you should use an XML parser to parse this document; I recommend lxml ( http://lxml.de ).

Given your requirements, you may find SAX-style parsing more convenient than DOM-style: with SAX you simply register handlers that are called when the parser encounters a particular tag. This works well as long as the meaning of a tag does not depend on its context, and it pays off when you have more than one type of tag to process (which may not be the case here).
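
To make the handler idea concrete, here is a rough sketch using the standard library's xml.sax module (not lxml, which has its own event-style interfaces). The class name ScientistHandler and the sample input are assumptions made for illustration only.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collects the text found inside each <scientist> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # None means "currently outside <scientist>"

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist":
            self.names.append("".join(self._buffer))
            self._buffer = None


# Sample input modeled on the question.
text = b"<scientist_names><scientist>abc</scientist></scientist_names>"

handler = ScientistHandler()
xml.sax.parseString(text, handler)
print(handler.names)
```

The handler never holds the whole document in memory, which is the main reason to prefer this style for large files; for a file as small as the one in the question, a DOM-style parse is probably simpler.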

If your input document may not be well formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML

温馨耳语 2025-01-11 20:20:43

Here is a simple example that should handle the XML tags for you:

# Import the function used to make HTTP requests (Python 3):
from urllib.request import urlopen

# Import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# Both imports are part of the Python standard library.

# Download the file if it is not on the same machine; otherwise just open a local path:
with urlopen('http://www.somedomain.com/somexmlfile.xml') as response:
    # Read the response body; the connection is closed when this block ends:
    data = response.read()
# Parse the XML you downloaded:
dom = parseString(data)
# Retrieve the first XML tag (<tag>data</tag>) that the parser finds with name tagName,
# in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
# Strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<scientist>', '').replace('</scientist>', '')
# Print the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# Print just the data:
print(xmlData)

If you find anything unclear, just let me know.
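
One caveat with stripping tags via string replacement: it only matches a bare <scientist> tag, so an element carrying attributes would slip through. A possible variation, sketched here under the assumption that the XML is already available as a string, reads the text nodes directly and handles every <scientist> element instead of only the first. The helper element_text, the second entry "xyz", and its id attribute are hypothetical.

```python
from xml.dom.minidom import parseString

# Sample input modeled on the question; the second entry and its
# attribute are hypothetical, added to show that attributes do not matter.
text = (
    "<scientist_names>"
    "<scientist>abc</scientist>"
    '<scientist id="2">xyz</scientist>'
    "</scientist_names>"
)


def element_text(node):
    """Concatenate the text nodes that are direct children of node."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


dom = parseString(text)
names = [element_text(el) for el in dom.getElementsByTagName("scientist")]
print(names)
```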
