Extracting text from nested tags in XML using BeautifulSoup in Python
I am trying to extract the text from nested tags. For example, the XML has this form:
<thread id = 1_1>
<post id = 1>
<title>
<ne>MediaPortal</ne> Install Guide
</title>
<content>
<ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites
<ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
front-end. It does everything you can ask for in a media center: video
playback, music playback, photo viewing, weather, TV tuning and recording,
etc. It has wide community support and thanks to it's excellent plug-in
and skinning framework, there are lots of community-developed extensions
you can pick and choose to make it your own. It is far more configurable
than <ne>Windows Media Center</ne>, and it works out-of-the-box with the
<ne>MCE</ne> remote. And because it provides so much more configuration
some find it a daunting task to install and configure. Therefore, this
guide will help alleviate some of that burden and help get a
<ne>MediaPortal</ne> installation up & running. This guide is not
intended to replace the wonderful <ne>MediaPortal</ne> documentation, but
rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
a quick and easy set-up guide. If you need more details on configuration
</content>
</post>
</thread>
I need to extract the data inside the <ne> tags and save it in a separate file. I am able to do that, and afterwards I extract the <ne> tags out of the BeautifulSoup object. Now I want to extract the text from the <title> and <content> tags and put it in a separate file. Any suggestions on how to achieve this?
After extracting the <ne> tags out of the soup object, if I do
for title in soup.find_all('title'):
    print(title.string)
it prints None on the console for every title tag that contained <ne> tags before they were extracted.
Comments (1)
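From the BeautifulSoup documentation: tag.string is only available when a tag has exactly one child and that child is a string (or a single tag that itself has a .string); otherwise it is None. In your case, however, the <title> tag has more than one child, so tag.string is None. Hence you cannot use tag.string here. You can still use tag.contents or tag.text.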
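The difference between these attributes can be seen in a minimal sketch. It assumes the bs4 package is installed and uses a shortened, made-up sample modeled on the question's data. It works on the <content> element and the built-in "html.parser" so that the result does not depend on how an HTML parser handles the HTML-specific <title> element; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample modeled on the question's data (an assumption, not the real file).
sample = """<post>
<content>
<ne>MediaPortal</ne> is an open-source <ne>HTPC</ne> front-end.
</content>
</post>"""

soup = BeautifulSoup(sample, "html.parser")
content = soup.find("content")

# .string is None because <content> has several children (text nodes and <ne> tags).
string_value = content.string

# .contents lists the direct children, both text nodes and nested tags.
children = content.contents

# .text joins all nested text, including the text inside <ne>;
# split/join only normalizes the whitespace left over from the line breaks.
full_text = " ".join(content.text.split())

# Text of every nested <ne> tag, e.g. for writing to a separate file.
entities = [ne.text for ne in content.find_all("ne")]

print(string_value)
print(full_text)
print(entities)
```

Here string_value is None, full_text holds the whole sentence with the <ne> text included, and entities holds the two <ne> values, so either list can be written to its own file without extracting tags from the soup first.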