Extracting text from nested tags in XML with BeautifulSoup in Python

Posted on 2024-12-17 04:16:31

I am trying to extract the text from nested tags. For example, the XML has this form:

<thread id = 1_1>
  <post id = 1>
    <title>
      <ne>MediaPortal</ne> Install Guide
    </title>
    <content>
      <ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites 
      <ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
      front-end. It does everything you can ask for in a media center: video 
      playback, music playback, photo viewing, weather, TV tuning and recording, 
      etc. It has wide community support and thanks to it's excellent plug-in 
      and  skinning framework, there are lots of community-developed extensions 
      you can  pick and choose to make it your own. It is far more configurable 
      than <ne>Windows Media Center</ne>, and it works out-of-the-box with the 
      <ne>MCE</ne> remote. And because it provides so much more configuration 
      some find it a daunting task to install and configure. Therefore, this 
      guide will help alleviate some of that burden and help get a 
      <ne>MediaPortal</ne> installation up &amp; running. This guide is not 
      intended to replace the wonderful <ne>MediaPortal</ne> documentation, but 
      rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
      a quick and easy set-up guide. If you need more details on configuration
    </content>
  </post>
</thread>

I need to extract the data inside the <ne> tags and save it in a separate file. I can already do that, and afterwards I extract the <ne> tags out of the BeautifulSoup object. Now I want to extract the text from the <title> and <content> tags and put it in a separate file. Any suggestions on how to achieve this?

After extracting the <ne> tags from the soup object, if I run

for title in soup.findAll('title'):
    print(title.string)

it prints None for every title that contained <ne> tags before they were extracted.


Comments (1)

怎樣才叫好 2024-12-24 04:16:31

From the BeautifulSoup documentation:

For your convenience, if a tag has only one child node,
and that child node is a string, the child node is made
available as tag.string, as well as tag.contents[0].
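
The quoted rule can be checked on a small fragment. This is only a sketch: it assumes the bs4 package (Beautiful Soup 4) with Python's built-in html.parser, and the fragment is a hypothetical one modelled on the <content> element in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's <content> element.
fragment = "<content><ne>HTPC</ne> front-end</content>"
soup = BeautifulSoup(fragment, "html.parser")

ne = soup.find("ne")            # exactly one child, and it is a string
content = soup.find("content")  # two children: an <ne> tag and a string

print(repr(ne.string))        # 'HTPC'
print(repr(content.string))   # None
print(len(content.contents))  # 2
```

The <ne> tag has a single string child, so .string returns it. The <content> tag has two children, so .string has nothing unambiguous to return and gives None.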

However, in your case:

>>> t = soup.find('title')
>>> t
<title><ne>MediaPortal</ne> Install Guide</title>

This <title> has two children (an <ne> tag and a string), so tag.string is None. You can still use tag.contents or tag.text:

>>> t.contents
[<ne>MediaPortal</ne>, u' Install Guide']
>>> t.text
u'MediaPortalInstall Guide'
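
Putting this together for the original goal (text of <title> and <content> saved to separate files), here is a minimal sketch rather than a definitive solution. It assumes bs4 with the "xml" parser, which requires lxml to be installed. SAMPLE is a shortened, well-formed stand-in for the question's data, and the names tag_text, extract_texts and write_texts are hypothetical helpers. In bs4, get_text() keeps the whitespace between pieces, so the sketch collapses it explicitly.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Shortened, well-formed stand-in for the XML in the question (assumption).
SAMPLE = """\
<thread id="1_1">
  <post id="1">
    <title>
      <ne>MediaPortal</ne> Install Guide
    </title>
    <content>
      <ne>MediaPortal</ne> is an open-source <ne>HTPC</ne> front-end.
    </content>
  </post>
</thread>
"""


def tag_text(tag):
    """Return all text under `tag`, nested tags included, whitespace collapsed."""
    return " ".join(tag.get_text().split())


def extract_texts(xml, tag_names=("title", "content")):
    """Map each tag name to the list of texts found for that tag."""
    soup = BeautifulSoup(xml, "xml")  # the "xml" parser needs lxml installed
    return {name: [tag_text(t) for t in soup.find_all(name)] for name in tag_names}


def write_texts(texts, out_dir):
    """Write one file per tag name (title.txt, content.txt), one text per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, lines in texts.items():
        (out_dir / f"{name}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")


texts = extract_texts(SAMPLE)
print(texts["title"])    # ['MediaPortal Install Guide']
print(texts["content"])  # ['MediaPortal is an open-source HTPC front-end.']
```

Calling write_texts(texts, "extracted") would then create extracted/title.txt and extracted/content.txt; the directory name is only an example. Because get_text() walks nested tags, the <ne> tags do not need to be extracted from the soup first.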