Help extracting this content with Beautiful Soup

Posted on 2024-11-24 02:50:15

I am trying to extract data from a site whose markup is in this format:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
<div id=storytext class=storytext> 
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
..... extra stuff
</div>  **Main Content**
</div>
</div>

Note that the main content can contain other tags, but I want the entire content as a single string.

So this is what I did:

_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null

The idea was that _divTag would then hold only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?

冬天旳寂寞 2024-12-01 02:50:15

Just use _divTag.contents[2].

Your formatting may have misled you: this text does not belong to the innermost div tag, as innerdiv.text, innerdiv.contents or innerdiv.findChildren() will show you. It is a sibling of that div, that is, a direct child of the storytext div.

It makes things clearer if you indent your original markup:

<div id=storytextp class=storytextp align=center style='padding:10px;'> 
  <div id=storytext class=storytext> 
    <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'> 
      ..... extra stuff
    </div>  **Main Content**
  </div>
</div>
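
To check this on the sample markup, here is a minimal sketch. It assumes Beautiful Soup 4 (the bs4 package) with Python's built-in "html.parser"; the names html, soup, story and main_text are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# The markup from the question, pasted as a string for illustration.
html = """<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div>  **Main Content**
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")

# Show each direct child of the storytext div next to its index:
# a whitespace string, the inner div, then the main-content string.
for index, child in enumerate(story.contents):
    print(index, repr(child))

# The main content is the third child, a sibling of the inner div.
main_text = story.contents[2].strip()
print(main_text)
```

The index 2 depends on the whitespace text node that precedes the inner div, so it is worth printing the children once before relying on a fixed position.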

(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? The newlines? Anyway, the BS philosophy is not to edit the parse tree, but simply to ignore the 99.9% that you don't care about. The BS documentation is at http://www.crummy.com/software/BeautifulSoup/documentation.html.)
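
Since the question says the main content may itself contain tags, a single index only returns the first node after the inner div. One possible way to get everything as one string without editing the tree is to join all siblings that follow the inner div. This is a sketch under assumptions: it uses Beautiful Soup 4 (bs4) and its next_siblings generator, and the sample markup with the b and i tags is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: same layout as the question, but the main
# content contains nested tags.
html = """<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style'>..... extra stuff</div>
Main <b>Content</b> with <i>tags</i>
</div>"""

soup = BeautifulSoup(html, "html.parser")
story = soup.find("div", id="storytext")
inner = story.find("div")  # the first div, holding the extra stuff

# Join every node that follows the inner div; the parse tree is not modified.
main_html = "".join(str(node) for node in inner.next_siblings).strip()
print(main_html)  # prints: Main <b>Content</b> with <i>tags</i>
```

This ignores the unwanted div instead of replacing it, in line with the advice above.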
