Help extracting this content with Beautiful Soup
I am trying to extract data from a site whose markup is in this format:
<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div> **Main Content**
</div>
</div>
Note that the main content can contain other tags, but I want the entire content as a single string.
So this is what I did:
_divTag = data.find( "div" , id = "storytext" )
innerdiv = _divTag.find( "div" ) # find the first div tag
innerdiv.contents[0].replaceWith("") # replace with null
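I expected _divTag to then hold only the main content, but this does not work. Can anybody tell me what mistake I am making and how I should extract the main content?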
Comments (1)
Just use
_divTag.contents[2]
Your formatting may have misled you: this text does not belong to the innermost div tag, as innerdiv.text, innerdiv.contents, or innerdiv.findChildren() will show you.
It makes things clearer if you indent your original XML:
<div id=storytextp class=storytextp align=center style='padding:10px;'>
    <div id=storytext class=storytext>
        <div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
            ..... extra stuff
        </div>
        **Main Content**
    </div>
</div>
(PS: I'm not clear what the intent of your innerdiv.contents[0].replaceWith("") was. To squelch the attributes? Newlines? Anyway, the BS philosophy is not to edit the parse tree, but simply to ignore the 99.9% that you don't care about. The BS documentation is at http://www.crummy.com/software/BeautifulSoup/documentation.html.)
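Since the question says the main content may itself contain tags, a single index such as contents[2] returns only the first text node. One possible way to collect everything that follows the inner div is sketched below. It assumes the newer bs4 package with the built-in "html.parser" (the original code looks like Beautiful Soup 3); the function name extract_main_content and the <b> tag in the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the <b> tag is an illustrative
# addition to show main content that contains a nested tag.
HTML = """\
<div id=storytextp class=storytextp align=center style='padding:10px;'>
<div id=storytext class=storytext>
<div class='a2a_kit a2a_default_style' style='float:right;margin-left:10px;border:none;'>
..... extra stuff
</div> Main <b>Content</b>
</div>
</div>
"""


def extract_main_content(html):
    """Return everything inside #storytext that follows its first child div."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", id="storytext")
    inner = story.find("div")  # the a2a_kit div holding the "extra stuff"
    # The main content consists of the siblings after the inner div,
    # not of anything inside it. Tags are serialised back to markup.
    return "".join(str(node) for node in inner.next_siblings).strip()


if __name__ == "__main__":
    print(extract_main_content(HTML))
```

This leaves the parse tree untouched, in line with the advice above: nothing is replaced or deleted, the unwanted div is simply skipped.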