如何使用 BeautifulSoup 从 HTML 中删除评论标签?
我一直在玩BeautifulSoup,非常棒。我的最终目标是尝试从页面中获取文本。我只是想从正文中获取文本,特殊情况是从 或
标签获取标题和/或 alt 属性。
到目前为止,我有这个EDITED &更新的当前代码
:
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1)对于我的特殊情况,您建议最好的方法是什么,以不从上面列出的两个标签中排除这些属性?如果这样做太复杂,那么它就没有第 2 步那么重要。
2)我想删除 标签以及它们之间的所有内容。我该怎么办呢?
问题编辑
@jathanism:这里有一些我试图删除但仍然保留的评论标签,即使我使用你的示例
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
I have been playing with BeautifulSoup, which is great. My end goal is to try and just get the text from a page. I am just trying to get the text from the body, with a special case to get the title and/or alt attributes from <a>
or <img>
tags.
So far I have this EDITED & UPDATED CURRENT CODE
:
soup = BeautifulSoup(page)
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest the best way for my special case to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as doing #2.
2) I would like to strip<!-- -->
tags and everything in between them. How would I go about that?
QUESTION EDIT
@jathanism: Here are some comment tags that I have tried to strip, but remain, even when I use your example
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
直接从 BeautifulSoup 文档 中,您可以轻松删除注释(或任何东西)使用
extract()
:Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using
extract()
:这可能是底层 SGML 解析器的问题:请参阅 http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps。您可以使用
markupMassage
正则表达式来覆盖它 - 直接来自文档:This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a
markupMassage
regex -- straight from the docs:如果您正在 BeautifulSoup 版本 3 中寻找解决方案 BS3 Docs - 评论
If you are looking for solution in BeautifulSoup version 3 BS3 Docs - Comment
如果突变不适合你,你可以
if mutation isn't your bag, you can