Extracting readable text from HTML using Python?
I know about utils like html2text, BeautifulSoup etc. but the issue is that they also extract javascript and add it to the text making it tough to separate them.
htmlDom = BeautifulSoup(webPage)
htmlDom.findAll(text=True)
Alternately,
from stripogram import html2text
extract = html2text(webPage)
Both of these extract all the JavaScript on the page as well, which is undesired.
I just want to extract the readable text that you could copy from your browser.
If you want to avoid extracting any of the contents of script tags with BeautifulSoup, a non-recursive findAll that filters out script tags will do that for you, getting the root's immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively, e.g. as a generator. I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
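The recursive generator idea described above could look roughly like the sketch below. It is not the answer's original code: it assumes the modern bs4 package, uses the .children iterator in the role the answer gives to childGenerator, and the names get_strings and sample_html are made up for illustration. Skipping HTML comments is an extra assumption on my part.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def get_strings(node):
    """Yield non-empty text strings under node, in document order, skipping <script>."""
    for child in node.children:
        if isinstance(child, Tag):
            if child.name == "script":
                continue  # do not descend into script tags at all
            yield from get_strings(child)  # recurse into every other tag
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:  # drop whitespace-only strings
                yield text


# Hypothetical sample page, only for demonstration.
sample_html = """<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><p>Hello <b>world</b></p><script>alert("hi");</script></body></html>"""

soup = BeautifulSoup(sample_html, "html.parser")
strings = list(get_strings(soup))
print(strings)
```

Because the filtering happens while walking the tree, nothing inside a script tag is ever visited, and the order of the yielded strings follows the order of the document.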
Using BeautifulSoup, something along these lines:
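The snippet that belonged here did not survive, so the following is only one possible way to do it, not a reconstruction. It assumes bs4, keeps the question's approach of collecting every string, and then drops the strings whose parent is a script or style tag. The name visible_text, the NON_VISIBLE_PARENTS set, and the choice to also drop style contents and comments are my assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Assumption: strings inside these tags are never shown to the reader.
NON_VISIBLE_PARENTS = {"script", "style"}


def visible_text(html):
    """Return the page text with script/style contents and comments filtered out."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = []
    for string in soup.find_all(string=True):
        if isinstance(string, Comment):
            continue  # HTML comments are not visible in a browser
        if string.parent.name in NON_VISIBLE_PARENTS:
            continue  # JavaScript or CSS source, not readable text
        collapsed = " ".join(string.split())  # normalise runs of whitespace
        if collapsed:
            pieces.append(collapsed)
    return " ".join(pieces)


# Hypothetical sample page, only for demonstration.
page = (
    "<html><head><style>p {color: red}</style></head><body><!-- note -->"
    '<p>Readable   text</p><script>var a = "x";</script><p>More</p></body></html>'
)
print(visible_text(page))
```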
You can remove the script tags in BeautifulSoup before extracting the text, something like:
Removing Elements
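As a rough sketch of removing the script tags first and extracting the text afterwards (assuming bs4; the answer's own snippet is missing, so the details here are mine): decompose() deletes a tag and its contents from the tree, and extract() would work as well if you want to keep the removed tag around. The function name text_without_scripts is hypothetical, and removing style tags too is an assumption.

```python
from bs4 import BeautifulSoup


def text_without_scripts(html):
    """Strip <script> and <style> elements from the tree, then return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for soup.find_all(...).
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and everything inside it
    # Join the remaining strings with single spaces, dropping empty ones.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample markup, only for demonstration.
snippet = "<body><p>Hello <b>world</b></p><script>alert(1)</script><p>Bye</p></body>"
print(text_without_scripts(snippet))
```

This variant changes the parsed tree, so run it on a soup you do not need in its original form afterwards.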
Try it out:
http://code.google.com/p/boilerpipe/
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/