Extracting readable text from HTML using Python?

Posted on 2024-09-07 18:12:05

I know about utilities such as html2text and BeautifulSoup, but the issue is that they also extract the JavaScript on the page and mix it into the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is undesired.

I just want to extract the readable text that you could copy from your browser.

泪冰清 2024-09-14 18:12:05

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, returning the root's immediate children that are not script tags (a separate htmlDom.findAll(recursive=False, text=True) returns the strings that are immediate children of the root). To cover the whole tree you need to do this recursively, e.g., as a generator:

# Named equivalent of the lambda above; getStrings below does its own check.
def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):        # then it's a tag
            if s.name == 'script':    # skip it!
                continue
            for x in getStrings(s):   # recurse into the tag's children
                yield x
        else:                         # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
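
For readers on Python 3, the same recursive idea can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's code: it assumes the newer bs4 package (where `.children` plays the role of `childGenerator`), uses the built-in "html.parser" backend, and additionally skips style tags and HTML comments, which the answer above does not mention. The names `get_strings` and `SKIPPED_TAGS` are made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: <style> is skipped as well as <script>.
SKIPPED_TAGS = {"script", "style"}


def get_strings(root):
    """Yield text nodes under root in document order, skipping SKIPPED_TAGS."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name in SKIPPED_TAGS:
                continue  # do not descend into script/style
            yield from get_strings(child)
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            yield str(child)


html = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
strings = list(get_strings(soup))
text = "".join(strings)
```

Because the generator yields raw text nodes, any whitespace cleanup (joining with spaces, collapsing blank lines) is left to the caller.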

当梦初醒 2024-09-14 18:12:05

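Using BeautifulSoup, something along these lines:

# Python 2 code: `unicode` and the print statement do not exist in Python 3.
def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, (unicode, str)):
        # Text node: turn newlines into spaces and collapse runs of spaces.
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br":
        return "\n"    # a line break becomes a newline
    if t.name.lower() == "script":
        return "\n"    # script contents are dropped
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    # Strip leading/trailing whitespace from every line of the result.
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print extract_text(htmlDom)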
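
If a third-party dependency is not wanted, a similar result can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a different technique from the answer above, shown only as a rough Python 3 illustration: it drops script and style bodies (skipping style is an assumption), turns br into a newline, and collapses whitespace. The names `ReadableTextParser` and `extract_readable_text` are hypothetical.

```python
import re
from html.parser import HTMLParser


class ReadableTextParser(HTMLParser):
    """Collect text outside <script>/<style>; <br> becomes a newline."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            # Newlines in the markup are not line breaks; fold them into spaces.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_readable_text(html):
    parser = ReadableTextParser()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Collapse repeated spaces and trim each line.
    return "\n".join(" ".join(line.split()) for line in lines)


sample = "<p>Hello   <b>world</b><br>second\n line</p><script>var x = 1;</script>"
result = extract_readable_text(sample)
```

Unlike a real browser, this sketch does not insert line breaks at block-level elements such as p or div; only br produces a newline, as in the answer above.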

难忘№最初的完美 2024-09-14 18:12:05

You can remove the script tags in Beautiful Soup before extracting the text, something like:

for script in soup("script"):
    script.extract()

See "Removing Elements" in the Beautiful Soup documentation.
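
Put together end to end, the removal approach might look like the sketch below. It assumes the bs4 package with the built-in "html.parser" backend, removes style tags as well (an assumption beyond the answer), and uses `decompose()`, which discards the tag, where the answer uses `extract()`, which removes the tag and returns it. The sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Demo</title><script>var tracker = 1;</script>
<style>p { color: red; }</style></head>
<body><p>Readable text.</p><script>alert("hi");</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup is shorthand for find_all(); a list matches any of the names.
for tag in soup(["script", "style"]):
    tag.decompose()

# Join the remaining text nodes, one per line, dropping whitespace-only nodes.
text = soup.get_text(separator="\n", strip=True)
```

Note that the page title is still part of the output, since it is ordinary text inside the head element; remove "title" in the same loop if that is unwanted.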

稍尽春風 2024-09-14 18:12:05

Try these:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
