Extracting readable text from HTML using Python?

Posted 2024-09-07 18:12:05, 342 characters, 7 views, 0 comments

I know about utilities like html2text and BeautifulSoup, but the problem is that they also extract JavaScript and add it to the text, which makes it hard to separate the two.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

Alternatively,

from stripogram import html2text
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is undesired.

I just want to extract the readable text that you could copy from your browser.

Comments (4)

泪冰清 2024-09-14 18:12:05

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

will do that for you, getting the root's immediate children which are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get strings that are immediate children of the root). You need to do this recursively; e.g., as a generator:

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
    for s in root.childGenerator():
        if hasattr(s, 'name'):      # then it's a tag
            if s.name == 'script':  # skip it!
                continue
            for x in getStrings(s):
                yield x
        else:                       # it's a string!
            yield s

I'm using childGenerator (in lieu of findAll) so that I can just get all the children in order and do my own filtering.
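For readers on the newer bs4 package, the recursive generator above can be sketched roughly as follows. This is a hedged adaptation, not the answerer's code: it assumes bs4 is installed, uses `.children` and `isinstance` checks instead of `childGenerator` and `hasattr`, and additionally skips `<style>` tags and HTML comments, which is an assumption about what counts as "readable". The names `get_strings`, `readable_text`, and `SKIPPED_TAGS` are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Assumption: <style> blocks are as unwanted as <script> blocks.
SKIPPED_TAGS = {"script", "style"}


def get_strings(root):
    """Yield text nodes under root in document order, skipping SKIPPED_TAGS."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name in SKIPPED_TAGS:
                continue  # do not descend into <script> or <style>
            yield from get_strings(child)
        elif isinstance(child, Comment):
            continue  # HTML comments are not shown by the browser
        elif isinstance(child, NavigableString):
            yield str(child)


def readable_text(html):
    """Join the non-empty, stripped text nodes with newlines."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = (s.strip() for s in get_strings(soup))
    return "\n".join(p for p in pieces if p)


sample = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><!-- note --></body></html>"
)
print(readable_text(sample))
```

Joining with newlines puts every text node on its own line, so inline elements such as `<b>` are split from the surrounding sentence; joining with a space instead may read better for prose-heavy pages.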

当梦初醒 2024-09-14 18:12:05

Using BeautifulSoup, something along these lines:

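# Updated for Python 3; the original used the Python 2 `unicode` type and print statement.
# Assumes bs4, where text nodes are subclasses of str.
def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, str):  # text node: collapse runs of whitespace
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br":  # line break becomes a newline
        return "\n"
    if t.name.lower() == "script":  # drop script contents
        return "\n"
    return "".join(extract_text(c) for c in t)

def extract_text(t):
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))

print(extract_text(htmlDom))

难忘№最初的完美 2024-09-14 18:12:05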

You can remove script tags in Beautiful Soup, something like:

for script in soup("script"):
    script.extract()

See "Removing Elements" in the Beautiful Soup documentation.
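Put together as a complete script, the remove-then-extract idea might look like the sketch below. It assumes the bs4 package and its built-in `html.parser` backend; it uses `decompose()` in place of `extract()` because the removed tags are not needed afterwards, and it also drops `<style>` tags, which goes beyond what the answer states. The function name `visible_text` is hypothetical.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup object is shorthand for find_all(); the result is a
    # list, so removing tags while looping over it is safe.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip=True trims each text node and drops the empty ones.
    return soup.get_text(separator="\n", strip=True)


page = (
    "<html><head><title>Demo</title><script>alert('x');</script></head>"
    "<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
print(visible_text(page))
```

This mutates the parsed tree, so parse the page again if the script tags are needed later.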
