HTML truncation in Python

Posted 2024-10-17 12:06:10

Is there a pure-Python tool to take some HTML and truncate it as close to a given length as possible, but make sure the resulting snippet is well-formed? For example, given this HTML:

<h1>This is a header</h1>
<p>This is a paragraph</p>

it would not produce:

<h1>This is a hea

but:

<h1>This is a header</h1>

or at least:

<h1>This is a hea</h1>

I can't find one that works, though I found one that relies on pullparser, which is both obsolete and dead.

寒尘 2024-10-24 12:06:10

I don't think you need a full-fledged parser. You only need to tokenize the input string into tokens of the following kinds:

text
open tag
close tag
self-closing tag
character entity

Once you have a stream of tokens like that, it's easy to use a stack to keep track of what tags need closing. I actually ran into this problem a while ago and wrote a small library to do this:

https://github.com/eentzel/htmltruncate.py

It's worked well for me, and handles most of the corner cases well, including arbitrarily nested markup, counting character entities as a single character, returning an error on malformed markup, etc.

It will produce:

<h1>This is a hea</h1>

on your example. This could perhaps be changed, but it's hard in the general case - what if you're trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?
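
As a rough illustration of the tokenize-and-stack idea described above, here is a minimal sketch. It is not the code from htmltruncate.py; the function name `truncate_tokens`, the regular expressions, and the list of void tags are assumptions made for this example, and comments or attribute values containing `>` are not handled.

```python
import re

# One token is a tag-like chunk, a character entity, or any single character.
TOKEN_RE = re.compile(r"<[^>]*>|&#?\w+;|.", re.DOTALL)
# Splits a tag into: optional leading "/", tag name, optional trailing "/".
TAG_RE = re.compile(r"<(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")
# Tags that never take a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


def truncate_tokens(html, limit):
    """Keep at most `limit` visible characters, then close any open tags."""
    if limit <= 0:
        return ""
    out = []
    open_tags = []  # stack of tag names still waiting for a closing tag
    visible = 0     # text characters and entities emitted so far
    for match in TOKEN_RE.finditer(html):
        token = match.group(0)
        if len(token) > 1 and token.startswith("<"):
            tag = TAG_RE.fullmatch(token)
            if tag:
                closing, name, self_closing = tag.groups()
                name = name.lower()
                if closing:
                    # A closing tag must match the innermost open tag.
                    if not open_tags or open_tags[-1] != name:
                        raise ValueError("unexpected closing tag: %s" % token)
                    open_tags.pop()
                elif not self_closing and name not in VOID_TAGS:
                    open_tags.append(name)
            # Tags (and tag-like chunks such as comments) are copied uncounted.
            out.append(token)
            continue
        # Plain character or entity: each counts as one visible character.
        out.append(token)
        visible += 1
        if visible >= limit:
            break
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % name for name in reversed(open_tags))
    return "".join(out)


print(truncate_tokens("<h1>This is a header</h1>\n<p>This is a paragraph</p>", 13))
# prints: <h1>This is a hea</h1>
```

Because entities are matched as a single token, `&amp;` counts as one character, and a closing tag that does not match the top of the stack raises `ValueError` rather than producing broken output.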

ゃ懵逼小萝莉 2024-10-24 12:06:10

If you're using Django, you can simply do this:

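from django.utils import text, html


class class_name():
    def trim_string(self, stringf, limit, offset=0):
        # Plain slice; not HTML-aware.
        return stringf[offset:limit]

    def trim_html_words(self, markup, limit, offset=0):
        # Truncate to `limit` words and close any tags left open.
        # Note: truncate_html_words exists only in older Django releases;
        # newer ones provide text.Truncator(markup).words(limit, html=True).
        return text.truncate_html_words(markup, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strip all tags, leaving only the text.
        return html.strip_tags(htmls)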

Anyway, here's the code for truncate_html_words from Django:

import re

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out

差↓一点笑了 2024-10-24 12:06:10

I found the answer by slacy very helpful and would upvote it if I had the reputation. However, there is one extra thing to note. In my environment I had html5lib installed as well as BeautifulSoup4. BeautifulSoup picked the html5lib parser, which wrapped my HTML snippet in html and body tags, and that is not what I wanted.

>>> truncate_html("<p>sdfsdaf</p>", 4)
'<html><head></head><body><p>s</p></body></html>'

To resolve this I told BeautifulSoup to use Python's built-in parser:

from bs4 import BeautifulSoup

def truncate_html(html, length):
    # str() replaces the Python 2 unicode() call so this runs on Python 3.
    return str(BeautifulSoup(html[:length], "html.parser"))

>>> truncate_html("<p>sdfsdaf</p>", 4)
'<p>s</p>'

寄与心 2024-10-24 12:06:10

You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters, not at a number of content characters):

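# BeautifulSoup 4 import; the original imported from the BeautifulSoup 3 package.
from bs4 import BeautifulSoup

def truncate_html(html, length):
    # str() replaces the Python 2 unicode() call.
    return str(BeautifulSoup(html[:length]))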

咿呀咿呀哟 2024-10-24 12:06:10

This will do what you want. It is an easy-to-use HTML parser and bad-markup corrector:

http://www.crummy.com/software/BeautifulSoup/

野稚 2024-10-24 12:06:10

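My initial thought would be to use an XML parser (perhaps Python's SAX parser) and then count the text characters in each XML element. I would leave tag characters out of the count to make it more consistent and simpler, but either approach should be possible.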
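
A possible sketch of this event-driven approach follows. It swaps the SAX parser for the standard library's `html.parser.HTMLParser`, because a fragment like the one in the question has two top-level elements and is not well-formed XML. The names `_TextLimiter` and `truncate_by_text` and the list of void tags are made up for this example. Only text characters are counted, and mismatched closing tags are silently ignored.

```python
from html import escape
from html.parser import HTMLParser

# Tags that never take a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class _TextLimiter(HTMLParser):
    """Collects markup until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit  # text characters we may still emit
        self.parts = []         # output fragments
        self.stack = []         # names of tags that are still open

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it, nothing to track.
        if self.remaining > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only honour closing tags that match the innermost open tag.
        if self.remaining > 0 and self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        kept = data[: self.remaining]
        self.remaining -= len(kept)
        # Entities arrive decoded, so re-escape the text on the way out.
        self.parts.append(escape(kept, quote=False))


def truncate_by_text(html, limit):
    parser = _TextLimiter(limit)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    closers = "".join("</%s>" % name for name in reversed(parser.stack))
    return "".join(parser.parts) + closers


print(truncate_by_text("<h1>This is a header</h1>\n<p>This is a paragraph</p>", 13))
# prints: <h1>This is a hea</h1>
```

Once the limit is reached, every later event is dropped and the tags still on the stack are closed at the end, which is what keeps the output well-formed.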

可爱咩 2024-10-24 12:06:10

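I'd suggest parsing the HTML completely first and then truncating. A great HTML parser for Python is lxml. After parsing and truncating, you can serialize the result back to HTML.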

地狱即天堂 2024-10-24 12:06:10

Check out HTML Tidy to clean up, reformat, or reindent HTML.

About the author: 咆哮
