HTML truncation in Python
Is there a pure-Python tool to take some HTML and truncate it as close to a given length as possible, but make sure the resulting snippet is well-formed? For example, given this HTML:
<h1>This is a header</h1>
<p>This is a paragraph</p>
it would not produce:
<h1>This is a hea
but:
<h1>This is a header</h1>
or at least:
<h1>This is a hea</h1>
I can't find one that works, though I found one that relies on pullparser, which is both obsolete and dead.
Comments (8)
I don't think you need a full-fledged parser - you only need to tokenize the input string into one of:
Once you have a stream of tokens like that, it's easy to use a stack to keep track of what tags need closing. I actually ran into this problem a while ago and wrote a small library to do this:
https://github.com/eentzel/htmltruncate.py
It's worked well for me, and handles most of the corner cases well, including arbitrarily nested markup, counting character entities as a single character, returning an error on malformed markup, etc.
It will produce <h1>This is a hea</h1> on your example. This could perhaps be changed, but it's hard in the general case: what if you're trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?
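To make the tokenize-and-stack idea concrete, here is a minimal sketch using only the standard library. It is not the code from htmltruncate.py; the token kinds (tags, character entities, single text characters), the regular expressions, and the function name truncate_html are all assumptions made for illustration.

```python
import re

# Assumed token kinds for this sketch: a tag, a character entity,
# or a single text character (tried in that order).
TOKEN_RE = re.compile(r"<[^>]+>|&#?\w+;|.", re.DOTALL)
TAG_NAME_RE = re.compile(r"<([a-zA-Z][\w-]*)")
# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def truncate_html(html, length):
    """Keep at most `length` visible characters, then close any open tags."""
    out = []
    open_tags = []  # stack of tag names still waiting for a closing tag
    count = 0       # visible characters emitted so far (an entity counts as one)
    for match in TOKEN_RE.finditer(html):
        if count >= length:
            break
        token = match.group(0)
        if token.startswith("</"):
            name = token[2:-1].strip().lower()
            if not open_tags or open_tags[-1] != name:
                raise ValueError("unexpected closing tag: " + token)
            open_tags.pop()
        elif token.startswith("<") and len(token) > 1:
            name_match = TAG_NAME_RE.match(token)
            # Comments and doctypes have no tag name and pass through untouched.
            if name_match and not token.endswith("/>"):
                name = name_match.group(1).lower()
                if name not in VOID_TAGS:
                    open_tags.append(name)
        else:
            count += 1  # one text character or one entity
        out.append(token)
    closing = "".join("</%s>" % name for name in reversed(open_tags))
    return "".join(out) + closing


html = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_html(html, 13))  # <h1>This is a hea</h1>
```

Whitespace between elements counts toward the limit here, and a stray ">" inside an attribute value would confuse the tag pattern, so treat this as a starting point rather than a robust tokenizer.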
If you're using Django, you can simply:
Anyway, here's the code for truncate_html_words from Django:
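For readers without Django at hand, the same idea (count words in the text, pass tags through, close whatever is still open) can be sketched with the standard library's html.parser. This is not Django's code; the class and function names below are hypothetical, and the word-splitting rule (runs of non-whitespace) is an assumption.

```python
import re
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _WordTruncator(HTMLParser):
    """Hypothetical helper: re-emits HTML until max_words words have been seen."""

    def __init__(self, max_words):
        super().__init__()  # convert_charrefs=True: entities arrive decoded
        self.max_words = max_words
        self.words = 0
        self.out = []
        self.open_tags = []  # stack of tags still waiting to be closed
        self.done = max_words <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Closing tags that do not match the top of the stack are ignored.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.done:
            return
        for match in re.finditer(r"\S+", data):
            self.words += 1
            if self.words == self.max_words:
                # Cut right after the last allowed word and stop emitting.
                self.out.append(escape(data[:match.end()], quote=False))
                self.done = True
                return
        self.out.append(escape(data, quote=False))


def truncate_html_words_sketch(html, max_words):
    parser = _WordTruncator(max_words)
    parser.feed(html)
    parser.close()
    closing = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


html = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_html_words_sketch(html, 6))
```

Comments and doctypes are dropped by this sketch, and a word that is split by a tag (as in foo<b>bar</b>) is counted as two words.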
I found the answer by slacy very helpful and would upvote it if I had the reputation; however, there is one extra thing to note. In my environment I had html5lib installed as well as BeautifulSoup4. BeautifulSoup picked the html5lib parser, which wrapped my HTML snippet in html and body tags, and that is not what I wanted.
To resolve this, I told BeautifulSoup to use Python's built-in parser (html.parser):
You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters, not at a number of content characters):
This will serve your requirement. An easy-to-use HTML parser and bad-markup corrector:
http://www.crummy.com/software/BeautifulSoup/
My initial thought would be to use an XML parser (maybe Python's SAX parser), then count the text characters in each XML element. I would leave tag characters out of the count to make it more consistent as well as simpler, but either should be possible.
I'd recommend first completely parsing the HTML and then truncating. A great HTML parser for Python is lxml. After parsing and truncating, you can print it back into HTML format.
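The parse, truncate, serialize flow might look roughly like the sketch below. To stay dependency-free it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it only accepts well-formed (XHTML-style) input; the helper names and the wrapping root element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def _trim(elem, remaining):
    """Trim text under `elem` in place; return the character budget left over."""
    text = elem.text or ""
    elem.text = text[:remaining]
    remaining -= len(elem.text)
    for child in list(elem):  # copy, because children may be removed
        if remaining <= 0:
            elem.remove(child)  # drops the child together with its tail text
            continue
        remaining = _trim(child, remaining)
        tail = child.tail or ""
        child.tail = tail[:remaining]
        remaining -= len(child.tail)
    return remaining


def truncate_tree(fragment, limit):
    # Wrap the fragment so that several top-level elements parse as one tree.
    root = ET.fromstring("<root>%s</root>" % fragment)
    _trim(root, max(limit, 0))
    # Serialize the children only, leaving the artificial root out.
    parts = [escape(root.text or "")]
    parts.extend(ET.tostring(child, encoding="unicode") for child in root)
    return "".join(parts)


html = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_tree(html, 20))
```

Because the whole tree exists before anything is cut, it would also be straightforward to choose a different policy here, such as keeping an element whole once it has been started.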
Look at HTML Tidy to clean up, reformat, and reindent HTML.