How can I get the first 300 characters of valid HTML or Markdown?
I'm creating a blog (and the rest of a website) using Python and Flask. Blog posts are written in Markdown and converted to HTML using the creatively named Markdown in Python. Both the Markdown (for future editing) and the HTML (for display) are stored in the database.
I want to be able to automatically get the first 300 characters of text (or 500, or 200 — I haven't worked out the number) to use on pages when I don't want to display the full blog post (like on the front page). However, the problem is that any simple way of doing it will potentially leave me with invalid HTML or Markdown:
HTML:
<p><em>Here</em> is <strong>formatted</strong> text.</p>
If I get the first ten characters of this, it will leave me halfway through "formatted", and I would somehow need to close the <strong> and <p> tags.
Markdown:
*Here* is **formatted** text.
Likewise, getting the first ten characters will leave me needing to close the ** for bold.
Is there any way I can do this without needing to write an HTML or Markdown parser? Or would I be better off just converting the HTML into plain text?
Comments (4)
If you're okay with summaries just being plain text, then Adam's answer is certainly the best -- convert to plain text first, and then truncate.
If you want to maintain formatting, then here's another idea:
If you were doing this with arbitrary HTML then you would have a lot of weird things to worry about, but since you're coming from Markdown it should actually work pretty well. Any decent Markdown converter should generate well-formed HTML with a fairly small number of tags in it.
Indeed, the easiest and safest method would be to generate HTML from the Markdown source, convert it to plain text (see html2plaintext), and then trim it down to 300 characters.
A more efficient method might be to modify the Markdown parser to output only the first 300 characters of all the text nodes, but I really don't think the performance benefits justify the modifications.
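For what it's worth, here is a minimal sketch of the "convert to plain text, then trim" route. It assumes the stored HTML is the input and uses the standard library's `html.parser` in place of html2plaintext; `plain_summary` and `_TextExtractor` are names made up for this example. It also relies on the converter leaving whitespace between block elements, otherwise adjacent paragraphs would run together.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_summary(html, limit=300):
    """Strip tags, collapse whitespace, and cut the text to at most `limit` characters."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    text = " ".join("".join(extractor.parts).split())
    if len(text) <= limit:
        return text
    # Cut at the last space inside the limit so a word is not split in half.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."


sample = "<p><em>Here</em> is <strong>formatted</strong> text.</p>"
print(plain_summary(sample, 10))
```

The summary contains no markup at all, so there is nothing left to close; the cost is that emphasis and links are lost.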
I don't know if it is applicable in Python, but this tutorial may help you. Basically, it scans for unclosed tags after the text is trimmed and closes them automatically.
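The tutorial itself is not reproduced here, so the following is only a rough Python sketch of the same general idea, not a port of it. Instead of cutting the raw string and repairing it afterwards, it counts text characters while walking the tags, stops emitting once the limit is reached, and then closes whatever is still open. `truncate_html`, `_Truncator` and `VOID_TAGS` are hypothetical names, and the sketch assumes reasonably well-formed input such as a Markdown converter's output.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Hypothetical helper: re-emits HTML until `limit` text characters are seen."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        chunk = data[: self.remaining]
        self.out.append(escape(chunk, quote=False))
        self.remaining -= len(chunk)
        if self.remaining <= 0:
            self.done = True  # ignore the rest of the document


def truncate_html(html, limit=300):
    """Return `html` cut to `limit` text characters with all open tags closed."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = "<p><em>Here</em> is <strong>formatted</strong> text.</p>"
print(truncate_html(sample, 10))  # <p><em>Here</em> is <strong>fo</strong></p>
```

Note that whitespace between block elements counts toward the limit in this sketch, so the visible text may come out slightly shorter than the number you ask for.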
Use an evented parser, ignore non text events, capture text events until you reach 300 characters, then stop parsing.
libxml supports event-based parsing of HTML. I'm sure there is one for Markdown, but I haven't looked.
You should measure, though, to make sure the performance benefit is worth the added complexity.
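As an illustration of the event-based approach, here is a small sketch that uses Python's standard-library `html.parser` rather than libxml (that swap is my assumption, chosen to keep the example dependency-free). Text events are collected until the limit is reached, and parsing is then abandoned by raising a private exception. `first_chars`, `_FirstChars` and `_EnoughText` are hypothetical names.

```python
from html.parser import HTMLParser


class _EnoughText(Exception):
    """Raised internally to abandon parsing once the limit is reached."""


class _FirstChars(HTMLParser):
    """Hypothetical helper: keeps text events only, up to `limit` characters."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0
        self.parts = []

    def handle_data(self, data):
        take = data[: self.limit - self.count]
        self.parts.append(take)
        self.count += len(take)
        if self.count >= self.limit:
            raise _EnoughText  # stop parsing; the rest of the input is never scanned


def first_chars(html, limit=300):
    """Return the first `limit` characters of text in `html`, ignoring all tags."""
    parser = _FirstChars(limit)
    try:
        parser.feed(html)
        parser.close()  # flush text buffered at the end of the input
    except _EnoughText:
        pass
    return "".join(parser.parts)


sample = "<p><em>Here</em> is <strong>formatted</strong> text.</p>"
print(first_chars(sample, 10))  # Here is fo
```

For a typical blog post the saving from stopping early is likely to be small, which is why measuring first is sensible.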