How do I get the first 300 characters of valid HTML or Markdown?

Posted on 2024-09-25 15:21:12

I'm creating a blog (and the rest of a website) using Python and Flask. Blog posts are written in Markdown and converted to HTML using the creatively named Markdown in Python. Both the Markdown (for future editing) and the HTML (for display) are stored in the database.

I want to be able to automatically get the first 300 characters of text (or 500, or 200 — I haven't worked out the number) to use on pages when I don't want to display the full blog post (like on the front page). However, the problem is that any simple way of doing it will potentially leave me with invalid HTML or Markdown:

HTML:

<p><em>Here</em> is <strong>formatted</strong> text.</p>

If I take the first ten characters of this, I end up partway through "formatted", and I would somehow need to close the <strong> and <p> tags.

Markdown:

*Here* is **formatted** text.

Likewise, getting the first ten characters will leave me needing to close the ** for bold.

Is there any way I can do this without needing to write an HTML or Markdown parser? Or would I be better off just converting the HTML into plain text?

清泪尽 2024-10-02 15:21:12

If you're fine with the summary being plain text only, then Adam's answer is definitely the best: convert to plain text first, then truncate.
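
A minimal sketch of the plain-text route, assuming the stored HTML is available as a string and using only Python's standard `html.parser` module. The names `TextExtractor` and `plain_summary` are made up for illustration, and this is not necessarily how Adam's answer does it:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_summary(html, limit=300):
    """Strip the tags from `html`, then cut the text to at most `limit` characters."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (newlines between blocks, etc.) into single spaces
    text = " ".join("".join(extractor.parts).split())
    return text[:limit]


print(plain_summary("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
```

Because the result is plain text, it has to be escaped again (or left to the template engine's auto-escaping) when it is placed back into a page.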

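If you want to keep the formatting, here is another idea:

1. Convert the Markdown to HTML.
2. Run the HTML through a parser that gives you a token stream (e.g. Perl's HTML::TokeParser::Simple, but I'm sure Python has something similar, or you can turn any event-based parser into one).
3. As you get element tokens, copy them to the output while maintaining a stack of unclosed tags.
4. As you get text tokens, copy them to the output while keeping a count of how much text has been output.
5. When you hit a text token that would go over the limit, copy only enough characters to reach the limit, generate closing tags for any unclosed tags on the stack, and stop processing.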

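If you were doing this with arbitrary HTML, you would have a lot of weird things to worry about, but since you're starting from Markdown, it should actually work pretty well. Any decent Markdown converter should generate well-formed HTML with a fairly small set of tags.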
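
The token-stream idea in this answer could be sketched in Python roughly as follows, using the standard library's event-based `html.parser.HTMLParser` in place of the Perl module mentioned. The names `Truncator`, `truncate_html`, and `VOID_TAGS` are hypothetical, and the sketch assumes well-formed HTML of the kind a Markdown converter produces:

```python
from html import escape
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Copies HTML to `out` until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0    # text characters written so far
        self.out = []     # output fragments
        self.stack = []   # names of tags opened but not yet closed
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # original tag text, attributes included
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: copy it, nothing to track
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only close what is on top of the stack; stray end tags are dropped
        if not self.done and self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        remaining = self.limit - self.count
        if len(data) >= remaining:
            # This text token reaches the limit: copy just enough, then stop
            self.out.append(escape(data[:remaining], quote=False))
            self.finish()
        else:
            self.out.append(escape(data, quote=False))
            self.count += len(data)

    def finish(self):
        """Emit closing tags for everything still open and stop copying."""
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        self.done = True


def truncate_html(html, limit=300):
    parser = Truncator(limit)
    parser.feed(html)
    parser.close()   # flush any buffered text
    parser.finish()  # no-op for well-formed input that ended under the limit
    return "".join(parser.out)


print(truncate_html("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# prints <p><em>Here</em> is <strong>fo</strong></p>
```

The count is taken on decoded text, so an entity such as `&amp;` counts as one character and is re-escaped on output. Whitespace between block elements also counts toward the limit, and comments are dropped, which seems acceptable for a front-page excerpt.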
