How can I get the first 300 characters of valid HTML or Markdown?

Posted on 2024-09-25 15:21:12

I'm creating a blog (and the rest of a website) using Python and Flask. Blog posts are written in Markdown and converted to HTML using the creatively named Markdown in Python. Both the Markdown (for future editing) and the HTML (for display) are stored in the database.

I want to be able to automatically get the first 300 characters of text (or 500, or 200 — I haven't worked out the number) to use on pages when I don't want to display the full blog post (like on the front page). However, the problem is that any simple way of doing it will potentially leave me with invalid HTML or Markdown:

HTML:

<p><em>Here</em> is <strong>formatted</strong> text.</p>

If I get the first ten characters of this, it will leave me halfway through "formatted", and I would somehow need to close the <strong> and <p> tags.

Markdown:

*Here* is **formatted** text.

Likewise, getting the first ten characters will leave me needing to close the ** for bold.

Is there any way I can do this without needing to write an HTML or Markdown parser? Or would I be better off just converting the HTML into plain text?

Comments (4)

清泪尽 2024-10-02 15:21:12

If you're okay with summaries just being plain text, then Adam's answer is certainly the best -- convert to plain text first, and then truncate.

If you want to maintain formatting, then here's another idea:

  • Convert from Markdown to HTML.
  • Run through the HTML with a parser of the sort that will give you a token stream (e.g. Perl's HTML::TokeParser::Simple, but I'm sure there's something comparable for Python -- or you can turn any event-based parser into one of these).
  • When you get element tokens, copy them to the output, while maintaining a stack of unclosed tags.
  • When you get text tokens, copy them to the output, while maintaining a count of the amount of text you've outputted.
  • When you get to a text token that would put you over the limit, copy only enough characters to reach the limit, generate closing tags for any unclosed tags on your stack, and stop processing.

If you were doing this with arbitrary HTML then you would have a lot of weird things to worry about, but since you're coming from markdown it should actually work pretty well. Any decent markdown converter should generate well-formed HTML with a fairly small number of tags in it.
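
The steps above can be sketched in Python with the standard library's `html.parser.HTMLParser`, which is event-based rather than a true token stream. This is only a minimal sketch, assuming well-formed HTML of the kind a Markdown converter emits; the names `truncate_html`, `_Truncator`, and `VOID_TAGS` are made up for illustration and are not part of any library.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class _Truncator(HTMLParser):
    """Hypothetical helper: copies tags and text until `limit` text characters."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0     # text characters written so far
        self.out = []      # output fragments
        self.stack = []    # tags opened but not yet closed
        self.done = False  # set once the limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # keep attributes verbatim
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: copy it, nothing to push.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done:
            return
        # Only accept an end tag that matches the innermost open element.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        room = self.limit - self.count
        if len(data) >= room:
            data = data[:room]
            self.done = True
        self.count += len(data)
        # Text arrives unescaped, so escape it again before writing it out.
        self.out.append(escape(data, quote=False))
        if self.done:
            # Close whatever is still open, innermost element first.
            self.out.extend(f"</{t}>" for t in reversed(self.stack))


def truncate_html(html, limit=300):
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(truncate_html("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# prints: <p><em>Here</em> is <strong>fo</strong></p>
```

Note that whitespace between block elements counts toward the limit in this sketch, and the cut can fall in the middle of a word; both would need extra handling in real use.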

遮了一弯 2024-10-02 15:21:12

Indeed, the easiest and safest method would be to generate HTML from the Markdown source, convert it to plain text (see html2plaintext), and then trim it down to 300 characters.

A more efficient method might be to modify the Markdown parser to output only the first 300 characters of all the text nodes, but I really don't think the performance benefits justify the modifications.
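
As a rough illustration of the plain-text route, the sketch below strips tags with the standard library's `html.parser` instead of html2plaintext (a substitution made here so the example has no dependencies), then trims at a word boundary. The names `plain_summary` and `_TextExtractor` and the `ellipsis` parameter are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_summary(html, limit=300, ellipsis="..."):
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = " ".join("".join(extractor.parts).split())
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the last space inside the cut so the final word is whole.
    head, sep, _ = cut.rpartition(" ")
    return (head if sep else cut) + ellipsis


print(plain_summary("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# prints: Here is...
```

The result is plain text, not HTML, so it still has to be escaped when it is rendered in a page.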

没有心的人 2024-10-02 15:21:12

I don't know if it's applicable in Python, but this tutorial may help you. Basically, it scans for unclosed tags after the text is trimmed and auto-closes them.

天生の放荡 2024-10-02 15:21:12

Use an event-based parser, ignore non-text events, capture text events until you reach 300 characters, then stop parsing.

libxml supports event-based parsing of HTML. I'm sure there is one for Markdown, but I haven't looked.

You should measure though to make sure the performance benefit is worth the added complexity.
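
A minimal sketch of the early-stop idea, using Python's built-in `html.parser` rather than libxml (an assumption made to keep the example dependency-free). Since `HTMLParser.feed` has no built-in way to stop, the sketch abandons parsing by raising a private exception; `first_chars`, `_FirstChars`, and `_Enough` are hypothetical names.

```python
from html.parser import HTMLParser


class _Enough(Exception):
    """Raised internally to abandon parsing once the limit is reached."""


class _FirstChars(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0
        self.parts = []

    def handle_data(self, data):
        # Tag events are ignored by default; only text events land here.
        room = self.limit - self.count
        self.parts.append(data[:room])
        self.count += min(len(data), room)
        if self.count >= self.limit:
            raise _Enough  # skip the rest of the document


def first_chars(html, limit=300):
    parser = _FirstChars(limit)
    try:
        parser.feed(html)
        parser.close()
    except _Enough:
        pass
    return "".join(parser.parts)


print(first_chars("<p><em>Here</em> is <strong>formatted</strong> text.</p>", 10))
# prints: Here is fo
```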
