What markup language is suitable for richly formatted content?

Posted on 2024-07-10 00:20:09

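When you develop a web-based application and want to let users enter richly formatted text, you have to choose how to accept that input. Many different markup languages have sprung up because sanitizing HTML is widely seen as the harder problem.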

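What are the advantages and disadvantages of the various markup languages? Put another way, what factors do you take into account when choosing a particular markup language?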


尹雨沫 2024-07-17 00:20:09

Markdown, BBCode, Textile, MediaWiki markup are all basically the same general concept, so I would really just lump this into two categories: HTML, and plain text markup.

HTML

The deal with HTML is that the content is already in a "presentable" form for the web. That's great: it saves processing time, and it's a readily parseable language. There are dozens of libraries in pretty much any language to handle HTML content, convert between HTML and other formats, and so on. The main downside is that, because of the loose standards of the early web days, HTML can be incredibly variable, and you can't always depend on sane input when accepting HTML from users. As pointed out, tidying or sanitizing HTML is often very difficult, especially because it fails to follow normal markup rules the way XML does (i.e. improperly closed tags are common).

Plain Text Markup

This category is frequently used for the following reasons:

Easy to parse into multiple forms from one source: PDF, HTML, RTF
Content is stored as readable plain text (usually much easier to read than raw HTML) if it is needed at some later date, rather than having to be extracted from the HTML
Follows specific, defined rules, where HTML can be annoyingly variable and unstructured
Allows you to force a subset of content formatting, which in many cases is more appropriate than simply allowing full HTML
Forcing a restricted subset of formatting also makes it easy to sanitize input and prevent cross-site scripting problems

Keeping the "raw" data in an abstracted format means that at a later date, if you wanted for instance to convert your site from HTML 4 to XHTML, you would only need to change the parsing code. With HTML-formatted user input, you're stuck having to convert all the stored HTML to XHTML individually, which, as HTML Tidy shows, is not always a simple task. Similarly, if a new markup language comes along at some point, or you need to move to an alternative format (RTF, PDF, TeX), an abstracted, restricted subset of text formatting options makes that a much simpler task.
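As a rough illustration of the "one source, several outputs" idea, here is a minimal Python sketch. The two-tag, BBCode-like syntax, the RULES table, and the function names to_html and to_plain are all made up for this example; it is not how any real Markdown or BBCode library works. The point is only that the stored text stays abstract, and the output format is decided by the rendering code.

```python
import html
import re

# Hypothetical rule set: markup tag -> HTML element used when rendering.
# Switching output conventions later means editing this table, not the stored posts.
RULES = {"b": "strong", "i": "em"}


def to_html(source):
    """Render the restricted markup as HTML."""
    # Escape first, so any raw HTML typed by the user is shown as text.
    text = html.escape(source)
    for tag, element in RULES.items():
        # Only complete [tag]...[/tag] pairs are converted; a lone opening
        # tag stays as literal text. Mis-nested pairs are NOT handled here.
        pattern = re.compile(rf"\[{tag}\](.*?)\[/{tag}\]", re.DOTALL)
        text = pattern.sub(rf"<{element}>\1</{element}>", text)
    return text


def to_plain(source):
    """Render the same source as plain text by dropping the markup tags."""
    for tag in RULES:
        source = re.sub(rf"\[/?{tag}\]", "", source)
    return source


if __name__ == "__main__":
    post = "[b]Hi[/b] <script>"
    print(to_html(post))
    print(to_plain("[b]Hi[/b] there"))
```

Because the input is escaped before any tag is converted, the only HTML that can appear in the output is what the rule table emits, which is the sanitizing benefit described in the list above.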

The bottom line is what the user input is being used for. If you're planning to keep the data around and may need to shuffle formats, then it makes sense to store the information in a carefully chosen abstract format. If you need to work with the raw data manually for any reason, then bonus points if that format is easily human-readable. If you're only displaying the content in a web page (or an HTML document for a report, etc.) and you have no concerns about converting it or future-proofing it, then storing it as HTML is a reasonable practice.


诠释孤独 2024-07-17 00:20:09

Jeff discussed some of the pros and cons on codinghorror.com while they were in the early stages of putting SO together. I think it's well worth a read.


亂 2024-07-17 00:20:09

@netrox The database isn't the problem; the browser output is.

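The only concern is the final rendering, which can be broken by HTML that users insert. For example, a user can open a tag but never close it, which, depending on the structure of the page, may break the entire layout that follows. Or, as another example, a user can open a bold tag without closing it, turning all the remaining content bold.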
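To make the unclosed-tag problem concrete, here is a small Python sketch built on the standard library's html.parser module. The class name UnclosedTagFinder, the helper unclosed_tags, and the sample tags (div, b) are illustrative choices, not something taken from the answer; the void-tag list is also deliberately incomplete.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class UnclosedTagFinder(HTMLParser):
    """Tracks which tags were opened in a fragment and never closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i]
                break


def unclosed_tags(fragment):
    """Return the tags left open at the end of the fragment, in order."""
    finder = UnclosedTagFinder()
    finder.feed(fragment)
    finder.close()
    return finder.stack


if __name__ == "__main__":
    print(unclosed_tags("<div><b>hello</b>"))
    print(unclosed_tags("text <b>bold forever"))
```

A check like this only reports the problem; deciding whether to reject the post, close the tags automatically, or escape the markup is a separate design choice.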

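So not only do the allowed tags have to be validated, but how exactly do you allow some tags and not others? It is easy to stop all HTML tags from being parsed using, for example, the htmlspecialchars() PHP function, but when it comes to allowing some tags you will have to look for another approach. There is the strip_tags() PHP function, which removes disallowed tags entirely, but that means altering the user's content in a bad way, for example by preventing users from posting simple code (code meant to be shared and displayed, not processed).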
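The difference between escaping and stripping can be sketched in Python. This is an analogy, not the PHP functions themselves: html.escape plays roughly the role of htmlspecialchars(), and strip_all_tags is a deliberately naive, hypothetical regex stand-in for tag stripping that should not be treated as a real sanitizer.

```python
import html
import re


def escape_all(text):
    """Escape every special character so the markup is displayed, not parsed."""
    return html.escape(text)


def strip_all_tags(text):
    """Naively delete anything that looks like a tag (illustration only)."""
    return re.sub(r"<[^>]*>", "", text)


if __name__ == "__main__":
    post = "Write <b>text</b> for bold"
    # Escaping keeps the code sample visible to readers.
    print(escape_all(post))
    # Stripping silently destroys the code the user wanted to show.
    print(strip_all_tags(post))
```

Escaping preserves what the user typed at the cost of allowing no formatting at all; stripping changes the content, which is the drawback described above.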

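Besides breaking the layout, you also have to consider XSS attacks, such as JavaScript inserted into the href attribute of a link, which could, for example, redirect users to another site. See the long list of possible XSS attacks: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet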
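One common mitigation for the href case is to accept only an explicit allow-list of URL schemes. The Python sketch below shows the idea; the function name is_safe_href and the chosen scheme list are assumptions for this example, and a check like this covers only one of the many vectors on the OWASP list.

```python
import re
from urllib.parse import urlsplit

# Assumed allow-list; everything else (javascript:, data:, relative URLs) is rejected.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def is_safe_href(url):
    """Return True only if the URL has an explicitly allowed scheme."""
    # Browsers ignore tabs and newlines inside a URL, so "java\tscript:" still
    # runs. Dropping all whitespace and control characters here is stricter
    # than a browser, which is fine because the result is used only for the check.
    cleaned = re.sub(r"[\x00-\x20]+", "", url)
    try:
        scheme = urlsplit(cleaned).scheme
    except ValueError:
        return False
    return scheme in ALLOWED_SCHEMES


if __name__ == "__main__":
    for candidate in ("https://example.com/page", "JaVaScRiPt:alert(1)", "/relative/path"):
        print(candidate, is_safe_href(candidate))
```

Rejecting relative URLs is a strict choice made here for simplicity; a real application would probably need to allow them in a controlled way.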

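As you can see, preventing all HTML tags from being interpreted is very easy, but blocking only some of them is far more complicated. To get a sense of this, take a look at the huge "HTML Purifier" framework, whose sole purpose is to allow some HTML tags while making sure the output HTML is valid (that is, it won't break the page) and safe from XSS attacks.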
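To give a feel for why allowing only some tags takes real work, here is a heavily simplified allow-list sanitizer in Python using the standard html.parser module. It is a sketch under stated assumptions, not a replacement for HTML Purifier and not production-safe: the tag list, the class AllowListSanitizer, and the function sanitize are invented for this example. It drops every attribute, shows disallowed tags as escaped text, and closes any allowed tag the user left open.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list of harmless inline/block tags.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "code"}


class AllowListSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Re-emit the tag without attributes (no onclick, style, href...).
            self.out.append(f"<{tag}>")
            self.open_tags.append(tag)
        else:
            # Show disallowed markup as visible text instead of deleting it.
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag in self.open_tags:
            # Close everything opened since the matching tag, then the tag itself.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break
        else:
            self.out.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        self.close()
        # Close whatever the user left open so the page layout survives.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(fragment):
    parser = AllowListSanitizer()
    parser.feed(fragment)
    return parser.finish()


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">hi</p>'))
    print(sanitize("<b>bold <img src=x onerror=alert(1)> text"))
```

Even this toy version has to make decisions about attributes, nesting, and unclosed tags, and it still ignores comments, links, and many edge cases, which hints at why a dedicated library ends up so large.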