What markup language suits richly formatted content?

Posted on 2024-07-10 00:20:09

When you are developing a web-based application and you want to allow richly formatted text from the user, you have to choose how to accept that input. Many different markup languages have been created because it is arguably more difficult to sanitize HTML.

What are the advantages and disadvantages of the various different markup languages like:

Or to put it differently, what factors do you consider when choosing a particular markup language?

Comments (4)

尹雨沫 2024-07-17 00:20:09

Markdown, BBCode, Textile, and MediaWiki markup are all basically the same general concept, so I would really just split the choice into two categories: HTML and plain text markup.

HTML

The deal with HTML is that the content is already in a "presentable" form for the web. That's great: it saves processing time, and it's a readily parseable language. There are dozens of libraries in pretty much any language to handle HTML content, convert between HTML and other formats, etc. The main downside is that, because of the loose standards of the early web days, HTML can be incredibly variable, and you can't always depend on sane input when accepting HTML from users. As pointed out, tidying or sanitizing HTML is often very difficult, especially because it does not have to follow strict markup rules the way XML does (e.g. improperly closed tags are common).

Plain Text Markup

This category is frequently used for the following reasons:

  • Easy to convert from one source into multiple forms - PDF, HTML, RTF
  • Content is stored as readable plain text (usually much easier to read than raw HTML), so if it is needed at some later date it does not have to be extracted from HTML
  • Follows specific, defined rules, where HTML can be annoyingly variable and unstructured
  • Allows you to enforce a subset of content formatting, which is more appropriate in many cases than simply allowing full HTML
  • Enforcing that subset also makes it easier to sanitize input and prevent cross-site scripting problems etc.
  • Keeping the "raw" data in an abstracted format means that at a later date, if you wanted, for instance, to convert your site from HTML 4 to XHTML, you only need to change the parsing code. With HTML-formatted user input, you're stuck having to convert every piece of stored HTML to XHTML individually, which, as HTML Tidy shows, is not always a simple task. Similarly, if a new markup language comes along at some point or you need to move to an alternative format (RTF, PDF, TeX), an abstracted, restricted subset of text formatting options makes that a much simpler task.
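
To make the "one source, several outputs" point concrete, here is a minimal sketch in Python. The two-rule markup (`**bold**` and `*italic*`) and the function names `render_html` and `render_plain` are made up for illustration; they are not any real markup language's grammar. The point is that the stored text stays abstract, and raw HTML typed by the user is escaped before any tags are generated.

```python
import html
import re

# Hypothetical mini-markup for illustration: **bold** and *italic* only.
_BOLD = re.compile(r"\*\*(.+?)\*\*")
_ITALIC = re.compile(r"\*(.+?)\*")


def render_html(source: str) -> str:
    """Render the restricted markup as HTML."""
    # Escape first, so any raw HTML the user typed becomes inert text.
    text = html.escape(source)
    # Only these two rules can ever produce tags in the output.
    text = _BOLD.sub(r"<strong>\1</strong>", text)
    text = _ITALIC.sub(r"<em>\1</em>", text)
    return text


def render_plain(source: str) -> str:
    """Render the same stored source as plain text by dropping the markers."""
    text = _BOLD.sub(r"\1", source)
    return _ITALIC.sub(r"\1", text)


stored = "**hi** <script>alert(1)</script>"
print(render_html(stored))   # <strong>hi</strong> &lt;script&gt;alert(1)&lt;/script&gt;
print(render_plain("**x** and *y*"))  # x and y
```

Switching the site to a different output format would then mean changing or adding a renderer, not rewriting the stored content.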

The bottom line is what the user input will be used for. If you're planning to keep the data around and may need to shuffle formats etc., then it makes sense to store the information in a carefully chosen abstract format. If you need to work with the raw data manually for any reason, then bonus points if that format is easily human-readable. If you're only displaying the content in a web page (or an HTML document for a report etc.) and you have no concerns about converting it or future-proofing it, then it's a reasonable practice to store it as HTML.

诠释孤独 2024-07-17 00:20:09

Jeff discussed some pros and cons on codinghorror.com while they were in the initial stages of putting together SO. I thought it was a worthwhile read.

2024-07-17 00:20:09

@netrox The database is not the issue; the browser output is.

The only concern is the final rendering, which can be broken by HTML inserted by the user. For example, the user could open a <li> tag but never close it, which, depending on how the page is structured, could break the entire layout that follows. Or, as another example, the user could open a <strong> tag without closing it, making all the remaining content bold.
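
As a rough illustration of how unclosed tags can be detected, here is a sketch built on Python's standard `html.parser` module. The class `UnclosedTagFinder`, the helper `unclosed_tags`, and the short `VOID_TAGS` list are names invented for this example. It only reports tags that were opened and never closed; it does not repair anything and is not a sanitizer.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class UnclosedTagFinder(HTMLParser):
    """Tracks which opened tags have not been closed yet."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recently opened matching tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def unclosed_tags(fragment: str) -> list:
    """Return the tags that were opened in the fragment but never closed."""
    finder = UnclosedTagFinder()
    finder.feed(fragment)
    finder.close()
    return finder.open_tags


print(unclosed_tags("<strong>bold text"))      # ['strong']
print(unclosed_tags("<ul><li>one</li></ul>"))  # []
```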

So not only must the allowed tags be validated, but how exactly do you allow some tags and not others? It is very easy to prevent all HTML tags from being parsed, for example with PHP's htmlspecialchars() function, but when it comes to allowing some of the tags you will have to look for other ways. There is the strip_tags() PHP function, which removes (completely deletes) non-allowed tags, but that means altering the user's content in a bad way, for example by preventing the user from posting simple code (code to share/show, not code to process).
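
The difference between escaping everything and stripping tags can be shown with a short Python sketch. `html.escape` is the standard-library counterpart of PHP's htmlspecialchars(). The regex-based `strip_all_tags` is a deliberately naive stand-in written for this example; it is not equivalent to strip_tags() and should not be treated as a safe sanitizer.

```python
import html
import re

# Naive tag stripper, shown only to illustrate the loss of content.
# A regex like this is NOT a safe way to sanitize HTML.
_TAG = re.compile(r"<[^>]*>")


def escape_all(text: str) -> str:
    """Neutralise every tag: the browser shows the markup as literal text."""
    return html.escape(text)


def strip_all_tags(text: str) -> str:
    """Delete every tag: code the user wanted to show is silently lost."""
    return _TAG.sub("", text)


post = "Write <b>hi</b>"
print(escape_all(post))      # Write &lt;b&gt;hi&lt;/b&gt;
print(strip_all_tags(post))  # Write hi
```

Escaping keeps the user's example visible as text, while stripping changes what the user wrote. Neither approach lets a chosen subset of tags through, which is the hard part.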

Besides breaking the layout, you must consider XSS attacks, such as inserting JavaScript into the href attribute of a link, which could, for example, redirect users to another site. See this long list of possible XSS attacks: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
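
As one small piece of such a defence, a link's href can be checked against an allowlist of URL schemes. The sketch below uses Python's standard `urllib.parse.urlsplit`. The function name `is_safe_href` and the particular set of allowed schemes are assumptions made for this example, and the check covers only the scheme trick, not the many other vectors on that list.

```python
from urllib.parse import urlsplit

# Assumed allowlist for this example; adjust to what the site really needs.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# Browsers ignore leading control characters and spaces in a URL.
_C0_AND_SPACE = "".join(chr(c) for c in range(0x21))


def is_safe_href(href: str) -> bool:
    """Accept relative URLs and URLs whose scheme is on the allowlist."""
    cleaned = href.strip(_C0_AND_SPACE)
    # Browsers also drop tabs and newlines embedded inside the URL.
    cleaned = cleaned.replace("\t", "").replace("\r", "").replace("\n", "")
    try:
        scheme = urlsplit(cleaned).scheme  # urlsplit lowercases the scheme
    except ValueError:
        return False  # malformed URL: reject rather than guess
    return scheme == "" or scheme in ALLOWED_SCHEMES


print(is_safe_href("javascript:alert(1)"))       # False
print(is_safe_href("https://example.com/page"))  # True
```

Note that a relative or protocol-relative URL has an empty scheme and passes this check, so it says nothing about where a link points, only that it cannot run script through its scheme.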

As you can see, preventing all HTML tags from being interpreted is very easy, but preventing only some of the tags is much more complicated. To understand that, you could take a look at the enormous "HTML Purifier" framework, whose only purpose is to allow some HTML tags and make sure that the resulting HTML is valid (i.e. won't break the page) and free of XSS attacks.

凯凯我们等你回来 2024-07-17 00:20:09

    "Many different markup languages have been created because it is arguably more difficult to sanitize HTML."

Really? How is it difficult? There are functions to remove potentially dangerous attributes or tags and to validate the HTML before you store it in a database or file. Can you give me examples of how it is difficult to sanitize HTML?
