What markup language is suitable for richly formatted content?
When you are developing a web-based application and you want to allow richly formatted text from the user, you have to choose how to allow that input. Many different markup languages have been created because it is arguably more difficult to sanitize HTML.
What are the advantages and disadvantages of the various different markup languages like:
Or to put it differently, what factors do you consider when choosing a particular markup language?
Comments (4)
Markdown, BBCode, Textile, and MediaWiki markup are all basically the same general concept, so I would really just lump this into two categories: HTML and plain text markup.
HTML
The deal with HTML is that the content is already in a "presentable" form for the web. That's great: it saves processing time, and it's a readily parseable language. There are dozens of libraries in pretty much any language to handle HTML content, convert between HTML and other formats, etc. The main downside is that, because of the loose standards of the early web days, HTML can be incredibly variable, and you can't always depend on sane input when accepting HTML from users. As pointed out, tidying or sanitizing HTML is often very difficult, especially because it fails to follow normal markup rules the way XML does (e.g. improperly closed tags are common).
Plain Text Markup
This category is frequently used for the following reasons:
The bottom line is what the user input is being used for. If you're planning to keep the data around and may need to shuffle formats etc., then it makes sense to use a careful abstract format to store the information. If you need to work with the raw data manually for any reason, then bonus points if that format is easily human-readable. If you're only displaying the content in a web page (or an HTML doc for a report etc.) and you have no concerns about converting it or future-proofing it, then it's a reasonable practice to store it as HTML.
Jeff discussed some pros and cons on codinghorror.com while they were in the initial stages of putting together SO. I thought it was a worthwhile read.
@netrox the database is not the issue, the browser output is.
The only concern is the final rendering, which can be broken by HTML that the user inserts. For example, the user could open a <li> tag but never close it, which, depending on how the page is structured, could break the entire layout that follows. Or the user could open a <strong> tag without closing it, making all the remaining content bold.
So not only must the allowed tags be validated, but how exactly do you allow some tags and not others? It is very easy to prevent all HTML tags from being parsed, for example with PHP's htmlspecialchars() function, but allowing only some of the tags requires another approach. PHP's strip_tags() function removes (completely deletes) non-allowed tags, but that alters the user's content in a bad way, for example by preventing the user from posting simple code (code to share or show, not code to process).
Besides breaking the layout, you must consider XSS attacks, such as inserting JavaScript into the href attribute of a link, which could, for example, redirect users to another site. See this long list of possible XSS attacks: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
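To make the trade-off concrete, here is a small sketch in Python rather than PHP. It assumes html.escape() as a stand-in for htmlspecialchars() and uses two hypothetical helpers, naive_strip_tags() and keep_only_tags(), as rough approximations of tag stripping; they are not the actual strip_tags() implementation, and the sample strings are made up for illustration.

```python
import html
import re

# Made-up user input: some formatting plus a code snippet the user wants to show.
user_input = "Use <strong>bold</strong> like this: <code>if (a < b) { run(); }</code>"

# Option 1: escape everything. Nothing is interpreted as HTML any more,
# and the original text can be recovered exactly.
escaped = html.escape(user_input)


def naive_strip_tags(text: str) -> str:
    """Hypothetical helper: delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", text)


# Option 2: strip tags. The "<" in "a < b" is mistaken for the start of a tag,
# so the rest of the user's code snippet is silently deleted.
stripped = naive_strip_tags(user_input)


def keep_only_tags(text: str, allowed: set) -> str:
    """Hypothetical helper: drop tags whose name is not allowed, keep the rest as-is."""
    def replace(match):
        return match.group(0) if match.group(1).lower() in allowed else ""
    return re.sub(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>", replace, text)


# Allowing a tag by name says nothing about its attributes:
# the javascript: URL passes through untouched.
link = '<a href="javascript:alert(1)">click</a>'
still_dangerous = keep_only_tags(link, {"a"})

print(escaped)
print(stripped)
print(still_dangerous)
```

Escaping is lossless but allows no formatting at all, stripping loses content, and a tag-name allowlist on its own still lets dangerous attributes through.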
As you can see, preventing all HTML tags from being interpreted is very easy, but preventing only some of the tags is much more complicated. To understand why, take a look at the enormous "HTML Purifier" framework, whose only purpose is to allow some HTML tags and make sure that the output HTML is valid (i.e. won't break the page) and free of XSS attacks.
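As a rough idea of what such a tool has to do, here is a minimal allowlist sanitizer sketched in Python with the standard library's HTMLParser. It is an illustration under stated assumptions, not how HTML Purifier works and not something to rely on in production: the class name, the ALLOWED_TAGS and SAFE_SCHEMES sets, and the rule of keeping only http/https href values are all choices made for this example.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumed policy for this sketch only.
ALLOWED_TAGS = {"p", "strong", "em", "code", "ul", "ol", "li", "a"}
SAFE_SCHEMES = {"http", "https"}


class AllowlistSanitizer(HTMLParser):
    """Re-emits allowed tags, escapes all text, and balances open tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text is still emitted, escaped
        attr_text = ""
        if tag == "a":
            href = (dict(attrs).get("href") or "").strip()
            try:
                scheme = urlsplit(href).scheme.lower()
            except ValueError:
                scheme = ""
            # Every attribute is dropped except an http(s) href.
            if scheme in SAFE_SCHEMES:
                attr_text = f' href="{escape(href, quote=True)}"'
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed closing tag
        # Close everything opened after this tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def result(self) -> str:
        self.close()  # flush any buffered text
        while self.open_tags:  # close tags the user left open
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(fragment: str) -> str:
    sanitizer = AllowlistSanitizer()
    sanitizer.feed(fragment)
    return sanitizer.result()


print(sanitize("<strong>bold <li>item"))
print(sanitize('<a href="javascript:alert(1)" onclick="x()">click</a>'))
print(sanitize("<script>alert(1)</script>hi"))
```

Even this toy version needs a tag policy, an attribute policy, a URL policy, and tag balancing, and it still ignores nesting rules, relative links, and most of the evasion tricks in the OWASP list.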
"Many different markup languages have been created because it is arguably more difficult to sanitize HTML."
Really? How is it difficult? There are functions to remove potentially dangerous attributes or tags and validate the HTML before you store it in a database or file. Can you give me examples of how it is difficult to sanitize HTML?