Why is so much HTML input sanitization necessary?
I have implemented a search engine in C for my HTML website. My entire web is programmed in C.
I understand that HTML input sanitization is necessary because an attacker can enter these two HTML snippets into my search page to trick it into downloading and displaying foreign images or scripts (XSS):
<img src="path-to-attack-site"/>
<script>...xss-code-here...</script>
Wouldn't these attacks be prevented simply by searching for '<' and '>' and stripping them from the search query? Wouldn't that render both snippets useless, since they would no longer be treated as HTML? I've seen HTML filtering that goes way beyond this, filtering absolutely all JavaScript commands and HTML markup!
Answers (2)
Input sanitisation is not inherently ‘necessary’.
It is a good idea to remove things like control characters that you never want in your input, and certainly for specific fields you'll want specific type-checking (so that, e.g., a phone number contains digits).
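As a rough illustration of that kind of field-level checking, here is a minimal C sketch. The function names and the rule for what counts as a phone number (digits, spaces, hyphens, parentheses and an optional leading '+') are assumptions made up for this example, not anything from the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: removes control characters (as reported by
 * iscntrl) from s in place and returns how many bytes were dropped. */
size_t remove_control_chars(char *s)
{
    assert(s != NULL);
    char *out = s;
    size_t removed = 0;
    for (const char *in = s; *in != '\0'; in++) {
        if (iscntrl((unsigned char)*in))
            removed++;
        else
            *out++ = *in;
    }
    *out = '\0';
    return removed;
}

/* Hypothetical type check for a phone-number field. The accepted
 * character set is an assumption for illustration only.
 * Returns 1 if s looks like a phone number, 0 otherwise. */
int looks_like_phone_number(const char *s)
{
    assert(s != NULL);
    size_t digits = 0;
    if (*s == '+')              /* optional international prefix */
        s++;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (isdigit(c))
            digits++;
        else if (c != ' ' && c != '-' && c != '(' && c != ')')
            return 0;           /* anything else is rejected */
    }
    return digits > 0;          /* at least one digit is required */
}
```

Note that this is validation of what a field means, not XSS protection: the search query itself would normally get only the control-character cleanup.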
But running escaping/stripping functions across all form input for the purpose of defeating cross-site-scripting attacks is absolutely the wrong thing to do. It is sadly common, but it is neither necessary nor in many cases sufficient to protect against XSS.
HTML-escaping is an output issue which must be tackled at the output stage: that is, usually at the point you are templating strings into the output HTML page. Escape < to &lt;, & to &amp;, and in attribute values escape the quote you're using as an attribute delimiter, and that's it. No HTML-injection is possible.
If you try to HTML-escape or filter at the form input stage, you're going to have difficulty whenever you output data that has come from a different source, and you're going to be mangling user input that happens to include <, & and " characters.
And there are other forms of escaping. If you try to create an SQL query with the user value in it, you need to do SQL string literal escaping at that point, which is completely different from HTML escaping. If you want to put a submitted value in a JavaScript string literal, you have to do JSON-style escaping, which is again completely different. If you want to put a value in a URL query string parameter, you need URL-escaping, not HTML-escaping. The only sensible way to cope with this is to keep your strings as plain text and escape them only at the point you output them into a different context like HTML.
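Since the site is written in C, output-stage escaping might look roughly like the sketch below. The names html_escape and url_encode, the buffer-based interface and the return-code convention are all assumptions for illustration. The HTML version also escapes > and the single quote, which goes slightly beyond the minimum described above but is harmless.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: HTML-escapes src into dst for use in element
 * content or a quoted attribute value.
 * Returns 0 on success, -1 if dst is too small. */
int html_escape(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    size_t used = 0;
    for (; *src != '\0'; src++) {
        char single[2] = { *src, '\0' };
        const char *rep;
        switch (*src) {
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '&':  rep = "&amp;";  break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#39;";  break;
        default:   rep = single;   break;
        }
        size_t len = strlen(rep);
        if (used + len + 1 > dst_size)  /* keep room for the terminator */
            return -1;
        memcpy(dst + used, rep, len);
        used += len;
    }
    if (dst_size == 0)
        return -1;
    dst[used] = '\0';
    return 0;
}

/* Hypothetical helper: percent-encodes src for a URL query parameter.
 * Only ASCII letters, digits and - _ . ~ are passed through unchanged
 * (this assumes the default "C" locale for isalnum).
 * Returns 0 on success, -1 if dst is too small. */
int url_encode(const char *src, char *dst, size_t dst_size)
{
    static const char hex[] = "0123456789ABCDEF";
    assert(src != NULL && dst != NULL);
    size_t used = 0;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            if (used + 2 > dst_size)
                return -1;
            dst[used++] = (char)c;
        } else {
            if (used + 4 > dst_size)
                return -1;
            dst[used++] = '%';
            dst[used++] = hex[c >> 4];
            dst[used++] = hex[c & 0x0F];
        }
    }
    if (used + 1 > dst_size)
        return -1;
    dst[used] = '\0';
    return 0;
}
```

The stored search query stays as plain text. The same string goes through html_escape when it is written into the results page and through url_encode when it is put into a link, so each output context gets its own escaping.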
As for whether simply stripping '<' and '>' from the search query would prevent these attacks: well yes, if you also stripped ampersands and quotes. But then users wouldn't be able to use those characters in their content. Imagine us trying to have this conversation on SO without being able to use <, & or "! And if you wanted to strip out every character that might be special when used in some context (HTML, JavaScript, CSS...) you'd have to disallow almost all punctuation!
< is a valid character, which the user should be permitted to type, and which should come out on the page as a literal less-than sign.
As for "My entire web is programmed in C": I'm so sorry.
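To make the cost of stripping concrete, here is a small sketch of the filter proposed in the question. The function name strip_angle_brackets is hypothetical, and the second case assumes the page echoes the query back inside a double-quoted attribute such as value="...".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input filter: removes every '<' and '>' from s in place
 * and returns the number of characters removed. */
size_t strip_angle_brackets(char *s)
{
    assert(s != NULL);
    char *out = s;
    size_t removed = 0;
    for (const char *in = s; *in != '\0'; in++) {
        if (strchr("<>", *in) != NULL)
            removed++;          /* drop the bracket */
        else
            *out++ = *in;       /* keep everything else */
    }
    *out = '\0';
    return removed;
}
```

Given the legitimate query "price < 10 && size > 3", the filter leaves "price  10 && size  3", so the user's meaning is lost. Given the input " onmouseover="alert(1), which contains no angle brackets at all, the filter changes nothing, so a value placed unescaped into a quoted attribute can still break out of it.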
Encoding brackets is indeed sufficient in most cases to prevent XSS, as anything between tags will then display as plain-text.