Regex and the "war" on XSS

Posted 2024-07-16 23:49:04

I've always been interested in writing web software like forums or blogs, things which take a limited markup and rewrite it into HTML. But lately I've noticed more and more that, for PHP, if you try googling "PHP BBCode parser -PEAR" and test a few of the results, you get either an inefficient mess or poor code with XSS holes here and there.

Taking my previously mentioned example, of the poor BBCode parsers out there, how would you avoid XSS? I'll now take your typical regular expression for handling a link, and you can mention how vulnerable it is and how to avoid it.

// Assume input has already been encoded by htmlspecialchars with ENT_QUOTES
$text = preg_replace('#\[url\](.*?)\[/url\]#i','<a href="\1">\1</a>', $text);
$text = preg_replace('#\[url=(.*?)\](.*?)\[/url\]#i','<a href="\1">\2</a>', $text);

Handling image tags is hardly more secure than this.

So I have several specific questions, mostly specific to PHP implementations.

Is it better practice, in this example, to only match using a URI/URL validation expression? Or is it better to use (.*?) and a callback, then ascertain whether or not the input is a valid link? Obviously, javascript:alert('XSS!') would work in the URL tags above, but would fail if URI matching were done.
What about functions like urlencode() within a callback: would they be any deterrent, or cause any problems (as far as URI standards go)?
Would it be safer to write a full-stack parser? Or, is the time and processing power needed to develop and use such a thing too weighty for something handling several different entries per page?

I know my example is one of many, and is more specific than some. However, don't shirk from providing your own. So, I'm looking for principles and best practices, and general recommendations for XSS-protection in a text-parsing situation.

童话 2024-07-23 23:49:04

test a few out, you either get an inefficient mess, or you get poor code with XSS holes

Hell yeah. I've not met a bbcode implementation yet that wasn't XSS-vulnerable.

'<a href="\1">\1</a>'

No good: fails to HTML-escape ‘<’, ‘&’ and ‘"’ characters.

Is it better practice, in this example, to only match using a uri/url validation expression? Or, is it better to use (.*?) and a callback, then ascertain whether or not the input is a valid link?

I would take the callback. You need the callback anyway to do the HTML-escaping; it's not possible to be secure with only simple string replacement. Drop the sanitisation in whilst you're doing it.

What about functions like urlencode() within a callback

Nearly; actually you need htmlspecialchars(). urlencode() is about encoding query parameters, which isn't what you need here.
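
A minimal sketch of the callback approach, assuming (as in the question) that $text has already been through htmlspecialchars() with ENT_QUOTES. The names safe_href() and render_url_tags() are made up for illustration, and the decision to allow only http and https links is an assumption, not something the question or answer prescribes.

```php
<?php
// Hypothetical helper: returns an attribute-safe href, or null if the URL is rejected.
function safe_href(string $escapedUrl): ?string
{
    // Undo the earlier escaping so the URL can be checked as the user typed it.
    $url = trim(htmlspecialchars_decode($escapedUrl, ENT_QUOTES));

    // Allowlist (an assumption of this sketch): explicit http:// or https:// only,
    // with no whitespace, control characters or square brackets. This rejects
    // javascript: URLs and stops other bbcode tags from hiding inside the href.
    if (!preg_match('#^https?://[^\s\x00-\x1f\[\]]+$#i', $url)) {
        return null;
    }

    // Re-escape for use inside a double-quoted HTML attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// Hypothetical renderer for [url]...[/url] and [url=...]...[/url].
function render_url_tags(string $text): string
{
    return preg_replace_callback(
        '#\[url(?:=(.*?))?\](.*?)\[/url\]#is',
        function (array $m) {
            // $m[1] is '' for the plain [url]...[/url] form.
            $target = $m[1] !== '' ? $m[1] : $m[2];
            $href = safe_href($target);
            if ($href === null) {
                return $m[0]; // leave rejected markup as already-escaped text
            }
            return '<a href="' . $href . '">' . $m[2] . '</a>';
        },
        $text
    );
}
```

With this sketch, [url]javascript:alert('XSS!')[/url] is left as literal text, while an http link containing & comes out with &amp; in the href. It does nothing about misnested tags, so it is only one piece of the puzzle.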

Would it be safer to write a full-stack parser?

Yes.

bbcode is not really amenable to regex parsing, because it's a recursive tag-based language (like XML, which regex also cannot parse). Many bbcode holes are caused by nesting and misnesting problems. For example:

[url]http://www.example.com/[i][/url]foo[/i]

Could come out as something like

<a href="http://www.example.com/<i>">foo</i>

There are many other traps that generate broken code (up to and including XSS holes) on various bbcode implementations.
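
To make the misnesting problem concrete, here is a deliberately naive renderer in the style of the question's regexes. The function name naive_bbcode() and the extra [i] rule are assumptions added for the demonstration; this is an example of what goes wrong, not code to use.

```php
<?php
// Deliberately naive, for demonstration only: one preg_replace per tag.
function naive_bbcode(string $text): string
{
    $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    // The [url] rule puts whatever sits between the tags into the href...
    $text = preg_replace('#\[url\](.*?)\[/url\]#i', '<a href="\1">\1</a>', $text);
    // ...and the [i] rule then matches from inside that href to the final [/i].
    $text = preg_replace('#\[i\](.*?)\[/i\]#i', '<i>\1</i>', $text);
    return $text;
}

echo naive_bbcode('[url]http://www.example.com/[i][/url]foo[/i]'), "\n";
// prints: <a href="http://www.example.com/<i>">http://www.example.com/[i]</a>foo</i>
```

The second replacement has no idea that its opening [i] now lives inside an attribute value, which is exactly how one tag ends up reaching inside another.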

I'm looking for principles and best practices

If you need a bbcode-like language that you can regex, you need to:

reduce the number of possible tags that can be put inside other tags. Arbitrary nesting isn't really possible to support
use special characters for ‘<’ and ‘>’ HTML tag delimiters, to distinguish them from real angle brackets that should appear as such in the text. I use ASCII control codes (having previously filtered any control characters out at the user input stage).
split the string being processed on these control characters and handle each piece separately, so that you never end up letting a bbcode span reach inside a tag or over a tag boundary.
because you can't have bbcode spans reaching through tag boundaries, work from the outside in, doing large block elements first and working inwards to links and finally bold and italic.
for sanity, process a block at a time, e.g. if you're starting a new <p> on a double-newline, no bbcode tags can span between the two separate blocks.
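
The points above can be sketched roughly as follows. This is one possible reading of the technique, not the answerer's actual code: the choice of \x01 and \x02 as stand-ins, the names apply_outside_tags() and render_bbcode(), and the restriction to [url] and [i] are all assumptions made to keep the example short.

```php
<?php
const TAG_OPEN  = "\x01"; // stands in for '<' of generated markup
const TAG_CLOSE = "\x02"; // stands in for '>' of generated markup

// Run $fn only on the text between generated tags, never on the tags themselves.
function apply_outside_tags(string $text, callable $fn): string
{
    // With DELIM_CAPTURE the result alternates: text, tag, text, tag, ...
    $parts = preg_split('/(\x01[^\x02]*\x02)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even indexes are plain text
            $parts[$i] = $fn($part);
        }
    }
    return implode('', $parts);
}

function render_bbcode(string $raw): string
{
    // 1. Strip control characters (tab and newline are kept) so users cannot forge placeholders.
    $text = preg_replace('/[\x00-\x08\x0B-\x1F\x7F]/', '', $raw);
    // 2. Escape everything the user typed.
    $text = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    // 3. Outside in: links first. The URL may not contain brackets or whitespace.
    $text = apply_outside_tags($text, function ($s) {
        return preg_replace_callback(
            '#\[url\](https?://[^\s\[\]]+)\[/url\]#i',
            function ($m) {
                return TAG_OPEN . 'a href="' . $m[1] . '"' . TAG_CLOSE
                    . $m[1] . TAG_OPEN . '/a' . TAG_CLOSE;
            },
            $s
        );
    });
    // 4. Then italics, applied per text segment so a span cannot cross a tag boundary.
    $text = apply_outside_tags($text, function ($s) {
        return preg_replace(
            '#\[i\](.*?)\[/i\]#is',
            TAG_OPEN . 'i' . TAG_CLOSE . '$1' . TAG_OPEN . '/i' . TAG_CLOSE,
            $s
        );
    });
    // 5. Only now turn the placeholders into real angle brackets.
    return strtr($text, [TAG_OPEN => '<', TAG_CLOSE => '>']);
}
```

A side effect of the per-segment rule is that something like [i]a [url]http://example.com/[/url] b[/i] keeps its [i] and [/i] as literal text, because the italic span would have to cross the generated link. That is the "reduce nesting" trade-off from the first point.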

It's still damned hard to get right. A proper parser is much more likely to be watertight.
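
For comparison, a very small stack-based parser might look like the sketch below. It is only an illustration under stated assumptions: the function name parse_bbcode() is made up, and it handles just [b] and [i]. The point is that every opened element goes on a stack, so the output is always properly nested and every piece of user text is escaped exactly once.

```php
<?php
// Hypothetical minimal parser: tokenise, track open tags on a stack, escape all text.
function parse_bbcode(string $raw): string
{
    $allowed = ['b' => 'b', 'i' => 'i']; // bbcode tag => HTML element

    // Odd indexes are tag tokens such as "[b]" or "[/i]", even indexes are text.
    $tokens = preg_split('#(\[/?(?:b|i)\])#i', $raw, -1, PREG_SPLIT_DELIM_CAPTURE);

    $stack = [];
    $out = '';
    foreach ($tokens as $index => $token) {
        if ($index % 2 === 0) {
            $out .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
            continue;
        }
        $name = strtolower(trim($token, '[/]'));
        $isClose = $token[1] === '/';
        if (!$isClose) {
            $stack[] = $name;
            $out .= '<' . $allowed[$name] . '>';
        } elseif (in_array($name, $stack, true)) {
            // Close everything opened since the matching tag, innermost first.
            do {
                $top = array_pop($stack);
                $out .= '</' . $allowed[$top] . '>';
            } while ($top !== $name);
        } else {
            // Stray closing tag with nothing to close: show it as text.
            $out .= htmlspecialchars($token, ENT_QUOTES, 'UTF-8');
        }
    }
    // Close anything the user left open.
    while ($stack) {
        $out .= '</' . $allowed[array_pop($stack)] . '>';
    }
    return $out;
}
```

Given [b]bold [i]both[/b] tail[/i], this produces <b>bold <i>both</i></b> tail[/i]: the misnested [/b] closes the italic first, and the leftover [/i] is shown as plain text rather than emitted as an unbalanced tag.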
