Why call mb_convert_encoding to sanitize text?

Posted on 2024-08-04 16:54:33

This is in reference to this (excellent) answer. He states that the best solution for escaping input in PHP is to call mb_convert_encoding followed by htmlentities.

But why exactly would you call mb_convert_encoding with the same "to" and "from" encoding (UTF-8 for both)?

Excerpt from the original answer:

Even if you use htmlspecialchars($string) outside of HTML tags, you are still vulnerable to multi-byte charset attack vectors.
The most effective approach is to use a combination of mb_convert_encoding and htmlentities as follows.
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
$str = htmlentities($str, ENT_QUOTES, 'UTF-8');

Does this have some sort of benefit I'm missing?

柠檬色的秋千 2024-08-11 16:54:33

Not all binary data is valid UTF-8. Invoking mb_convert_encoding with the same from/to encoding is a simple way to ensure that you are dealing with a correctly encoded string for the given encoding.
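
As a minimal sketch of that point (assuming PHP with the mbstring extension loaded; the `$valid` and `$invalid` sample strings are made up for illustration and are not from the original answer), `mb_check_encoding` reports whether a byte string is well-formed, and the same-encoding `mb_convert_encoding` call returns a string that is:

```php
<?php
// Illustrative inputs, not taken from the original answer.
$valid   = "caf\xC3\xA9";  // "café": C3 A9 is a well-formed 2-byte sequence
$invalid = "caf\xE9";      // a lone E9 byte (Latin-1 "é") is not valid UTF-8

var_dump(mb_check_encoding($valid, 'UTF-8'));    // well-formed input
var_dump(mb_check_encoding($invalid, 'UTF-8'));  // ill-formed input

// Converting UTF-8 to UTF-8 leaves valid input untouched and rewrites
// ill-formed bytes (how exactly depends on mb_substitute_character()).
$cleaned   = mb_convert_encoding($invalid, 'UTF-8', 'UTF-8');
$unchanged = mb_convert_encoding($valid, 'UTF-8', 'UTF-8') === $valid;

var_dump(mb_check_encoding($cleaned, 'UTF-8'));
var_dump($unchanged);
```

The exact bytes that end up in `$cleaned` depend on the mbstring configuration, so the sketch only checks that the result is well-formed.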

A way to exploit the omission of UTF-8 validation is described in section 6 (Security Considerations) of RFC 2279:

Another example might be a parser which
prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
illegal octet sequence 2F C0 AE 2E 2F.

This may be more easily understood by examining the binary representation:

110xxxxx 10xxxxxx # header bits used by the encoding
11000000 10101110 # C0 AE
         00101110 #    2E the '.' character

In other words: (C0 AE - header-bits) == '.'
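
The arithmetic above can be sketched in a few lines. The function `naive_decode_two_bytes` is hypothetical and deliberately broken: it stands in for a decoder that strips the header bits without rejecting overlong forms, which is the kind of parser the RFC warns about.

```php
<?php
// Hypothetical, deliberately naive decoder for a 2-byte UTF-8 sequence:
// it strips the header bits and never checks for overlong forms.
function naive_decode_two_bytes(int $lead, int $cont): int
{
    return (($lead & 0x1F) << 6) | ($cont & 0x3F);
}

$raw = "/\xC0\xAE./";  // octets 2F C0 AE 2E 2F from the RFC example

// A byte-level filter for "/../" sees nothing wrong with the raw input.
$passesFilter = strpos($raw, "/../") === false;

// The naive decoder then turns C0 AE into code point 0x2E, i.e. '.'.
$codePoint = naive_decode_two_bytes(0xC0, 0xAE);
$decoded   = "/" . chr($codePoint) . "./";

var_dump($passesFilter);
printf("U+%04X\n", $codePoint);
var_dump($decoded === "/../");
```

The input slips past the filter as raw bytes and only becomes "/../" after the lenient decoding step, which is why validation has to happen before any such check.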

As the quoted text points out, C0 AE is not a valid UTF-8 octet sequence, so mb_convert_encoding would remove it from the string (or translate it to '.', or to something else :-).
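
To see what a given installation actually does with that sequence, a sketch along these lines may help. It assumes a current PHP build with mbstring enabled; the replacement bytes are printed rather than assumed, since they depend on the `mb_substitute_character()` setting.

```php
<?php
$raw = "/\xC0\xAE./";  // the ill-formed sequence from the RFC example

// The two-step pipeline quoted in the question.
$clean = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
$safe  = htmlentities($clean, ENT_QUOTES, 'UTF-8');

$rawIsValid   = mb_check_encoding($raw, 'UTF-8');    // raw input is ill-formed
$cleanIsValid = mb_check_encoding($clean, 'UTF-8');  // converted string is well-formed
$noTraversal  = strpos($clean, "/../") === false;    // C0 AE did not become '.'

var_dump($rawIsValid, $cleanIsValid, $noTraversal);
echo bin2hex($clean), "\n";  // inspect what replaced C0 AE
echo $safe, "\n";
```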
