Does HttpUtility.UrlEncode conform to the "x-www-form-urlencoded" specification?

Posted on 2024-09-08 09:51:04

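URLEncode converts characters as follows:
Spaces ( ) are converted to plus signs (+).
Non-alphanumeric characters are escaped to their hexadecimal representation.

application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in RFC1738, section 2.2: non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.

My question is, has anyone done the work to determine whether URLEncode produces valid x-www-form-urlencoded data?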


倾听心声的旋律 2024-09-15 09:51:04

Well, the documentation you linked to is for IIS 6 Server.UrlEncode, but your title seems to ask about .NET System.Web.HttpUtility.UrlEncode. Using a tool like Reflector, we can see the implementation of the latter and determine if it meets the W3C spec.

Here is the encoding routine that is ultimately called (note, it is defined for an array of bytes, and other overloads that take strings eventually convert those strings to byte arrays and call this method). You would call this for each control name and value (to avoid escaping the reserved characters = & used as separators).

protected internal virtual byte[] UrlEncode(byte[] bytes, int offset, int count)
{
    if (!ValidateUrlEncodingParameters(bytes, offset, count))
    {
        return null;
    }
    int num = 0;
    int num2 = 0;
    for (int i = 0; i < count; i++)
    {
        char ch = (char) bytes[offset + i];
        if (ch == ' ')
        {
            num++;
        }
        else if (!HttpEncoderUtility.IsUrlSafeChar(ch))
        {
            num2++;
        }
    }
    if ((num == 0) && (num2 == 0))
    {
        return bytes;
    }
    byte[] buffer = new byte[count + (num2 * 2)];
    int num4 = 0;
    for (int j = 0; j < count; j++)
    {
        byte num6 = bytes[offset + j];
        char ch2 = (char) num6;
        if (HttpEncoderUtility.IsUrlSafeChar(ch2))
        {
            buffer[num4++] = num6;
        }
        else if (ch2 == ' ')
        {
            buffer[num4++] = 0x2b;
        }
        else
        {
            buffer[num4++] = 0x25;
            buffer[num4++] = (byte) HttpEncoderUtility.IntToHex((num6 >> 4) & 15);
            buffer[num4++] = (byte) HttpEncoderUtility.IntToHex(num6 & 15);
        }
    }
    return buffer;
}

public static bool IsUrlSafeChar(char ch)
{
    if ((((ch >= 'a') && (ch <= 'z')) || ((ch >= 'A') && (ch <= 'Z'))) || ((ch >= '0') && (ch <= '9')))
    {
        return true;
    }
    switch (ch)
    {
        case '(':
        case ')':
        case '*':
        case '-':
        case '.':
        case '_':
        case '!':
            return true;
    }
    return false;
}

The first part of the routine counts the characters that need to be replaced (spaces and non-URL-safe characters). Spaces are replaced one for one, so only the non-URL-safe characters enlarge the output, by two bytes each. The second part of the routine allocates a new buffer and performs the replacements:

Url Safe Characters are kept as is: a-z A-Z 0-9 ()*-._!
Spaces are converted to plus signs
All other characters are converted to %HH

RFC1738 states (emphasis mine):

Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.

The set of URL-safe characters allowed by UrlEncode is a subset of the special characters defined in RFC1738. Namely, the characters $ ' and , are absent from IsUrlSafeChar and will be encoded by UrlEncode even though the spec says they are safe. (The + character is also percent-encoded, which is required here because a literal + stands for a space.) Since these characters may be used unencoded (and not must), it still meets the spec to encode them (and the second paragraph states that explicitly).
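The comparison with the RFC1738 specials can be checked empirically rather than by reading the decompiled source. The following is a minimal sketch, assuming System.Web.HttpUtility is available on the runtime in use; the class and method names (SpecialCharacterCheck, EncodedSpecials) are hypothetical and introduced only for illustration.

```csharp
using System.Text;
using System.Web; // HttpUtility; on .NET Framework this requires a reference to System.Web

public static class SpecialCharacterCheck
{
    // The special characters RFC1738 allows unencoded, as quoted above.
    private const string Rfc1738Specials = "$-_.+!*'(),";

    // Returns, in order, the RFC1738 specials that UrlEncode changes.
    public static string EncodedSpecials()
    {
        var changed = new StringBuilder();
        foreach (char c in Rfc1738Specials)
        {
            string s = c.ToString();
            // A character counts as "encoded" if the output differs from the input.
            if (HttpUtility.UrlEncode(s) != s)
            {
                changed.Append(c);
            }
        }
        return changed.ToString();
    }
}
```

Calling Console.WriteLine(SpecialCharacterCheck.EncodedSpecials()) lists the characters that the installed runtime encodes despite RFC1738 treating them as safe.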

With respect to line breaks, if the input has a CR LF sequence then it will be escaped as %0D%0A. However, if the input has only LF then it will be escaped as %0A, so this routine does not normalize line breaks.
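Because the routine does not normalize line breaks, a caller that wants spec-conforming output can do so before encoding. This is a minimal sketch under that assumption; FormValueEncoder, NormalizeLineBreaks and Encode are hypothetical names, not part of the framework.

```csharp
using System.Text.RegularExpressions;
using System.Web; // HttpUtility; on .NET Framework this requires a reference to System.Web

public static class FormValueEncoder
{
    // Rewrites CR LF pairs, lone CR and lone LF as CR LF.
    // The pair is listed first in the alternation so it is matched as a unit.
    public static string NormalizeLineBreaks(string value) =>
        Regex.Replace(value, @"\r\n|\r|\n", "\r\n");

    // Normalizes line breaks, then applies UrlEncode to the result.
    public static string Encode(string value) =>
        HttpUtility.UrlEncode(NormalizeLineBreaks(value));
}
```

With this helper, an input such as "a\nb" is encoded with a CR LF escape between the two letters. Percent-escapes are case-insensitive, so the hex digits may appear in either case depending on the implementation.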

Bottom line: it meets the specification while additionally encoding $ ' and , and the caller is responsible for providing suitably normalized line breaks in the input.
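As noted earlier, the encoder is meant to be applied to each control name and value separately so that the = and & separators stay unescaped. One way to assemble a request body along those lines is sketched below; FormBodyBuilder and Build are hypothetical names, and line-break normalization is left to the caller.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Web; // HttpUtility; on .NET Framework this requires a reference to System.Web

public static class FormBodyBuilder
{
    // Encodes every name and value on its own, then joins them with
    // the literal separators '=' and '&' in the order given.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields) =>
        string.Join("&", fields.Select(f =>
            HttpUtility.UrlEncode(f.Key) + "=" + HttpUtility.UrlEncode(f.Value)));
}
```

For example, passing the pairs ("q", "a b") and ("op", "x=y&z") yields a body in which the only literal = and & characters are the separators, while those inside the second value are percent-escaped.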
