RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.
Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points ahead of them in the UTF-8 character set, i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permissible by the RFC:
1. Lower-case to www.olé.com, then percent-encode to www.ol%E9.com
2. Lower-case only the A-Z characters to www.olÉ.com, then percent-encode to www.ol%C9.com
3. Percent-encode to www.OL%C9.com, then lower-case the non-percent-encoded parts to www.ol%C9.com, producing the same result as 2.
So the question is: which is correct? If it's case 1, what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?
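To make the three candidates concrete, here is a minimal Python sketch. The helper names (lower_ascii_only, lower_outside_escapes, percent_encode) are hypothetical, and Latin-1 is assumed as the byte encoding only because it reproduces the %E9 and %C9 escapes used above; with UTF-8 the escapes would be %C3%A9 and %C3%89 instead.

```python
import re
from urllib.parse import quote


def lower_ascii_only(s: str) -> str:
    """Lower-case only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def lower_outside_escapes(s: str) -> str:
    """Lower-case A-Z everywhere except inside %XX escape triplets."""
    def fix(match: re.Match) -> str:
        text = match.group(0)
        return text if text.startswith("%") else lower_ascii_only(text)
    return re.sub(r"%[0-9A-Fa-f]{2}|[^%]+", fix, s)


def percent_encode(s: str, encoding: str = "latin-1") -> str:
    # Assumption: Latin-1 bytes, chosen to match the escapes in the question.
    return quote(s, safe="", encoding=encoding)


name = "www.OLÉ.com"

form1 = percent_encode(name.lower())             # Unicode-aware lower-casing first
form2 = percent_encode(lower_ascii_only(name))   # ASCII-only lower-casing first
form3 = lower_outside_escapes(percent_encode(name))  # encode first, then lower-case

print(form1)
print(form2)
print(form3)
```

Forms 2 and 3 agree with each other, while form 1 differs only because Python's str.lower() applies Unicode case mapping to É.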
Hostnames are compared case-insensitively by DNS, so in practice they are normalised to lowercase.
It is not possible to have non-ASCII characters in DNS hostnames (RFC 1123); however, a workaround has been put in place with "internationalized domain names" (IDN). The encoding behind this workaround is commonly known as punycode.
Punycode enables non-ASCII characters to be represented by ASCII characters.
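As a rough illustration of that representation, the sketch below uses the punycode codec from Python's standard library on the single label olé. The variable names are only for illustration; the xn-- prefix is what marks a label as punycode-encoded in a domain name.

```python
label = "olé"

# Raw punycode: the ASCII characters come first, then "-", then the
# encoded position and code point of each non-ASCII character.
puny = label.encode("punycode")

# In a domain name the encoded label carries the "xn--" prefix.
ace_label = b"xn--" + puny

# The encoding is reversible: strip the prefix and decode.
decoded = ace_label[4:].decode("punycode")

print(puny)
print(ace_label)
print(decoded)
```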
As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com. If you are getting percent signs in your domain name, it means that you have URL-encoded the hostname, and that is not correct, at least not for resolving.
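The difference between the two encodings can be sketched with Python's standard library. This is only an illustration, not how any particular browser or resolver is implemented: quote() percent-encodes the UTF-8 bytes, while the built-in idna codec (which follows the older IDNA 2003 rules, so newer IDNA 2008 tools may be stricter) produces the ASCII name that would actually be looked up.

```python
from urllib.parse import quote

host = "www.olé.com"

# Percent-encoding of the UTF-8 bytes: fine inside a URL, not a DNS name.
percent_encoded = quote(host)

# IDNA conversion: each non-ASCII label becomes "xn--" + punycode.
dns_name = host.encode("idna").decode("ascii")

# The codec's preparation step case-folds non-ASCII labels, so an
# upper-case É ends up as the same ASCII name.
dns_name_upper = "www.olÉ.com".encode("idna").decode("ascii")

print(percent_encoded)
print(dns_name)
print(dns_name_upper)
```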
For example, a link (an a tag) pointing at this hostname will work correctly in a browser. However, the DNS server will not resolve www.ol%C3%A9.com, but rather the domain name converted to punycode: www.olé.com translates to www.xn--ol-cja.com.
Web browsers will generally convert uppercase characters to their lowercase versions. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com was lowercased to www.olé.com.

I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:
Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.

The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFCs that are worth a look:
https://www.rfc-editor.org/rfc/rfc5894
https://www.rfc-editor.org/rfc/rfc5892
RFC 5894, section 3.1.3, specifies that characters may not be allowed if: