What does "case insensitive" mean in RFC 3986 with regard to non-English characters?

Posted 2024-12-09 18:55:10

RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.

Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points after them, i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permissible by the RFC:

  1. Lower case to www.olé.com then percent encode to www.ol%E9.com
  2. Lower case only A-Z characters to www.olÉ.com and then percent encode to www.ol%C9.com
  3. Percent encode to www.OL%C9.com, and then lower case the non-percent encoded parts to www.ol%C9.com, producing the same result as 2.
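
The three candidates above can be reproduced with a short Python sketch. It is only an illustration of the question, not an answer to it. The function names are hypothetical, and the Latin-1 encoding is an assumption made to match the %E9 and %C9 escapes used in the list (UTF-8 would produce %C3%A9 and %C3%89 instead).

```python
import re
import string
from urllib.parse import quote

HOST = "www.OLÉ.com"

# Translation table that lowercases only the ASCII letters A-Z.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def encode_latin1(text: str) -> str:
    """Percent-encode non-ASCII characters as Latin-1 bytes (assumption, see above)."""
    return quote(text, safe="", encoding="latin-1")


def option_1(host: str) -> str:
    """Lowercase every cased character (Unicode-aware), then percent-encode."""
    return encode_latin1(host.lower())


def option_2(host: str) -> str:
    """Lowercase only A-Z, then percent-encode."""
    return encode_latin1(host.translate(ASCII_LOWER))


def option_3(host: str) -> str:
    """Percent-encode first, then lowercase everything except the %XX escapes."""
    encoded = encode_latin1(host)
    return re.sub(
        r"%[0-9A-Fa-f]{2}|[A-Z]+",
        lambda m: m.group(0) if m.group(0).startswith("%") else m.group(0).lower(),
        encoded,
    )


for option in (option_1, option_2, option_3):
    print(option.__name__, option(HOST))
```

Options 2 and 3 agree with each other, while option 1 differs because str.lower() also maps É to é.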

So the question is: which is correct? If it's case 1, what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?

Comments (1)

好菇凉咱不稀罕他 2024-12-16 18:55:10

Hostnames resolved by DNS are matched case-insensitively for ASCII letters, and are conventionally written in lowercase.

It is not possible to have non-ASCII characters in DNS hostnames (RFC 1123); however, a workaround has been put in place with "internationalized domain names". The ASCII encoding used by this workaround is commonly known as punycode.

Punycode enables non-ASCII characters to be represented by ASCII characters.

non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).

-- https://www.ietf.org/rfc/rfc3492.txt

As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com.

If you are getting percent signs in your domain name, it means that you have URL-encoded the hostname, and that is not correct, at least not for resolving.

For example, it will work correctly to have an a tag that looks like this:

<a href="//www.ol%C3%A9.com">Click Here</a>

However, the DNS server will not resolve www.ol%C3%A9.com itself, but rather the domain name after it has been converted to punycode:

Example

www.ol%C3%A9.com

becomes

www.olé.com

which in punycode translates to:

www.xn--ol-cja.com
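
The conversion above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions: percent_host_to_dns is a hypothetical helper name, the percent escapes are assumed to be UTF-8, and the sketch skips the validation and case mapping that a real IDNA implementation performs before punycode encoding.

```python
from urllib.parse import unquote


def percent_host_to_dns(host: str) -> str:
    """Decode %XX escapes (as UTF-8), then punycode each non-ASCII label."""
    decoded = unquote(host)  # "www.ol%C3%A9.com" -> "www.olé.com"
    labels = []
    for label in decoded.split("."):
        if label.isascii():
            # Plain ASCII labels are passed through unchanged.
            labels.append(label)
        else:
            # Non-ASCII labels get the "xn--" prefix plus their punycode form.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


print(percent_host_to_dns("www.ol%C3%A9.com"))
```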

Web browsers will generally convert uppercase characters to their lowercase versions. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com is lowercased to www.olé.com before the punycode conversion.
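
A similar effect can be observed with Python's built-in "idna" codec, shown here as a hedged sketch rather than a description of what any particular browser does. The codec implements the older IDNA 2003 rules (RFC 3490), which case-fold non-ASCII labels before punycode encoding. The helper name to_dns_hostname is hypothetical.

```python
def to_dns_hostname(host: str) -> str:
    """Convert a Unicode hostname to its ASCII form with the stdlib idna codec."""
    return host.encode("idna").decode("ascii")


lower = to_dns_hostname("www.olé.com")
upper = to_dns_hostname("www.olÉ.com")

print(lower)
print(upper)
print(lower == upper)
```

Both inputs yield the same ASCII hostname because the uppercase É is mapped to é during the codec's preparation step.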

I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:

Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.

The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFCs that are worth a look:

  • Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale
    https://www.rfc-editor.org/rfc/rfc5894
  • The Unicode Code Points and Internationalized Domain Names for Applications
    https://www.rfc-editor.org/rfc/rfc5892

RFC 5894, section 3.1.3, specifies that characters may not be allowed if:

  • The character is an uppercase form or some other form that is
    mapped to another character by Unicode case folding.
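
To get a feel for which characters are affected by case folding, here is a small Python sketch. It assumes the stability test from RFC 5892, where a code point counts as unstable if it differs from NFKC(toCaseFold(NFKC(cp))), and it uses str.casefold() as the case-folding step. The function name is hypothetical, and this is an approximation for exploration, not a full IDNA2008 validity check.

```python
import unicodedata


def changes_under_folding(ch: str) -> bool:
    """Return True if ch is altered by NFKC -> case fold -> NFKC."""
    folded = unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", ch).casefold()
    )
    return folded != ch


for ch in ("É", "é", "A", "a", "9", "ß"):
    print(
        repr(ch),
        unicodedata.category(ch),   # e.g. Lu = uppercase letter, Ll = lowercase
        repr(ch.casefold()),        # the case-folded form
        changes_under_folding(ch),
    )
```

Characters without case, such as digits, come back unchanged, so the Unicode case-folding data itself is what defines which characters have a case mapping.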