Which Unicode characters are allowed in IDN host labels?

Posted 2024-09-02 01:55:32

I’m currently working on a “proper” URI validator, and at this point it all comes down to hostname validation; the rest isn’t that tricky.

I’m stuck on IDN hostname labels (i.e., containing Unicode; possible punycode encoded strings have been decoded at this point).

My first idea was basically one regex for TLDs which don’t support IDNs and one for those which do. This could perhaps be based on Mozilla’s list of IDN-enabled TLDs. Respectively, ^[a-zA-Z0-9\-]+$ and ^[a-zA-Z0-9\-\p{L}]+$. However, this is not ideal, since every IDN registry can decide for itself which characters to allow.
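As a rough illustration of the two-pattern idea, here is a minimal Python sketch. Python's built-in `re` module does not support `\p{L}`, so the second check uses `unicodedata.category` as a stand-in for it; the function names `is_ascii_label` and `is_idn_label` are made up for this example.

```python
import re
import unicodedata

# Same character class as the first regex; fullmatch() replaces the ^...$ anchors.
ASCII_LABEL = re.compile(r"[a-zA-Z0-9\-]+")


def is_ascii_label(label: str) -> bool:
    """Label check for TLDs without IDN support: ASCII letters, digits, hyphen."""
    return ASCII_LABEL.fullmatch(label) is not None


def is_idn_label(label: str) -> bool:
    """Approximation of ^[a-zA-Z0-9\\-\\p{L}]+$ without a \\p{L}-aware regex engine."""
    if not label:
        return False
    for ch in label:
        if ch.isascii():
            # ASCII part of the class: a-z, A-Z, 0-9 and the hyphen.
            if not (ch.isalnum() or ch == "-"):
                return False
        elif not unicodedata.category(ch).startswith("L"):
            # Non-ASCII characters must be in a Unicode letter category (Lu, Ll, Lo, ...).
            return False
    return True


print(is_ascii_label("example"))      # ASCII-only label
print(is_idn_label("b\u00fccher"))    # "bücher"
# Hindi word containing combining marks (categories Mn/Mc), which \p{L} excludes:
print(is_idn_label("\u0939\u093f\u0928\u094d\u0926\u0940"))
```

The last call returns False because combining vowel signs are marks, not letters. That is one more way the `\p{L}` pattern differs from what registries actually accept, on top of the per-registry differences mentioned above.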

What I’m looking for is a proper, consistent, up-to-date data table of the Unicode characters allowed in various TLDs. It’s beginning to look like I have to find all the data myself at Russian and Chinese registry sites (which is quite difficult).

So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)

Comments (2)

暮凉 2024-09-09 01:55:32

IANA maintains a list of all of the codepoints and their status at https://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties

All of the ones marked PVALID are safe to use. The ones marked CONTEXTO or CONTEXTJ have more rules to follow. Read RFC 5892 (IDNA) and RFC 6452 (changing the status of a couple of characters) for all of the gory details.
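A table-driven check along these lines might look like the sketch below. It assumes the table is available as CSV with `Codepoint`, `Property` and `Description` columns, where a code point is either a single hex value or a `first-last` range. The embedded `SAMPLE_ROWS` are a small hand-copied excerpt with shortened descriptions, not the full table, and `load_table`, `lookup` and `label_status` are hypothetical helper names.

```python
import bisect
import csv
import io

# Hand-copied excerpt for illustration only; a real validator would load the full table.
SAMPLE_ROWS = """\
Codepoint,Property,Description
0000-002C,DISALLOWED,NULL..COMMA
002D,PVALID,HYPHEN-MINUS
002E-002F,DISALLOWED,FULL STOP..SOLIDUS
0030-0039,PVALID,DIGIT ZERO..DIGIT NINE
003A-0060,DISALLOWED,COLON..GRAVE ACCENT
0061-007A,PVALID,LATIN SMALL LETTER A..LATIN SMALL LETTER Z
007B-00B6,DISALLOWED,LEFT CURLY BRACKET..PILCROW SIGN
00B7,CONTEXTO,MIDDLE DOT
00B8-00DE,DISALLOWED,CEDILLA..LATIN CAPITAL LETTER THORN
00DF-00F6,PVALID,LATIN SMALL LETTER SHARP S..LATIN SMALL LETTER O WITH DIAERESIS
200C-200D,CONTEXTJ,ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
"""


def load_table(text):
    """Parse CSV rows into a sorted list of (first, last, property) tuples."""
    ranges = []
    for row in csv.DictReader(io.StringIO(text)):
        first, _, last = row["Codepoint"].partition("-")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi, row["Property"]))
    ranges.sort()
    return ranges


def lookup(ranges, ch):
    """Return the property of one character, or None if no loaded range covers it."""
    cp = ord(ch)
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, cp) - 1  # last range starting at or before cp
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return None


def label_status(ranges, label):
    """Classify a label as 'pvalid', 'needs-context-rule' or 'disallowed'."""
    status = "pvalid"
    for ch in label:
        prop = lookup(ranges, ch)
        if prop in ("CONTEXTJ", "CONTEXTO"):
            status = "needs-context-rule"
        elif prop != "PVALID":
            # DISALLOWED, UNASSIGNED, or simply missing from the loaded excerpt.
            return "disallowed"
    return status


table = load_table(SAMPLE_ROWS)
print(label_status(table, "stra\u00dfe"))  # all code points PVALID in the excerpt
print(label_status(table, "l\u00b7l"))     # MIDDLE DOT is CONTEXTO
print(label_status(table, "Abc"))          # uppercase ASCII is DISALLOWED
```

A label that comes back as "needs-context-rule" still has to pass the contextual rules before it can be accepted; this sketch only flags it and does not implement those rules.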

離人涙 2024-09-09 01:55:32

Can't you convert all Unicode domains to Punycode and validate that? Since DNS doesn't support real UTF-8 characters anyway, this might be the best solution.
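A minimal sketch of that approach in Python, using the standard library's `idna` codec. Note the assumption: that codec implements the older IDNA 2003 rules (RFC 3490), not the IDNA2008 rules behind the IANA table in the other answer, so a strict validator would likely need a dedicated IDNA2008 library instead. The name `to_ascii_host` is made up for this example.

```python
import re

# Letter-digit-hyphen label: 1-63 characters, no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def to_ascii_host(host: str):
    """Convert a hostname to its ASCII (xn--) form and validate it; None if invalid."""
    try:
        # The stdlib codec applies nameprep and Punycode per label (IDNA 2003).
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    if len(ascii_host) > 253:
        return None
    # After conversion every label must be plain letter-digit-hyphen.
    if not all(LDH_LABEL.fullmatch(label) for label in ascii_host.split(".")):
        return None
    return ascii_host


print(to_ascii_host("b\u00fccher.example"))  # xn--bcher-kva.example
print(to_ascii_host("a_b.example"))          # None (underscore is not LDH)
```

This confirms that the result is a well-formed ASCII hostname, but it says nothing about whether a given registry would actually accept the original Unicode characters.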
