
Extracting the TLD of a hostname using a regular expression

Posted 2024-09-13 08:58:28


Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.

Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:

http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat

I have two specific questions:

Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.

If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.

Here's my take on converting this file into a usable Regexp in Ruby:

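# Build a regexp from the contents of effective_tld_names.dat.
# Note: exception rules (lines starting with "!") are not handled here.
TLD_SPEC = Regexp.new(
  '[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
  ].split(/\n/).collect do |line|
    # Strip // comments and trailing whitespace
    line.sub(%r[//.*], '').sub(/\s+$/, '')
  end.reject(&:empty?).collect do |s|
    # Escape each rule; a leading "*." wildcard matches any single label
    Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
  end.join('|') + ')$'
)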
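
As a rough illustration of how a pattern of this shape behaves, the sketch below applies the same transformation to a handful of sample rules rather than the full file. The rule list and the names `SAMPLE_RULES`, `SAMPLE_TLD_SPEC`, and `registered_domain` are assumptions made for this example; the real list is far longer and also contains exception rules (lines starting with `!`) that this approach does not handle.

```ruby
# Hypothetical, heavily shortened stand-in for effective_tld_names.dat.
SAMPLE_RULES = <<~RULES
  // sample comment line
  com
  co.uk
  *.ck
RULES

# Strip comments and whitespace, drop empty lines, then escape each rule.
# A leading "*." wildcard is rewritten to match any single label.
alternatives = SAMPLE_RULES.lines
  .map { |line| line.sub(%r{//.*}, '').strip }
  .reject(&:empty?)
  .map { |rule| Regexp.escape(rule).sub(/\A\\\*\\\./) { '[^.]+\.' } }

SAMPLE_TLD_SPEC = Regexp.new('[^.]+\.(' + alternatives.join('|') + ')$')

# Returns the registered domain (one label plus the matched suffix),
# or nil when no sample rule applies.
def registered_domain(host, spec = SAMPLE_TLD_SPEC)
  match = spec.match(host)
  match && match[0]
end
```

Because the regexp engine returns the leftmost successful match, a host such as `www.example.co.uk` resolves against the `co.uk` rule rather than stopping at a shorter suffix further to the right.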


梦中的蝴蝶 2024-09-20 08:58:28

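You might want to consider using Addressable to see if it does what you need. It is much more full-featured than Ruby's default URI library. In particular, its template feature might help you.

From the docs:

Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds support for IRIs and URI templates. Additionally, it provides extensive support for URI templates.

With the recent opening of new top-level domains (TLDs), this is going to be a nightmare for a while. Look at the Related list on the right to see how many people are trying to find solutions. "Regex to match Domain.CCTLD" recommends breaking the task into smaller steps using a function, which is what I would do. Trying to do this with a regex presumes you can do it all in one expression, which starts to smell a bit like using a regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.

That answer mentions the Public TLD listing. Using the information there, you can quickly build a reasonably good regex on the fly with Ruby's Regexp.escape and Regexp.union methods. It would be nice if we had Perl's Regexp::Assemble module available, but we don't, so union will have to do. (See "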
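
The "smaller steps" idea can be sketched as a plain function instead of one large pattern: split the hostname into labels, then test progressively shorter suffixes against a set of known rules. The `SUFFIXES` set and the `split_host` name are hypothetical, and the set holds only a few sample entries; wildcard and exception rules from the full list are deliberately left out of this sketch.

```ruby
require 'set'

# Hypothetical sample of known suffixes; a real list would be loaded from a file.
SUFFIXES = Set.new(%w[com uk co.uk]).freeze

# Returns [registered_domain, suffix], or nil when no known suffix matches.
def split_host(host)
  labels = host.downcase.split('.')
  # Try the longest candidate suffix first, always leaving at least one
  # label in front of it to serve as the registered name.
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join('.')
    return [labels[(i - 1)..-1].join('.'), suffix] if SUFFIXES.include?(suffix)
  end
  nil
end
```

Checking the longest suffix first means `co.uk` wins over `uk` for a host like `www.example.co.uk`, and each step stays small enough to test and maintain on its own.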