Extracting the TLD of a hostname using a regular expression
Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these or establishing standards, this has made determining the actual TLD a somewhat complicated affair.
Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:
http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat
I have two specific questions:
Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.
If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.
Here's my take on converting this file into a usable Regexp in Ruby:
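# Build one end-anchored pattern from the effective TLD list.
# Note: String#blank? comes from ActiveSupport, not core Ruby.
TLD_SPEC = Regexp.new(
  '[^\.]+\.(' + %q[
// ***** BEGIN LICENSE BLOCK *****
// ... (Rest of file)
].split(/\n/).collect do |line|
  # Strip // comments and trailing whitespace from each line
  line.sub(%r[//.*], '').sub(/\s+$/, '')
end.reject(&:blank?).collect do |s|
  # Escape each rule; a leading "*." wildcard becomes "any single label"
  Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')
end.join('|') + ')$'
)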
Comments (3)
You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.
From the docs:
With the recent opening of the new TLDs, it's going to be a nightmare for a while. Check out the related list to the right to see how many people are trying to find a solution. Regex to match Domain.CCTLD recommends using a function to break it down into smaller steps and is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.
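As a rough illustration of the "smaller steps in a function" idea, here is a minimal sketch that avoids a single large regex. The names SAMPLE_SUFFIXES and split_registered_domain are hypothetical, and the handful of suffixes is only a stand-in for a real list; wildcard and exception rules from the Mozilla file are not handled.

```ruby
require 'set'

# Hypothetical sample of suffix rules; a real list would be loaded
# from a file such as the effective_tld_names.dat mentioned in the question.
SAMPLE_SUFFIXES = Set.new(%w[com org uk co.uk jp co.jp])

# Returns [registered_domain, suffix] for a hostname, or nil when no
# known suffix matches or the host is itself a bare suffix.
def split_registered_domain(host, suffixes = SAMPLE_SUFFIXES)
  labels = host.downcase.chomp('.').split('.')
  # Check the longest candidate first, so "co.uk" wins over "uk".
  index = (0...labels.length).find do |i|
    suffixes.include?(labels[i..-1].join('.'))
  end
  return nil if index.nil? || index.zero?
  # The registered domain is the suffix plus one label to its left.
  [labels[(index - 1)..-1].join('.'), labels[index..-1].join('.')]
end

p split_registered_domain('www.google.co.uk')  # prints ["google.co.uk", "co.uk"]
p split_registered_domain('example.com')       # prints ["example.com", "com"]
p split_registered_domain('localhost')         # prints nil
```

Each step (normalize, split into labels, find the longest known suffix, keep one extra label) can be tested and changed separately, which is harder to do with one generated pattern.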
That answer mentions the public TLD list. Using the information there you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't, so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)
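For what it's worth, building a pattern with Regexp.union might look something like the sketch below. SAMPLE_RULES and REGISTERED_DOMAIN are made-up names, and the short rule list is an assumption standing in for the full public list; wildcard ("*.") and exception ("!") rules would need extra handling.

```ruby
# Hypothetical sample of suffix rules, standing in for the full list.
SAMPLE_RULES = %w[com org uk co.uk jp co.jp]

# Regexp.union escapes plain strings itself, so Regexp.escape is only
# needed when strings are interpolated into a pattern by hand.
suffix_union = Regexp.union(SAMPLE_RULES)

# One label, a dot, then a known suffix anchored at the end of the host.
# Using #source keeps the outer /i flag in effect for the suffixes too.
REGISTERED_DOMAIN = /([^.]+)\.(#{suffix_union.source})\z/i

m = REGISTERED_DOMAIN.match('www.google.co.uk')
puts m[0]  # prints google.co.uk
puts m[2]  # prints co.uk
```

Because the pattern is anchored at the end and the leftmost match wins, "google.co.uk" is preferred over "co.uk" here. A bare suffix such as "co.uk" on its own would still match as if "co" were a registered name, so that case needs a separate check.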
There is another flat-file db here at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java
Perhaps you could combine the two and upload the result to somewhere like OData.org, GitHub, or SourceForge.
There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.