Can I improve this regex check for valid domain names?
So, I have been working on this domain name regular expression. So far, it seems to pick up domain names with SLDs and TLDs (with the optional ccTLD), but there is duplication of the TLD listing. Can this be refactored any further?
params[:domain_name].downcase.strip.match(/^[a-z0-9\-]{2,63}
\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
(\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|
(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))?$/x)
Comments (6)
Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.
The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!
Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
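As a rough sketch of that suggestion in Ruby (the question's language): the Public Suffix List is a plain-text file with one rule per line, `//` comment lines, and blank lines. The sample text below is a tiny hand-written excerpt in that format, not the real list, and `parse_suffix_list` is a hypothetical helper name. Wildcard (`*.`) and exception (`!`) rules are simply skipped here, so this is not the list's full matching algorithm.

```ruby
require "set"

# Tiny hand-written excerpt in the Public Suffix List format.
# (Assumption: the real file is downloaded and cached by your application.)
sample = <<~LIST
  // ===BEGIN ICANN DOMAINS===
  com
  org

  // uk : illustrative entries only
  uk
  co.uk
  *.ck
  !www.ck
LIST

# Parse list text into a Set of plain suffix rules.
# Comment lines, blank lines, wildcard rules and exception rules are skipped.
def parse_suffix_list(text)
  rules = text.each_line.map(&:strip).reject do |line|
    line.empty? || line.start_with?("//", "*", "!")
  end
  Set.new(rules.map(&:downcase))
end

suffixes = parse_suffix_list(sample)
puts suffixes.include?("co.uk")  # true
puts suffixes.include?("*.ck")   # false (wildcard rules are skipped in this sketch)
```

Parsing from a string keeps the sketch independent of how the file is fetched; in practice you would probably download it on a schedule rather than on every request.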
Download this: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Example usage (in Python):
You can factor the domain-list-building out of the validate function to help performance.
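For readers following along in Ruby (the question's language), a minimal sketch of this approach might look like the following. It assumes the file format is a `#` comment header followed by one upper-case TLD per line; the sample text is a hand-written excerpt, and `load_tlds` and `valid_tld?` are hypothetical names. The set is built once, outside the validation method, as suggested above.

```ruby
require "set"

# Hand-written excerpt in the assumed format of tlds-alpha-by-domain.txt:
# a "#" comment header, then one upper-case TLD per line.
sample = <<~TXT
  # Sample header line (placeholder, not the real header)
  COM
  INFO
  ORG
  UK
TXT

# Build the TLD set once, outside the validation method.
def load_tlds(text)
  Set.new(
    text.each_line.map(&:strip)
        .reject { |line| line.empty? || line.start_with?("#") }
        .map(&:downcase)
  )
end

# Check only the final label against the prebuilt set.
def valid_tld?(name, tlds)
  labels = name.downcase.strip.split(".")
  labels.length >= 2 && tlds.include?(labels.last)
end

tlds = load_tlds(sample)
puts valid_tld?("Example.COM", tlds)    # true
puts valid_tld?("example.bogus", tlds)  # false
```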
I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? The domain name seems to be "info.com" in that particular case.
You might also want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash.
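That rule can be checked per label. A minimal sketch, assuming the usual hostname convention (letters, digits and hyphens only, no hyphen at either end, 1 to 63 characters); `valid_label?` is a hypothetical helper, and the question's own minimum of 2 characters is not enforced here.

```ruby
# One label: starts and ends with a letter or digit, hyphens only inside,
# 1 to 63 characters in total (assumed hostname-style rules).
def valid_label?(label)
  /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/.match?(label.downcase)
end

puts valid_label?("example")   # true
puts valid_label?("-example")  # false
puts valid_label?("a" * 64)    # false
```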
Well, as you have it written, the TLD part is equivalent to, but longer than,
(\.<tldpart>){1,2}
but I'm sure it could be fixed for duplication... Edit: yech, no. It would be possible, but I think handling the duplications would essentially need a very slow brute-force list. It is simpler and faster to put the possible TLD and SLD+country pairs in a big hashmap and check the substring against that.
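A minimal sketch of the hashmap idea, using Ruby's hash-backed `Set`. The entries are illustrative only, and `matching_suffix` is a hypothetical helper: it tries the last two labels first (an SLD+country pair), then the last label alone.

```ruby
require "set"

# Illustrative entries only; a real table would come from a maintained list.
suffixes = Set.new(%w[com org info uk co.uk org.uk])

# Return the longest matching suffix (two labels first, then one), or nil.
def matching_suffix(name, suffixes)
  labels = name.downcase.strip.split(".")
  [2, 1].each do |n|
    next unless labels.length > n  # need at least one label before the suffix
    candidate = labels.last(n).join(".")
    return candidate if suffixes.include?(candidate)
  end
  nil
end

p matching_suffix("example.co.uk", suffixes)  # "co.uk"
p matching_suffix("foo.info.com", suffixes)   # "com"
p matching_suffix("example.zz", suffixes)     # nil
```

Each lookup is a constant-time set membership test, so adding or removing suffixes does not change the cost the way a growing regex alternation would.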
I'd recommend starting with the rules laid out in RFC 1035 and then working backwards, but only if you really really really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out the site regexlib.com and browse through what other folks have done.
You can build up the regex as a string and then do Regexp.new(string).
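For example, the TLD alternation can be written once as a string and reused for both the TLD and the optional second suffix label. This is only a sketch: the alternation below is a short illustrative subset rather than the question's full list, and `build_domain_regex` is a hypothetical helper name.

```ruby
# Build the domain regex from a single TLD fragment so the list is written once.
# Single quotes keep the backslashes literal inside the string pieces.
def build_domain_regex(tld)
  Regexp.new('\A[a-z0-9\-]{2,63}\.' + tld + '(?:\.' + tld + ')?\z')
end

# Short illustrative subset; the question's full alternation would go here.
tld = "(?:com|net|org|info|uk|co)"
domain_regex = build_domain_regex(tld)

puts domain_regex.match?("example.co.uk")  # true
puts domain_regex.match?("example.bogus")  # false
```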