Can I improve this regex check for valid domain names?

Posted on 2024-07-10 14:29:48

So, I have been working on this domain name regular expression. So far, it seems to pick up domain names with SLDs and TLDs (with the optional ccTLD), but there is duplication of the TLD listing. Can this be refactored any further?

params[:domain_name].downcase.strip.match(/^[a-z0-9\-]{2,63}
\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|me|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])
(\.((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|
(c[acdfghiklmnorsuvxyz]|cat|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|
(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|
(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|
(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|
(n[acefgilopruz]|name|net)|(om|org)|
(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|
(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))?$/)

Comments (6)

陌上芳菲 2024-07-17 14:29:49

Please, please, please don't use a fixed and horribly complicated regex like this to match for known domain names.

The list of TLDs is not static, particularly with ICANN looking at a streamlined process for new gTLDs. Even the list of ccTLDs changes sometimes!

Have a look at the list available from http://publicsuffix.org/ and write some code that's able to download and parse that list instead.
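To give a rough idea of what the "parse that list" step could look like, here is a minimal Python sketch. It assumes the list's plain-text format (one rule per line, comment lines starting with `//`) and, to stay self-contained, parses a tiny hand-written sample instead of downloading the real file. `SAMPLE_LIST`, `parse_suffix_list` and `public_suffix` are made-up names, and wildcard (`*.`) and exception (`!`) rules are skipped rather than implemented.

```python
# Tiny hand-written sample in the publicsuffix.org text format (not the real list).
SAMPLE_LIST = """\
// comment lines start with two slashes
com
org
uk
co.uk
*.ck
!www.ck
"""

def parse_suffix_list(text):
    """Return the set of plain suffix rules, skipping comments and blanks."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("//"):
            continue
        rule = line.split()[0].lower()
        # Simplification: wildcard and exception rules are ignored here.
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        rules.add(rule)
    return rules

def public_suffix(domain, rules):
    """Return the longest known suffix of `domain`, or None if none matches."""
    labels = domain.lower().strip(".").split(".")
    # Candidates run from longest to shortest, so the first hit is the longest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    return None

RULES = parse_suffix_list(SAMPLE_LIST)
print(public_suffix("example.co.uk", RULES))
print(public_suffix("omnom.nom", RULES))
```

With the sample data, the first call finds `co.uk` and the second finds nothing. A real implementation would also need to honour the wildcard and exception rules and refresh the downloaded list periodically.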

单调的奢华 2024-07-17 14:29:49

Download this: http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Example usage (in Python):

import re

def validate(domain):
    # Build one alternation from the downloaded list, skipping '#' comment lines.
    with open('domains.txt') as f:
        valid_domains = [re.escape(line.strip().upper())
                         for line in f
                         if line.strip() and not line.startswith('#')]
    r = re.compile(r'^[A-Z0-9\-]{2,63}\.(%s)$' % '|'.join(valid_domains))
    return bool(r.match(domain.upper()))

print(validate('stackoverflow.com'))
print(validate('omnom.nom'))

You can factor the domain-list-building out of the validate function to help performance.
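One possible way to do that factoring, sketched under a few assumptions: the list is passed in as lines (a short inline sample stands in for `domains.txt` here), and `build_validator` and `SAMPLE_TLDS` are hypothetical names. The pattern is compiled once and every later call only runs the match.

```python
import re

# Stand-in for the contents of domains.txt (the real file is much longer).
SAMPLE_TLDS = """\
# sample data only
COM
NET
ORG
"""

def build_validator(lines):
    """Compile the pattern once and return a function that reuses it."""
    tlds = [re.escape(line.strip().upper())
            for line in lines
            if line.strip() and not line.startswith("#")]
    pattern = re.compile(r"^[A-Z0-9\-]{2,63}\.(?:%s)$" % "|".join(tlds))

    def validate(domain):
        return bool(pattern.match(domain.upper()))

    return validate

# Built once at import time; the file is no longer re-read on every call.
validate = build_validator(SAMPLE_TLDS.splitlines())

print(validate("stackoverflow.com"))
print(validate("omnom.nom"))
```

With a real file you would pass `open('domains.txt')` to `build_validator` instead of the sample lines.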

自在安然 2024-07-17 14:29:49

I probably don't know enough about domain names, but why are domains like "foo.info.com" matched? It seems that the domain name is "info.com" in that particular case.

And you might want to make sure the name starts with [a-z\d]. I don't think you can register a domain that starts with a dash?
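Both observations are easy to reproduce with a cut-down version of the pattern. This is only a sketch in Python: the TLD list is reduced to a handful of entries for readability, and `original_shape` and `no_leading_dash` are made-up names.

```python
import re

# Reduced TLD list for illustration only; the question's list is far longer.
TLD = r"(?:com|info|net|org|uk)"

# Same shape as the question: one label, a TLD, then an optional second TLD.
original_shape = re.compile(r"^[a-z0-9\-]{2,63}\.%s(?:\.%s)?$" % (TLD, TLD))

# Variant whose label must begin with a letter or digit (same 2-63 length).
no_leading_dash = re.compile(r"^[a-z0-9][a-z0-9\-]{1,62}\.%s(?:\.%s)?$" % (TLD, TLD))

for name in ("foo.info.com", "-foo.com"):
    print(name, bool(original_shape.match(name)), bool(no_leading_dash.match(name)))
```

The question's shape accepts both names. The variant still accepts "foo.info.com", because "info" is a valid entry in the first TLD slot, but rejects the leading dash.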

谁把谁当真 2024-07-17 14:29:49

Well, as you have it written, the TLD part is equivalent to, but longer than, (\.<tldpart>){1,2}, but I'm sure it could be fixed for duplication...

edit: yech, no, it would be possible, but I think handling the duplications would essentially take a very slow brute-force list. It is simpler and faster to put the possible TLD and SLD+country pairs in a big hashmap and check the substring against that.
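A minimal sketch of the lookup idea, using a Python set (which is hash-based) in place of a hashmap. The entries in `KNOWN_SUFFIXES` are a tiny hand-typed sample and `split_domain` is a hypothetical helper. A real table would be generated from a maintained list.

```python
# Hypothetical, deliberately tiny table; not a complete or authoritative list.
KNOWN_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "org.uk", "com.au"}

def split_domain(domain):
    """Return (name, suffix) using the longest known suffix, or None."""
    labels = domain.lower().strip().split(".")
    # Try the two-label suffix first so "co.uk" wins over "uk".
    for size in (2, 1):
        if len(labels) > size:
            suffix = ".".join(labels[-size:])
            if suffix in KNOWN_SUFFIXES:
                return ".".join(labels[:-size]), suffix
    return None

print(split_domain("example.co.uk"))
print(split_domain("stackoverflow.com"))
print(split_domain("omnom.nom"))
```

Each check is a constant-time set lookup on the last one or two labels, so no alternation over hundreds of TLDs is needed.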

白日梦 2024-07-17 14:29:49

I'd recommend starting with the rules laid out in RFC 1035, and then working backwards, but only if you really, really, really need to do this from scratch. A domain regex pattern has got to be (arguably second only to email address regex patterns) the most common thing out there. I would check out the site regexlib.com and browse through what other folks have done.
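For reference, the label syntax in RFC 1035 (section 2.3.1) can be sketched in Python roughly as follows. This is a syntax check only, and `is_rfc1035_name` is a made-up name. The 253-character default is an assumption: it is the commonly quoted limit for the dotted text form, whereas the RFC itself states the limit as 255 octets. Note that RFC 1123 later relaxed the first character to allow a digit, so names like "3com.com" fail this stricter check.

```python
import re

# RFC 1035 section 2.3.1: a label starts with a letter, ends with a letter or
# digit, and has only letters, digits and hyphens in between (max 63 chars).
LABEL = re.compile(r"[a-z](?:[a-z0-9-]{0,61}[a-z0-9])?", re.IGNORECASE)

def is_rfc1035_name(name, max_length=253):
    """Check syntax only; says nothing about whether the TLD exists."""
    if not name or len(name) > max_length:
        return False
    # fullmatch ensures the whole label fits the grammar; empty labels fail.
    return all(LABEL.fullmatch(label) for label in name.split("."))

print(is_rfc1035_name("stackoverflow.com"))
print(is_rfc1035_name("-foo.com"))
```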

夏の忆 2024-07-17 14:29:49

You can build up the regex as a string and then do Regexp.new(string).
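The same idea sketched in Python, where `re.compile` plays the role of Ruby's `Regexp.new`. Python is used only to match the other example in this thread. The TLD list is cut down to a few entries, and `TLD_PART`, `DOMAIN_RE` and `looks_valid` are hypothetical names. The TLD alternation is written once and reused through a `{1,2}` quantifier, which removes the duplicated listing.

```python
import re

# Reduced TLD list for illustration; the full alternation would go here once.
TLD_PART = "|".join(["com", "net", "org", "info", "uk", "au"])

# The TLD group is written once and reused through the {1,2} quantifier.
PATTERN = r"^[a-z0-9\-]{2,63}(?:\.(?:%s)){1,2}$" % TLD_PART
DOMAIN_RE = re.compile(PATTERN)

def looks_valid(raw):
    """Normalise the input the way the question does, then match."""
    return bool(DOMAIN_RE.match(raw.lower().strip()))

print(looks_valid("Example.com "))
print(looks_valid("example.com.au"))
print(looks_valid("example.nom"))
```

One caveat: in the question the two listings are not quite identical ("me" appears only in the first), so a single shared list changes the behaviour slightly.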
