Regular expression to match Domain.CCTLD

Posted on 2024-09-08 22:01:48

Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com doesn't get matched, but google.com does. However, this gets complicated with stuff like .co.uk, CCTLDs. Does anyone know a solution? Thanks in advance.

EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.

Comments (3)

岛歌少女 2024-09-15 22:01:48

It sounds like you are looking for the information available through the Public Suffix List project.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
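To illustrate what "code that uses the public suffix list" might look like, here is a minimal Python sketch that reduces a hostname such as john.doe.google.co.uk to its registrable part. The name registrable_domain and the small hard-coded PUBLIC_SUFFIXES set are assumptions made for this example; a real program would load the full list, and this sketch ignores the list's wildcard and exception rules.

```python
# Tiny illustrative subset; a real program would load the full Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk", "pvt.k12.wy.us"}


def registrable_domain(hostname):
    """Return the longest public suffix plus one label, or None if nothing matches."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


print(registrable_domain("john.doe.google.co.uk"))
print(registrable_domain("docs.google.com"))
```

A hostname is then an "atomic domain" in the asker's sense exactly when registrable_domain(hostname) returns the hostname itself.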

仄言 2024-09-15 22:01:48

Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.

First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.

suffixes = parse_suffix_list("suffix_list.txt")
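
The answer leaves parse_suffix_list undefined. One possible sketch follows, assuming the file uses the Public Suffix List text format (one rule per line, "//" comments, a rule ending at the first whitespace). As a simplification for this example, wildcard ("*.") and exception ("!") rules are skipped rather than interpreted, so the result is only an approximation of the full list.

```python
def parse_suffix_list(path):
    """Read plain suffix rules from a Public Suffix List style text file."""
    suffixes = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and "//" comment lines.
            if not line or line.startswith("//"):
                continue
            rule = line.split()[0]  # a rule ends at the first whitespace
            # Assumption: wildcard and exception rules are dropped here;
            # a complete implementation would have to interpret them.
            if rule.startswith(("*.", "!")):
                continue
            suffixes.append(rule.lower())
    return suffixes
```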

Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:

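def is_domain(d):
    d = d.lower().rstrip(".")
    # A public suffix on its own (e.g. "co.uk") is not a registrable domain
    if d in suffixes:
        return False
    for suffix in suffixes:
        # Require the dot so that "welcom" does not match the suffix "com"
        if d.endswith("." + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and "." not in base_name:
                return True
    # If we get here, no matches were found
    return False

寂寞陪衬 2024-09-15 22:01:48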

I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):

tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i

I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e., you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
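
For readers working in Python, the same idea might be sketched as follows. The four-entry suffixes list and the name domain_re are placeholders for this example, not the real TLD list; re.escape takes care of the dots so the entries can be written without backslashes.

```python
import re

# Illustrative subset only; the real list is far longer.
suffixes = ["com", "co.uk", "eu", "org"]

# Escape each suffix and join them into one alternation, longest first.
alternation = "|".join(
    re.escape(s) for s in sorted(suffixes, key=len, reverse=True)
)

# One hostname label, a dot, then one of the known suffixes.
domain_re = re.compile(
    r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?\.(?:" + alternation + r")$",
    re.IGNORECASE,
)

for name in ["google.com", "google.co.uk", "docs.google.com"]:
    print(name, bool(domain_re.match(name)))
```

One caveat with this approach: once the list contains both "uk" and "co.uk", the bare suffix "co.uk" also matches (as the label "co" under "uk"), so a separate check against the suffix list itself is still needed.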
