Regular expression to match Domain.CCTLD

Posted 2024-09-08 22:01:48

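Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com should not match, but google.com should. However, this gets complicated with suffixes such as .co.uk and other ccTLDs. Does anyone know a solution? Thanks in advance.

EDIT: I've realized I also have to deal with multiple subdomains, such as john.doe.google.co.uk. I need a solution now more than ever :P.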

岛歌少女 2024-09-15 22:01:48

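It sounds like you are looking for the information available through the Public Suffix List project.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.

There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code that uses the Public Suffix List, or find an existing library that already does so.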

仄言 2024-09-15 22:01:48

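Based on your comment above, I'm going to reinterpret the question: rather than making a regex that matches them, we'll create a function that matches them, and apply that function to filter a list of domain names so that it only includes first-class domains, e.g. google.com, amazon.co.uk.

First, we'll need a list of TLDs. As Greg mentioned, the Public Suffix List is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.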

suffixes = parse_suffix_list("suffix_list.txt")
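
For what it's worth, here is a minimal sketch of what a helper like parse_suffix_list could look like. It assumes the file follows the published Public Suffix List format (one rule per line, "//" comments, blank lines ignored). The helper name parse_suffix_lines and the sample text are made up for illustration, and wildcard ("*.") and exception ("!") rules are simply skipped here rather than handled properly.

```python
def parse_suffix_lines(lines):
    """Return the plain suffix rules found in an iterable of text lines."""
    suffixes = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '//' comment lines.
        if not line or line.startswith("//"):
            continue
        # A rule ends at the first whitespace character.
        rule = line.split()[0].lower()
        # Assumption: wildcard and exception rules are ignored in this sketch.
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        suffixes.append(rule)
    return suffixes


def parse_suffix_list(path):
    """Read a suffix list file and return its plain rules."""
    with open(path, encoding="utf-8") as f:
        return parse_suffix_lines(f)


# Made-up sample in the same shape as the real list.
sample = """\
// example comment line
com

uk
co.uk
*.ck
!www.ck
"""
print(parse_suffix_lines(sample.splitlines()))  # ['com', 'uk', 'co.uk']
```

Skipping the wildcard and exception rules keeps the sketch short, but it means names under those suffixes will be classified incorrectly, so an existing library is probably the safer choice for real use.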

Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:

def is_domain(d):
    for suffix in suffixes:
        # Require a dot before the suffix so "notcom" does not match "com"
        if d.endswith("." + suffix):
            # Get the base domain name without the suffix and its dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain
            if base_name and "." not in base_name:
                return True
    # If we get here, no matches were found
    return False
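
For the multi-level case from the question's edit (john.doe.google.co.uk), one possible variation is to find the longest matching suffix and keep exactly one label in front of it. The sketch below is only an illustration: SUFFIXES is a tiny hand-picked subset rather than the real list, and the names registrable_domain and is_atomic_domain are hypothetical.

```python
# Assumption: a tiny illustrative subset, not the real Public Suffix List.
SUFFIXES = {"com", "org", "uk", "co.uk"}


def registrable_domain(hostname, suffixes=SUFFIXES):
    """Return the public suffix plus one label, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so "co.uk" is found before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def is_atomic_domain(hostname, suffixes=SUFFIXES):
    """True if hostname is exactly one label plus a public suffix."""
    name = hostname.lower().rstrip(".")
    return registrable_domain(name, suffixes) == name


print(registrable_domain("john.doe.google.co.uk"))  # google.co.uk
print(is_atomic_domain("docs.google.com"))          # False
print(is_atomic_domain("google.com"))               # True
```

Matching the longest suffix first also means a bare suffix such as co.uk is not reported as a domain, which a simple endswith check against "uk" would do.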

寂寞陪衬 2024-09-15 22:01:48

I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):

tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i

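I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e., you could always construct a subdomain that looks like a TLD if you knew how the regex worked).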
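
Since the question leans toward Python, the same idea could be sketched there roughly as follows. The suffixes list is a small illustrative subset standing in for the full list, and the names label and pattern are just placeholders.

```python
import re

# Assumption: a small illustrative subset standing in for the full TLD list.
suffixes = ["com", "co.uk", "eu", "org"]

# Escape each suffix so that its dots are matched literally.
alternation = "|".join(re.escape(s) for s in suffixes)

# One hostname label: alphanumeric at both ends, hyphens allowed inside.
label = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

pattern = re.compile(rf"{label}\.(?:{alternation})", re.IGNORECASE)

for name in ["google.com", "amazon.co.uk", "docs.google.com"]:
    # fullmatch requires the whole string to fit label + "." + suffix.
    print(name, bool(pattern.fullmatch(name)))
```

With this subset, google.com and amazon.co.uk match and docs.google.com does not. Note that once both "uk" and "co.uk" are in the list, the bare string co.uk would also match (label "co" plus suffix "uk"), so the regex alone cannot tell a suffix from a registered name.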
