Regular expression to match Domain.CCTLD
Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com doesn't get matched, but google.com does. However, this gets complicated with ccTLD suffixes like .co.uk. Does anyone know a solution? Thanks in advance.
EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.
Comments (3)
It sounds like you are looking for the information available through the Public Suffix List project.
There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
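As a rough illustration of what "implementing code to use the list" might start with, here is a minimal Python sketch that reads text in the Public Suffix List format (one rule per line, `//` comments, blank lines) into a set. The function name `parse_public_suffix_list` and the `SAMPLE` excerpt are made up for this example, and wildcard (`*.`) and exception (`!`) rules are deliberately skipped, so this is not a complete implementation of the list's matching algorithm.

```python
def parse_public_suffix_list(text):
    """Return the set of plain suffix rules found in PSL-formatted text."""
    suffixes = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments, which start with "//".
        if not line or line.startswith("//"):
            continue
        # Each rule is the text up to the first whitespace.
        rule = line.split()[0].lower()
        # Wildcard ("*.") and exception ("!") rules need extra handling;
        # this sketch simply ignores them.
        if rule.startswith("*.") or rule.startswith("!"):
            continue
        suffixes.add(rule)
    return suffixes


# Hypothetical excerpt for illustration only, not the real list.
SAMPLE = """\
// sample excerpt
com
uk
co.uk

*.ck
!www.ck
"""

suffixes = parse_public_suffix_list(SAMPLE)
```

An existing library that already handles the full rule set is likely the safer choice for production use.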
Based on your comment above, I'm going to reinterpret the question: rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names so that it only includes first-class domains, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python array called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
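The code that originally followed is not preserved in this copy. One possible way to write such a check, as a sketch only: the helper names `registrable_domain` and `is_first_class_domain` are invented here, `suffixes` is a small hand-written set standing in for the parsed list (a set rather than an array, for fast lookups), and wildcard or exception rules from the real list are not handled.

```python
def registrable_domain(hostname, suffixes):
    """Return the 'some-name.suffix' part of hostname, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is found before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The hostname is itself a public suffix (e.g. "co.uk").
                return None
            # Keep exactly one label in front of the suffix.
            return ".".join(labels[i - 1:])
    return None


def is_first_class_domain(hostname, suffixes):
    """True if hostname is exactly one label followed by a known suffix."""
    normalized = hostname.lower().rstrip(".")
    return registrable_domain(normalized, suffixes) == normalized


# Stand-in for the parsed public suffix list (assumption for this example).
suffixes = {"com", "uk", "co.uk"}

domains = ["docs.google.com", "google.com", "amazon.co.uk", "john.doe.google.co.uk"]
first_class = [d for d in domains if is_first_class_domain(d, suffixes)]
print(first_class)  # ['google.com', 'amazon.co.uk']
```

`registrable_domain` also covers the multiple-subdomain case from the question: for john.doe.google.co.uk it returns google.co.uk.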
I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):
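The Ruby snippet is not preserved in this copy. As a stand-in, here is a rough Python sketch of the same idea: build one alternation from a suffix list and require exactly one label in front of it. The names `build_atomic_domain_regex` and `is_atomic` are hypothetical, the three-entry `suffixes` set is a placeholder for a full list, and the label pattern only covers ASCII letters, digits, and hyphens.

```python
import re


def build_atomic_domain_regex(suffixes):
    """Compile a pattern for: one label, a dot, then any known suffix."""
    # Escape each suffix so its dots are matched literally.
    alternatives = "|".join(re.escape(s) for s in sorted(suffixes))
    return re.compile(r"[a-z0-9-]+\.(?:" + alternatives + r")", re.IGNORECASE)


# Placeholder suffix list (assumption for this example).
suffixes = {"com", "uk", "co.uk"}
pattern = build_atomic_domain_regex(suffixes)


def is_atomic(hostname):
    # A bare suffix such as "co.uk" would otherwise match as "co" + "uk",
    # so it is excluded explicitly.
    if hostname.lower() in suffixes:
        return False
    # fullmatch rejects anything with extra labels in front.
    return pattern.fullmatch(hostname) is not None


print(is_atomic("google.com"))       # True
print(is_atomic("docs.google.com"))  # False
print(is_atomic("amazon.co.uk"))     # True
```

With a complete suffix list the alternation becomes very large, which is one reason a plain set lookup may be preferable to a generated regex.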
I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e., you could always construct a subdomain that looks like a TLD if you knew how the regex worked).