How to extract the top-level domain (TLD) from a URL

Posted on 2024-07-26 17:22:57


How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top-Level Domains) or country codes (because they change)?

Thanks
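For reference, the same naive attempt in Python 3 (where the urlparse module moved into urllib.parse) shows the failure mode directly:

```python
from urllib.parse import urlparse

# The naive rule: keep only the last two labels of the hostname.
url = "http://www.foo.com.au"
naive = ".".join(urlparse(url).netloc.split(".")[-2:])
print(naive)  # com.au -- wrong: the registrable domain is foo.com.au
```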


我最亲爱的 2024-08-02 17:22:57


Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

笑叹一世浮沉 2024-08-02 17:22:57

不,没有“内在”方法可以知道(例如)zap.co.it 是一个子域名(因为意大利的注册商确实出售 co.it 等域名)而 zap.co.uk 不是(因为英国的注册商不出售 co.uk 等域名,而只出售<代码>zap.co.uk)。

您只需使用辅助表(或在线资源)来告诉您哪些 TLD 的行为与英国和澳大利亚的 TLD 非常相似——如果没有这些额外的语义知识,则无法通过仅盯着字符串来推断出这一点(当然它可以最终会改变,但如果你能找到一个好的在线资源,那么该资源也会相应地改变,希望如此!-)。

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
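The point can be seen in a quick sketch: the two hostnames are structurally identical, so no string-only rule can treat them differently:

```python
from urllib.parse import urlparse

def last_two_labels(url):
    # The string-only heuristic: keep the final two labels of the host.
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(last_two_labels("http://zap.co.it"))  # co.it  -- happens to be right (co.it is registrable)
print(last_two_labels("http://zap.co.uk"))  # co.uk  -- wrong: the registrable domain is zap.co.uk
```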

绾颜 2024-08-02 17:22:57


Using this file of effective tlds which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
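On the question of a more pythonic shape: one possible variant (a sketch only, with a tiny hand-made rule set standing in for the full Public Suffix List file) keeps the rules in a set for O(1) membership tests and iterates over suffix start positions directly:

```python
from urllib.parse import urlparse

def get_domain(url, tlds):
    # tlds is a set of Public Suffix List rules: plain suffixes,
    # "*." wildcards and "!" exceptions.
    labels = urlparse(url).netloc.split(".")
    for i in range(len(labels)):                     # longest suffix first
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in tlds:                  # exception rule wins outright
            return candidate
        if candidate in tlds or wildcard in tlds:
            return ".".join(labels[max(i - 1, 0):])  # one label above the suffix
    raise ValueError("Domain not in list of TLDs: %r" % url)

# Hand-made rules for illustration only; load the real list in practice.
rules = {"uk", "co.uk", "jp", "*.tokyo.jp", "!metro.tokyo.jp"}
print(get_domain("http://abcde.co.uk", rules))    # abcde.co.uk
print(get_domain("http://metro.tokyo.jp", rules)) # metro.tokyo.jp
```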

小草泠泠 2024-08-02 17:22:57


Using python tld

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as string from the URL given

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first level domain name as string from the URL given

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'
不必你懂 2024-08-02 17:22:57


Until get_tld is updated for all the new TLDs, I pull the tld from the error. Sure, it's bad code, but it works.

import re
from tld import get_tld

def get_tld_from_url(url):
    # Naming this get_tld would shadow the import and recurse into itself,
    # so the wrapper needs its own name.
    try:
        return get_tld(url)
    except Exception as e:
        # Pull the offending domain out of tld's error message.
        match = re.search(r"Domain ([^ ]+) didn't match any existing TLD name!", str(e))
        if match:
            return match.group(1)
        raise
荒岛晴空 2024-08-02 17:22:57


Here's how I handle it:

import re
import sys
from urllib.parse import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url).netloc
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)
微暖i 2024-08-02 17:22:57


In Python I used to use tldextract, until it failed on a url like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'.

So finally, I decided to write this method

IMPORTANT NOTE: this only works with urls that have a subdomain in them. This isn't meant to replace more advanced libraries like tldextract

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise ValueError("Full url required with subdomain: %s" % url)
    return {
        'subdomain': url_split[0],
        'domain': url_split[1],
        'suffix': ".".join(url_split[2:]),
    }