How to extract the top-level domain (TLD) from a URL

Posted on 2024-07-26 17:22:57


How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top-Level Domains) or country codes (because they change)?

Thanks
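For reference, the same naive attempt in Python 3 (where the urlparse module moved into urllib.parse) shows the failure mode directly:

```python
from urllib.parse import urlparse

# The naive rule: keep only the last two labels of the hostname.
url = "http://www.foo.com.au"
naive = ".".join(urlparse(url).netloc.split(".")[-2:])
print(naive)  # com.au -- wrong: the registrable domain is foo.com.au
```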


我最亲爱的 2024-08-02 17:22:57


Here's a great python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

笑叹一世浮沉 2024-08-02 17:22:57

不,没有“内在”方法可以知道(例如)zap.co.it 是一个子域名(因为意大利的注册商确实出售 co.it 等域名)而 zap.co.uk 不是(因为英国的注册商不出售 co.uk 等域名,而只出售<代码>zap.co.uk)。

您只需使用辅助表(或在线资源)来告诉您哪些 TLD 的行为与英国和澳大利亚的 TLD 非常相似——如果没有这些额外的语义知识,则无法通过仅盯着字符串来推断出这一点(当然它可以最终会改变,但如果你能找到一个好的在线资源,那么该资源也会相应地改变,希望如此!-)。

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
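The point can be seen in a quick sketch: the two hostnames are structurally identical, so no string-only rule can treat them differently:

```python
from urllib.parse import urlparse

def last_two_labels(url):
    # The string-only heuristic: keep the final two labels of the host.
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(last_two_labels("http://zap.co.it"))  # co.it  -- happens to be right (co.it is registrable)
print(last_two_labels("http://zap.co.uk"))  # co.uk  -- wrong: the registrable domain is zap.co.uk
```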

绾颜 2024-08-02 17:22:57


Using this file of effective tlds which someone else found on Mozilla's website:

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url).netloc.split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc.

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

results in:

abcde.co.uk

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
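On the question of a more pythonic shape: one possible variant (a sketch only, with a tiny hand-made rule set standing in for the full Public Suffix List file) keeps the rules in a set for O(1) membership tests and iterates over suffix start positions directly:

```python
from urllib.parse import urlparse

def get_domain(url, tlds):
    # tlds is a set of Public Suffix List rules: plain suffixes,
    # "*." wildcards and "!" exceptions.
    labels = urlparse(url).netloc.split(".")
    for i in range(len(labels)):                     # longest suffix first
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in tlds:                  # exception rule wins outright
            return candidate
        if candidate in tlds or wildcard in tlds:
            return ".".join(labels[max(i - 1, 0):])  # one label above the suffix
    raise ValueError("Domain not in list of TLDs: %r" % url)

# Hand-made rules for illustration only; load the real list in practice.
rules = {"uk", "co.uk", "jp", "*.tokyo.jp", "!metro.tokyo.jp"}
print(get_domain("http://abcde.co.uk", rules))    # abcde.co.uk
print(get_domain("http://metro.tokyo.jp", rules)) # metro.tokyo.jp
```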

小草泠泠 2024-08-02 17:22:57


Using python tld

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as string from the URL given

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

or without protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first level domain name as string from the URL given

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'
不必你懂 2024-08-02 17:22:57


Until get_tld is updated for all the new TLDs, I pull the tld from the error. Sure, it's bad code, but it works.

import re
from tld import get_tld

def get_tld_from_url(url):
    # Naming this get_tld would shadow the import and recurse into itself,
    # so the wrapper needs its own name.
    try:
        return get_tld(url)
    except Exception as e:
        # Pull the offending domain out of tld's error message.
        match = re.search(r"Domain ([^ ]+) didn't match any existing TLD name!", str(e))
        if match:
            return match.group(1)
        raise
荒岛晴空 2024-08-02 17:22:57


Here's how I handle it:

import re
import sys
from urllib.parse import urlparse

if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url).netloc
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)
微暖i 2024-08-02 17:22:57


In Python I used to use tldextract, until it failed on a url like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'.

So finally, I decided to write this method

IMPORTANT NOTE: this only works with urls that have a subdomain in them. This isn't meant to replace more advanced libraries like tldextract

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise ValueError("Full url required with subdomain: %s" % url)
    return {
        'subdomain': url_split[0],
        'domain': url_split[1],
        'suffix': ".".join(url_split[2:]),
    }