How to extract the top-level domain (TLD) from a URL
How would you extract the domain name from a URL, excluding any subdomains?
My initial simplistic attempt was:
'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])
This works for http://www.foo.com, but not for http://www.foo.com.au.
Is there a way to do this properly without using special knowledge about valid TLDs (Top Level Domains) or country codes (because they change)?
Thanks
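To make the failure concrete, here is a minimal sketch of the same attempt written for Python 3, where urlparse lives in urllib.parse. The function name naive_domain is made up for illustration.

```python
from urllib.parse import urlparse

def naive_domain(url):
    """Join the last two dot-separated labels of the URL's host."""
    host = urlparse(url).netloc
    return ".".join(host.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not the wanted foo.com.au
```

The second call shows the problem: the string alone does not say whether the suffix is one label or two.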
Here's a great Python module someone wrote to solve this problem after seeing this question:
https://github.com/john-kurkowski/tldextract
The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.
No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, but only like zap.co.uk).
You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source that source will also change accordingly, one hopes!-).
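A minimal sketch of the auxiliary-table idea, assuming the URL includes a scheme. The table MULTI_LABEL_SUFFIXES and the function registered_domain are hypothetical names, and the three entries are only a hand-picked sample, not a complete list.

```python
from urllib.parse import urlparse

# Hypothetical, hand-picked sample; a real table would be far larger
# and would need to be refreshed as registries change their policies.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "org.uk"}

def registered_domain(url):
    """Return the registrable domain using the auxiliary table above."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, so "co.uk" wins over "uk".
    for size in range(len(labels) - 1, 1, -1):
        if ".".join(labels[-size:]) in MULTI_LABEL_SUFFIXES:
            return ".".join(labels[-(size + 1):])
    # Not in the table: assume a single-label suffix such as "com" or "it".
    return ".".join(labels[-2:])
```

With this sample table, www.zap.co.uk maps to zap.co.uk, while zap.co.it maps to co.it, which matches the distinction drawn above. The result is only as good as the table behind it.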
Using this file of effective TLDs which someone else found on Mozilla's website:
results in:
I'd appreciate it if someone let me know which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?
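As a rough illustration of the approach, and not the code this answer refers to, here is a sketch that matches a host against rules in the Public Suffix List format. SAMPLE_RULES is a tiny hypothetical excerpt standing in for the full file, and load_rules and domain_of are made-up names. The sketch raises ValueError when nothing matches, whereas the real list defines a default rule for unknown TLDs.

```python
from urllib.parse import urlparse

# Tiny hypothetical excerpt in Public Suffix List format. A real run would
# load the full file; "*." is a wildcard rule and "!" marks an exception.
SAMPLE_RULES = """
// comment lines start with two slashes
com
uk
co.uk
au
com.au
*.ck
!www.ck
"""

def load_rules(text):
    """Return the set of rule strings, skipping blanks and comments."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules

def domain_of(url, rules):
    """Return the registrable domain of url, or raise ValueError."""
    labels = urlparse(url).hostname.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                break  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    raise ValueError("no registrable domain found in %r" % url)
```

Slicing labels[i:] inside a single loop avoids building a separate list of trailing elements, which is one possible answer to the iteration question above.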
Using the Python tld package:
https://pypi.python.org/pypi/tld
Install
Get the TLD name as a string from the given URL
or without the protocol
Get the TLD as an object
Get the first-level domain name as a string from the given URL
There are many, many TLDs. Here's the list:
http://data.iana.org/TLD/tlds-alpha-by-domain.txt
Here's another list:
http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
Here's another list:
http://www.iana.org/domains/root/db/
Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, it's bad code, but it works.
Here's how I handle it:
In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'! So finally, I decided to write this method.
IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.
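One way a helper with that limitation could work, purely as an assumption since only the constraint is stated here, is to drop the leftmost label of the host. The name strip_first_label is hypothetical.

```python
from urllib.parse import urlparse

def strip_first_label(url):
    """Drop the leftmost host label, e.g. "www" in www.mybrand.sa.com."""
    host = urlparse(url).hostname
    # Everything after the first dot; wrong if the host has no subdomain.
    return host.split(".", 1)[1]

print(strip_first_label("http://www.mybrand.sa.com"))  # mybrand.sa.com
```

For a host without a subdomain, such as mybrand.sa.com, this would wrongly return sa.com, which is why the restriction matters.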