I need to write a function to parse variables which contain domain names. It's best explained with an example; the variable could contain any of these:
here.example.com
example.com
example.org
here.example.org
But when passed through my function all of these must return either example.com or example.co.uk, the root domain name basically. I'm sure I've done this before but I've been searching Google for about 20 minutes and can't find anything. Any help would be appreciated.
EDIT: Ignore the .co.uk, presume that all domains going through this function have a 3 letter TLD.
This is a fast, simple solution, without external calls or checks against predefined arrays. Unlike the most popular answer, it also works for new domains like "www.domain.gallery".
I would do something like the following:
I ended up using the database Mozilla has.
Here's my code:
fetch_mozilla_tlds.php contains the caching algorithm. This line is important:
The main file used inside the application is this:
UPDATE:
The database has evolved and is now available at its own website - http://publicsuffix.org/
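Since the code from this answer is not shown here, the following is only a minimal sketch of how rules in the Public Suffix List format could be matched. The function names `parse_suffix_rules` and `registrable_domain` are hypothetical, and the inline sample holds just a handful of rules instead of the real list. It assumes the documented rule types: plain rules, wildcard rules (`*.ck`) and exception rules (`!www.ck`).

```php
<?php
// Hypothetical helpers for illustration; not the code from the answer above.

// Turn Public Suffix List text into a lookup table of rules.
function parse_suffix_rules(string $listText): array
{
    $rules = [];
    foreach (preg_split('/\R/', $listText) as $line) {
        $line = trim($line);
        // Skip blank lines and "//" comment lines.
        if ($line === '' || strpos($line, '//') === 0) {
            continue;
        }
        // A rule ends at the first whitespace character.
        $rule = strtolower(preg_split('/\s+/', $line)[0]);
        $rules[$rule] = true;
    }
    return $rules;
}

// Return the public suffix plus one label, or null if the host has no
// label in front of its public suffix.
function registrable_domain(string $host, array $rules): ?string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);
    $suffixLength = 1; // fallback when no rule matches: last label only

    // Candidates go from longest to shortest, so the first hit wins.
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        $wildcard = '*.' . implode('.', array_slice($labels, $i + 1));
        if (isset($rules['!' . $candidate])) {
            // Exception rule: the suffix is the rule minus its first label.
            $suffixLength = $count - $i - 1;
            break;
        }
        if (isset($rules[$candidate]) || ($i + 1 < $count && isset($rules[$wildcard]))) {
            $suffixLength = $count - $i;
            break;
        }
    }

    if ($count <= $suffixLength) {
        return null;
    }
    return implode('.', array_slice($labels, -($suffixLength + 1)));
}

// Sample rules only; the full list is published at publicsuffix.org.
$sample = <<<'PSL'
// sample rules
com
org
uk
co.uk
*.ck
!www.ck
PSL;

$rules = parse_suffix_rules($sample);
echo registrable_domain('here.example.com', $rules), "\n";
echo registrable_domain('www.bbc.co.uk', $rules), "\n";
```

In a real application the rule table would be built once from a cached copy of the list, as the answer describes, rather than from an inline string.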
Almost certainly, what you're looking for is this:
https://github.com/Synchro/regdom-php
It's a PHP library that uses the (as nearly as is practical) full list of TLDs collected at publicsuffix.org/list/, and wraps it up in a spiffy little function.
Once the library is included, it's as easy as:
$registeredDomain = getRegisteredDomain( $domain );
There are two ways to extract the subdomain from a host:
The first, more accurate method is to use a database of TLDs (like public_suffix_list.dat) and match the domain against it. This is a little heavy in some cases. There are PHP classes for this, such as php-domain-parser and TLDExtract.
The second way is not as accurate as the first, but it is very fast and gives the correct answer in many cases. I wrote this function for it:
Example:
Returns:
Array
(
    [protocol] => https
    [subdomain] => mysubdomain
    [domain] => domain.co.uk
    [host] => domain
    [tld] => co.uk
)
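The function itself is not preserved here, so this is only a rough sketch of what such a fast, list-free approach might look like. It returns the same five keys (protocol, subdomain, domain, host, tld). The name `split_host_parts` and the heuristic are assumptions: a two-letter final label preceded by a label of at most three characters is treated as a two-part TLD. That rule misfires on short registered names such as www.db.de, which is the accuracy trade-off mentioned above.

```php
<?php
// Hypothetical sketch; the heuristic is an assumption, not the answer's code.
function split_host_parts(string $url): array
{
    // parse_url() only reports a host when a scheme or a leading "//" exists.
    if (strpos($url, '//') === false) {
        $url = '//' . $url;
    }
    $parts = parse_url($url);
    $labels = explode('.', strtolower($parts['host'] ?? ''));
    $count = count($labels);

    // Assumed heuristic: "co.uk", "com.au", "ne.jp" style endings count as
    // a two-label TLD; everything else ends in a one-label TLD.
    $tldLabels = 1;
    if ($count >= 3
        && strlen($labels[$count - 1]) === 2
        && strlen($labels[$count - 2]) <= 3) {
        $tldLabels = 2;
    }

    $tld = implode('.', array_slice($labels, -$tldLabels));
    // The label directly in front of the TLD, if there is one.
    $host = $count > $tldLabels ? $labels[$count - $tldLabels - 1] : '';
    // Everything in front of that label.
    $subdomain = implode('.', array_slice($labels, 0, max(0, $count - $tldLabels - 1)));

    return [
        'protocol'  => $parts['scheme'] ?? '',
        'subdomain' => $subdomain,
        'domain'    => $host === '' ? '' : $host . '.' . $tld,
        'host'      => $host,
        'tld'       => $tld,
    ];
}

print_r(split_host_parts('https://mysubdomain.domain.co.uk/'));
```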
This is a short way of accomplishing that:
Based on http://www.cafewebmaster.com/find-top-level-domain-international-urls-php
As a variant of Jonathan Sampson's answer:
This script generates a Perl file containing a single function, get_domain, from the ETLD file. So say you have hostnames like img1, img2, img3, ... in .photobucket.com. For each of those, get_domain $host would return photobucket.com. Note that this isn't the fastest function on earth, so in my main log parser that uses it, I keep a hash of host-to-domain mappings and only run this for hosts that aren't in the hash yet.
As already said, the Public Suffix List is only one way to parse a domain correctly. I recommend the TLDExtract package; here is sample code:
This isn't foolproof and should only really be used if you know the domain isn't going to be anything obscure, but it's easier to read than most of the other options:
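The snippet for this answer is missing here; a readable but non-foolproof approach of this kind could be as small as keeping the last two dot-separated labels. The helper name `last_two_labels` is hypothetical. This only fits the question's assumption of single-label TLDs: for www.bbc.co.uk it would return co.uk.

```php
<?php
// Hypothetical helper: keep only the last two dot-separated labels.
function last_two_labels(string $host): string
{
    $labels = explode('.', $host);
    // array_slice() with a negative offset counts from the end; a host
    // with fewer than two labels is returned unchanged.
    return implode('.', array_slice($labels, -2));
}

echo last_two_labels('here.example.com'), "\n";
echo last_two_labels('example.org'), "\n";
```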
Regex could help you out there. Try something like this:
([^.]+(\.com|\.co\.uk))$
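As a sketch of how this pattern might be used from PHP, the hypothetical wrapper below passes it to preg_match with the dots escaped, so that they match a literal "." rather than any character (unescaped, the pattern would also accept a string like "examplexcom"). Only the two suffixes named in the pattern are recognised; anything else returns null.

```php
<?php
// Hypothetical wrapper around the pattern above.
function root_domain_by_regex(string $host): ?string
{
    // One label followed by ".com" or ".co.uk" at the end of the string.
    $pattern = '/([^.]+(\.com|\.co\.uk))$/i';
    return preg_match($pattern, $host, $m) === 1 ? $m[1] : null;
}

echo root_domain_by_regex('here.example.com'), "\n";
echo root_domain_by_regex('www.bbc.co.uk'), "\n";
```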
I think your problem is that you haven't clearly defined what exactly you want the function to do. From your examples, you certainly don't want it to just blindly return the last two, or last three, components of the name, but just knowing what it shouldn't do isn't enough.
Here's my guess at what you really want: there are certain second-level domain names, like co.uk., that you'd like to be treated as a single TLD (top-level domain) for purposes of this function. In that case I'd suggest enumerating all such cases and putting them as keys into an associative array with dummy values, along with all the normal top-level domains like com., net., info., etc. Then whenever you get a new domain name, extract the last two components and see if the resulting string is in your array as a key. If not, extract just the last component and make sure that's in your array. (If even that isn't, it's not a valid domain name.) Either way, whatever key you do find in the array, take that plus one more component off the end of the domain name, and you'll have your base domain.
You could, perhaps, make things a bit simpler by writing a function, instead of using an associative array, to tell whether the last two components should be treated as a single "effective TLD." The function would probably look at the next-to-last component and, if it's shorter than 3 characters, decide that it should be treated as part of the TLD.
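The associative-array lookup described above can be sketched roughly as follows. The function name `base_domain` and the contents of `$effectiveTlds` are illustrative assumptions: a real table would have to list every suffix you care about. For simplicity the keys are stored without the trailing dot, and a trailing dot on the input is stripped.

```php
<?php
// Illustrative subset only; keys are effective TLDs, values are dummies.
$effectiveTlds = [
    'com' => true, 'net' => true, 'org' => true, 'info' => true,
    'uk' => true, 'co.uk' => true, 'org.uk' => true,
];

// Hypothetical helper following the two-step lookup described above.
function base_domain(string $host, array $effectiveTlds): ?string
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count = count($labels);

    // Try the last two components first, then only the last one.
    foreach ([2, 1] as $n) {
        if ($count <= $n) {
            continue; // no component left to put in front of the TLD
        }
        $tld = implode('.', array_slice($labels, -$n));
        if (isset($effectiveTlds[$tld])) {
            // The matched TLD plus one more component is the base domain.
            return implode('.', array_slice($labels, -($n + 1)));
        }
    }
    return null; // unknown TLD: not treated as a valid domain name
}

echo base_domain('here.example.com', $effectiveTlds), "\n";
echo base_domain('www.bbc.co.uk', $effectiveTlds), "\n";
```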
To do it well, you'll need a list of the second-level and top-level domains, and to build an appropriate list of regular expressions. A good list of second-level domains is available at https://wiki.mozilla.org/TLD_List. Another test case, apart from the aforementioned CentralNic .uk.com variants, is the Vatican: their website is technically at http://va, and that's a difficult one to match on!
Building on Jonathan's answer:
His expression might be a bit better, but this interface seems more like what you're describing.
Ah - if you just want to handle three character top level domains - then this code works:
$matches[1]/$matches[2] will contain any subdomain and/or protocol, $matches[3] contains the domain name, $matches[4] the top level domain and $matches[5] contains any other URL path information.
To match most common top level domains you could try changing it to:
Or to get it coping with everything:
etc etc
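The patterns from this answer are not preserved here. As a hedged illustration of the three-letter-TLD case only, the sketch below uses a pattern of its own with named groups, so its group layout does not correspond to the $matches indexes described above. The function name `parse_three_letter_tld` is hypothetical, and ports or user info in the URL are not handled.

```php
<?php
// Hypothetical pattern for hosts that end in a three-letter TLD.
function parse_three_letter_tld(string $url): ?array
{
    $pattern = '~^(?:(?<scheme>[a-z][a-z0-9+.-]*)://)?' // optional scheme
             . '(?:(?<subdomain>[a-z0-9.-]+)\.)?'       // optional subdomain labels
             . '(?<name>[a-z0-9-]+)\.'                  // registered name
             . '(?<tld>[a-z]{3})'                       // exactly three letters
             . '(?<path>[/?#].*)?$~i';                  // optional path, query, fragment

    if (preg_match($pattern, $url, $m) !== 1) {
        return null;
    }
    return [
        'scheme'    => $m['scheme'] ?? '',
        'subdomain' => $m['subdomain'] ?? '',
        'domain'    => $m['name'] . '.' . $m['tld'],
        'path'      => $m['path'] ?? '',
    ];
}

print_r(parse_three_letter_tld('http://here.example.com/page'));
```

Widening the `tld` group to an alternation of known TLDs is one way to extend a pattern like this, at the cost of maintaining that list by hand.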
Here is what I am using:
It works great without needing any arrays for tld's
It is not possible without using a TLD list to compare against, as there exist many cases like
http://www.db.de/ or http://bbc.co.uk/
But even with that you won't have success in every case because of SLD's like http://big.uk.com/ or http://www.uk.com/
If you need a complete list you can use the public suffix list:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
Feel free to use my function. It won't use regex and it is fast:
http://www.programmierer-forum.de/domainnamen-ermitteln-t244185.htm#3471878
Here is how you strip the TLD from any URL - I wrote the code to work on my site:
http://internet-portal.me/ - This is a working solution that is used on my site.
$host is the URL that has to be parsed. This code is a simple solution, and reliable compared to everything else I have seen. It works on any URL that I have tried!
See this code parsing the page you are looking at right now!
http://internet-portal.me/domain/?dns=https://stackoverflow.com/questions/1201194/php-getting-domain-name-from-subdomain/6320437#6320437
No need to list all the country TLDs: they are all 2 letters, besides the special ones listed by IANA.
https://gist.github.com/pocesar/5366899
and the tests are here http://codepad.viper-7.com/QfueI0
A comprehensive test suite along with working code. The only caveat is that it won't work with Unicode domain names, but that's another level of data extraction.
From the list, I'm testing against:
This is to get domain.tld in any case
My version also returns the protocol
NO NEED FOR REGEX. There exists native parse_url:
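The code for this answer is missing here, so the following is just a small sketch of isolating the host with PHP's built-in parse_url(). The helper name `host_from_url` is hypothetical. Note that parse_url() only separates the host from the rest of the URL; the host still has to be reduced to its root domain by one of the approaches in this thread.

```php
<?php
// Hypothetical helper built on PHP's native parse_url().
function host_from_url(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) && strpos($url, '://') === false) {
        // Without a scheme, parse_url() treats the input as a path, so
        // retry it as a protocol-relative URL.
        $host = parse_url('//' . $url, PHP_URL_HOST);
    }
    return is_string($host) ? strtolower($host) : null;
}

echo host_from_url('http://here.example.com/page?x=1'), "\n";
echo host_from_url('here.example.org/page'), "\n";
```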