Going where PHP parse_url() doesn't - Parsing only the domain

Posted 2024-07-10 18:17:21

PHP's parse_url() has a host field, which includes the full host. I'm looking for the most reliable (and least costly) way to only return the domain and TLD.

Given the examples:

I am looking for only google.com or google.co.uk. I have contemplated keeping a table of valid TLDs/suffixes and allowing only those plus one word. Would you do it any other way? Does anyone know of a pre-canned, valid regex for this sort of thing?

Comments (8)

提笔落墨 2024-07-17 18:17:21

How about something like this?

function getDomain($url) {
  $pieces = parse_url($url);
  $domain = isset($pieces['host']) ? $pieces['host'] : '';
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
    return $regs['domain'];
  }
  return false;
}

This extracts the host with the classic parse_url() and then looks for a valid domain without any subdomain (www being a subdomain). It won't work on things like 'localhost', and it returns false if nothing matches.

// Edit:

Try it out with:

echo getDomain('http://www.google.com/test.html') . '<br/>';
echo getDomain('https://news.google.co.uk/?id=12345') . '<br/>';
echo getDomain('http://my.subdomain.google.com/directory1/page.php?id=abc') . '<br/>';
echo getDomain('https://testing.multiple.subdomain.google.co.uk/') . '<br/>';
echo getDomain('http://nothingelsethan.com') . '<br/>';

And it should return:

google.com
google.co.uk
google.com
google.co.uk
nothingelsethan.com

Of course, it won't return anything if it doesn't get through parse_url, so make sure it's a well-formed URL.

// Addendum:

Alnitak is right. The solution presented above will work in most cases but not necessarily all, and it needs to be maintained to make sure, for example, that there aren't new TLDs with more than 6 characters and so on. The only reliable way of extracting the domain is to use a maintained list such as http://publicsuffix.org/. It's more painful at first but easier and more robust in the long term. You need to make sure you understand the pros and cons of each method and how it fits with your project.
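
To make the list-based approach concrete, here is a minimal sketch. It assumes a tiny hand-written suffix table; the names registrableDomain() and $suffixes are made up for illustration, and a real table would be loaded from the Public Suffix List rather than typed in.

```php
<?php
// Hypothetical, hand-picked suffixes for illustration only; a real table
// would be built from the Public Suffix List.
$suffixes = ['com' => true, 'uk' => true, 'co.uk' => true];

/**
 * Return one label plus the longest known suffix,
 * or false if the URL has no host or no suffix matches.
 */
function registrableDomain(string $url, array $suffixes)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }
    $labels = explode('.', strtolower($host));
    // Try the longest candidate suffix first, always leaving at least
    // one label in front of it.
    for ($i = 1; $i < count($labels); $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (isset($suffixes[$suffix])) {
            return $labels[$i - 1] . '.' . $suffix;
        }
    }
    return false;
}

echo registrableDomain('https://news.google.co.uk/?id=12345', $suffixes), "\n"; // google.co.uk
echo registrableDomain('http://www.ab.com/', $suffixes), "\n";                  // ab.com
```

The second call is a case the regex above gets wrong: "ab.com" fits inside the {2,6} suffix part, so the regex returns www.ab.com.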

记忆消瘦 2024-07-17 18:17:21

Currently the only "right" way to do this is to use a list such as that maintained at http://publicsuffix.org/

BTW, this question is also pretty much a duplicate of:

There are standardisation efforts at IETF looking at DNS methods of declaring whether a particular node in the DNS tree is used for "public" registrations, but they're in their early stages of development. All of the popular non-IE browsers use the publicsuffix.org list.

做个ˇ局外人 2024-07-17 18:17:21

There is also a very nice port of Python's tldextract module: http://w-shadow.com/blog/2012/08/28/tldextract. It goes beyond parse_url and lets you actually get the domain/TLD out, without the subdomain.

From the module website:

$components = tldextract('http://www.bbc.co.uk');
echo $components->subdomain; // www
echo $components->domain;    // bbc
echo $components->tld;       // co.uk
云淡风轻 2024-07-17 18:17:21

Dug this up from a related post, for the idea of keeping a table: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/src/effective_tld_names.dat?raw=1

I'd rather not do that though.
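
For anyone who does go the table route, a file in that format can be turned into a lookup table with a few lines. This is only a sketch: parseSuffixList() is a made-up name, the sample text below is not the real file, and wildcard ("*.") and exception ("!") rules are deliberately skipped, so a complete implementation needs more than this.

```php
<?php
/**
 * Turn text in the effective_tld_names.dat format into a lookup table.
 * Simplification: wildcard ("*.") and exception ("!") rules are skipped.
 */
function parseSuffixList(string $text): array
{
    $table = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // A rule is everything up to the first whitespace on the line.
        $rule = strtolower(preg_split('/\s+/', trim($line))[0]);
        if ($rule === '' || strpos($rule, '//') === 0) {
            continue; // blank line or comment
        }
        if ($rule[0] === '*' || $rule[0] === '!') {
            continue; // not handled in this sketch
        }
        $table[$rule] = true;
    }
    return $table;
}

// Sample lines in the list's format (an assumption, not the real file).
$sample = <<<'TXT'
// generic
com

// uk
uk
co.uk
*.ck
!www.ck
TXT;

print_r(array_keys(parseSuffixList($sample)));
```

The resulting array can be cached and used with isset() for constant-time suffix checks.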

才能让你更想念 2024-07-17 18:17:21

You need a package that uses the Public Suffix List. Only that way can you correctly extract domains with second- and third-level TLDs (co.uk, a.bg, b.bg, etc.) and multilevel subdomains. Regex, parse_url() or string functions will never produce an absolutely correct result.

I recommend using TLD Extract. Here is a code example:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('http://www.google.co.uk/foo');
$result->getSubdomain(); // will return (string) 'www'
$result->getHostname(); // will return (string) 'google'
$result->getSuffix(); // will return (string) 'co.uk'
$result->getRegistrableDomain(); // will return (string) 'google.co.uk'
对你的占有欲 2024-07-17 18:17:21

Of course it depends on your specific use case, but generally speaking I would not use a table lookup for TLDs. New TLDs come out and you usually don't want to maintain them anywhere. Just ask me how often my [email protected] has been rejected because of shortsightedness.

I guess I could help better if I knew why you don't want the www. Do you need it for emails? In that case you can query for MX records to verify that the domain (eventually) accepts mail.

You may also find help with PHP functions dealing with DNS records to find out more information about them, see http://php.net/dns_get_record for example.
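
As a rough sketch of the MX idea, the helper below reports whether a host publishes at least one MX record. acceptsMail() and its $lookup parameter are hypothetical names; $lookup exists only so the function can be tried without network access and defaults to PHP's dns_get_record(). The host names in the stub are invented.

```php
<?php
/**
 * Return true if $host publishes at least one MX record.
 * $lookup is a hypothetical injection point for testing; by default the
 * real dns_get_record() is used, which needs working DNS.
 */
function acceptsMail(string $host, ?callable $lookup = null): bool
{
    $lookup = $lookup ?? 'dns_get_record';
    $records = $lookup($host, DNS_MX);
    return is_array($records) && count($records) > 0;
}

// Stub resolver with made-up data, standing in for dns_get_record().
$stub = function (string $host, int $type): array {
    return $host === 'mail.example'
        ? [['host' => $host, 'type' => 'MX', 'pri' => 10, 'target' => 'mx.mail.example']]
        : [];
};

var_dump(acceptsMail('mail.example', $stub));
var_dump(acceptsMail('nomail.example', $stub));
```

In real use you would call acceptsMail($host) without the second argument. An MX record only shows that the domain advertises a mail server, not that a given address exists.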

故人的歌 2024-07-17 18:17:21

Just a proof of concept, assuming the allowed TLDs are stored in a hash.
The code can be shortened a lot.

<?php
    // $theUrl and $tldSuffixesAllowed (suffix => true) are assumed to be set by the caller
    $urlComponents = parse_url($theUrl);
    $chunk = explode('.', $urlComponents['host']);

    $maxTldLen = 2; // assuming a tld can be in the form .com or .co.uk
    $cursor = min($maxTldLen, count($chunk) - 1); // leave at least one chunk for the domain
    $found = false;
    while ($cursor >= 1 && !$found) { // longest suffix first, so co.uk is tried before uk
      $tld = implode('.', array_slice($chunk, -$cursor));
      $found = isset($tldSuffixesAllowed[$tld]);
      if (!$found) {
        $cursor--;
      }
    }
    if ($found) {
       // one chunk in front of the matched suffix, e.g. google.co.uk
       $domain = implode('.', array_slice($chunk, -($cursor + 1)));
    } else {
       // domain not recognized, do whatever you want
    }
?>
辞别 2024-07-17 18:17:21

There is a really easy solution to this:

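function get_domain($url) {
  $pieces = parse_url($url);
  // array_pop() takes its argument by reference, so it needs a variable
  $parts = explode('.', $pieces['host'], 2);
  // drops only the first label: www.google.com -> google.com, but google.com -> com
  return array_pop($parts);
}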

Surely this will work?
