JavaScript/Regex for finding only the root domain name, without subdomains

Posted on 2024-09-13 16:27:30


I had a search and found lots of similar regex examples, but not quite what I need.

I want to be able to pass in the following urls and return the results:

  • www.google.com returns google.com

  • sub.domains.are.cool.google.com returns google.com

  • doesntmatterhowlongasubdomainis.idont.wantit.google.com
    returns google.com

  • sub.domain.google.com/no/thanks returns google.com

Hope that makes sense :)
Thanks in advance! -James


Comments (7)

风苍溪 2024-09-20 16:27:30


You can't do this with a regular expression because you don't know how many blocks are in the suffix.

For example google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks - one for the suffix and one for google.

If you apply this logic to subdomain.google.co.uk though you would end up with co.uk.

You will actually need to look up the suffix from a list like http://publicsuffix.org/
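As a rough sketch of what a suffix-list lookup could look like: the `SAMPLE_SUFFIXES` set and the `rootDomain` helper below are made up for illustration, and the set is a tiny hand-picked stand-in for the real list, which is far larger and has wildcard and exception rules this sketch ignores.

```javascript
// Hypothetical, hand-picked sample. The real list at publicsuffix.org is much larger.
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'net', 'uk', 'co.uk']);

// Hypothetical helper: returns the suffix plus one label, or null if no suffix is known.
function rootDomain(input) {
  // Drop an optional scheme, then cut at the first "/", "?", "#" or ":" (path, query, port).
  const host = input
    .replace(/^[a-z][a-z0-9+.-]*:\/\//i, '')
    .split(/[/?#:]/)[0]
    .toLowerCase();
  const labels = host.split('.');

  // Walk from the longest candidate suffix to the shortest, so "co.uk" wins over "uk".
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SAMPLE_SUFFIXES.has(suffix)) {
      // If the whole host is itself a suffix, there is no registrable domain.
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null; // suffix not in the sample list
}

console.log(rootDomain('subdomain.google.com'));   // google.com
console.log(rootDomain('subdomain.google.co.uk')); // google.co.uk
```

The point of the loop order is the one made above: the number of blocks in the suffix is looked up, not guessed.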

迷鸟归林 2024-09-20 16:27:30


Don't use regex, use the .split() method and work from there.

var s = domain.split('.');

If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:

return s.slice(-2).join('.');

It'll make your eyes bleed less than any regex solution.
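Put together, the two snippets above might look something like this. It is only a sketch for a narrow use case: `TWO_PART_SUFFIXES` and `lastSegments` are names invented here, the list is whatever your own data needs, and the input is assumed to have no `http://` prefix.

```javascript
// Hypothetical list of the two-part suffixes this particular use case cares about.
const TWO_PART_SUFFIXES = ['co.uk', 'com.au'];

function lastSegments(domain) {
  // Drop any path first, then split the host into its dot-separated segments.
  const s = domain.split('/')[0].split('.');
  const lastTwo = s.slice(-2).join('.');
  // Keep 3 segments when the last two form a known two-part suffix, otherwise 2.
  const count = TWO_PART_SUFFIXES.includes(lastTwo) ? 3 : 2;
  return s.slice(-count).join('.');
}

console.log(lastSegments('sub.domain.google.com/no/thanks')); // google.com
console.log(lastSegments('www.google.co.uk'));                // google.co.uk
```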

云醉月微眠 2024-09-20 16:27:30


I know this is an older post, but this regex works well to match:

([^.]+(?:(?:\.[^.]{2,3}){1,2}|\.[^.]+))$

Here's an example of it working:
https://regex101.com/r/2F9pEt/1
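For anyone wanting to try it from JavaScript, here is a small sketch. The `matchRoot` wrapper is hypothetical, and it strips the path before matching because the pattern on its own does not (`[^.]` also matches `/`). One caveat the check turns up: since the pattern only counts characters, a short label followed by a normal TLD looks the same to it as a two-part suffix.

```javascript
const ROOT_RE = /([^.]+(?:(?:\.[^.]{2,3}){1,2}|\.[^.]+))$/;

// Hypothetical wrapper around the pattern above.
function matchRoot(input) {
  const host = input.split('/')[0]; // cut the path; assumes no "http://" prefix
  const m = ROOT_RE.exec(host);
  return m ? m[1] : null;
}

console.log(matchRoot('www.google.com'));                  // google.com
console.log(matchRoot('sub.domain.google.com/no/thanks')); // google.com
console.log(matchRoot('www.google.co.uk'));                // google.co.uk

// Caveat: "bbc" is 3 characters, so ".bbc.com" is treated like ".co.uk".
console.log(matchRoot('www.bbc.com'));                     // www.bbc.com
```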

云之铃。 2024-09-20 16:27:30


I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

EDIT:

To clarify, it's looking for:

one or more alpha-numeric characters or dashes, followed by a literal dot

and then one of three things...

  1. three or more alpha characters (e.g. com/net/mil/coop, etc.)
  2. two alpha characters, followed by a literal dot, followed by two more alphas (e.g. co.uk)
  3. two alpha characters (e.g. us/uk/to, etc.)

and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).

As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
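A quick way to try it is below; this is a sketch only, with `firstMatch` being a throwaway helper name. It exercises each of the three alternatives on a bare domain. It also suggests one of the tuning points: without a `g` flag, `exec` returns the leftmost match, so with a subdomain in front the pattern picks up the first pair of labels rather than the last.

```javascript
const BOUNDARY_RE =
  /([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b/;

// Throwaway helper: returns the first capture group of the leftmost match.
function firstMatch(input) {
  const m = BOUNDARY_RE.exec(input);
  return m ? m[1] : null;
}

console.log(firstMatch('google.com'));   // google.com    (alternative 1)
console.log(firstMatch('google.co.uk')); // google.co.uk  (alternative 2)
console.log(firstMatch('example.us'));   // example.us    (alternative 3)

// With a subdomain, the leftmost match wins and "google" is read as the TLD.
console.log(firstMatch('www.google.com')); // www.google
```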

晚雾 2024-09-20 16:27:30


If you have a limited subset of data, I suggest keeping the regex simple, e.g.

(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))

This will match:

www.google.com --> google.com
www.google.co.uk --> google.co.uk
www.foo-bar.com --> foo-bar.com

In my case, I know that all relevant URLs will be matched using this regex.

Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
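A test script for this does not need to be anything fancy. The sketch below assumes the three sample URLs from above are the dataset; `samples` and `runSamples` are hypothetical names, and the dot in `co.uk` is escaped so it only matches a literal dot.

```javascript
const LIMITED_RE = /(([a-z\-]+)(?:\.com|\.fr|\.co\.uk))/;

// Sample dataset: [input, expected root domain]. Extend with your own URLs.
const samples = [
  ['www.google.com', 'google.com'],
  ['www.google.co.uk', 'google.co.uk'],
  ['www.foo-bar.com', 'foo-bar.com'],
];

// Runs every sample through the regex and returns only the mismatches.
function runSamples(regex, cases) {
  return cases
    .map(([input, expected]) => {
      const m = regex.exec(input);
      return { input, expected, actual: m ? m[1] : null };
    })
    .filter((r) => r.actual !== r.expected);
}

const failures = runSamples(LIMITED_RE, samples);
console.log(failures.length === 0 ? 'all samples passed' : failures);
```

Returning the mismatches rather than a pass/fail flag makes it easier to see which URL broke when the regex is changed later.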

孤星 2024-09-20 16:27:30

([A-Za-z0-9-]+\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))(?!\.([A-Za-z]{3,}|[A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2}))\b

This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites which accounts for a huge chunk of worldwide internet activity. The only time it would fail is potentially with unicode domains, etc.

My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
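Here is a sketch of how this could be run in JavaScript. The TLD part is built once as a string and reused in the lookahead; the `TLD`, `LOOKAHEAD_RE` and `rootOf` names are mine, and the inner groups are non-capturing, which should not change what group 1 returns. The last call shows a pattern-only limitation: a suffix like `com.au` does not fit any of the three TLD shapes.

```javascript
// The TLD alternatives, written once and reused inside the negative lookahead.
const TLD = '(?:[A-Za-z]{3,}|[A-Za-z]{2}\\.[A-Za-z]{2}|[A-Za-z]{2})';
const LOOKAHEAD_RE = new RegExp('([A-Za-z0-9-]+\\.' + TLD + ')(?!\\.' + TLD + ')\\b');

// Hypothetical helper: returns capture group 1 of the leftmost match.
function rootOf(input) {
  const m = LOOKAHEAD_RE.exec(input);
  return m ? m[1] : null;
}

console.log(rootOf('www.google.com'));                  // google.com
console.log(rootOf('sub.domains.are.cool.google.com')); // google.com
console.log(rootOf('sub.domain.google.com/no/thanks')); // google.com
console.log(rootOf('www.google.co.uk'));                // google.co.uk

// Limitation: "com.au" is 3 letters + 2 letters, which none of the TLD shapes cover.
console.log(rootOf('maps.google.com.au'));              // com.au
```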

浪菊怪哟 2024-09-20 16:27:30


Without testing the validity of top level domain, I'm using an adaptation of stormsweeper's solution:

var domain = 'sub.domains.are.cool.google.com';

var s = domain.split('.');

var tld = s.slice(-2).join('.'); // 'google.com'

EDIT: Be careful with domains under two-part suffixes, like domain.co.uk, where the last two segments are just the suffix.
