JavaScript/Regex for finding only the root domain name, without subdomains
I searched and found lots of similar regex examples, but not quite what I need.
I want to be able to pass in the following URLs and get these results:
www.google.com returns google.com
sub.domains.are.cool.google.com returns google.com
doesntmatterhowlongasubdomainis.idont.wantit.google.com returns google.com
sub.domain.google.com/no/thanks returns google.com
Hope that makes sense :)
Thanks in advance! -James
Comments (7)
You can't do this with a regular expression because you don't know how many blocks are in the suffix.
For example, google.com has a suffix of com. To get from subdomain.google.com to google.com you'd have to take the last two blocks: one for the suffix and one for google.
If you apply this logic to subdomain.google.co.uk, though, you would end up with co.uk.
You will actually need to look up the suffix from a list like http://publicsuffix.org/
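A minimal sketch of that lookup, assuming the suffix list has already been loaded into a set. The `PUBLIC_SUFFIXES` set below is a tiny illustrative subset, and `registrableDomain` is a hypothetical helper name. The real list also has wildcard and exception rules, which this sketch ignores.

```javascript
// Illustrative subset only; a real implementation would load the full
// list published at publicsuffix.org.
const PUBLIC_SUFFIXES = new Set(["com", "org", "net", "uk", "co.uk", "org.uk", "com.au"]);

// Hypothetical helper: returns the suffix plus one label, or null.
function registrableDomain(input) {
  // Strip an optional scheme first, then cut at the first path, query,
  // fragment, or port delimiter to leave just the host.
  const host = input
    .trim()
    .toLowerCase()
    .replace(/^[a-z][a-z0-9+.-]*:\/\//, "")
    .split(/[/?#:]/)[0];
  const labels = host.split(".");
  // Try the longest candidate suffix first, so "co.uk" wins over "uk".
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (PUBLIC_SUFFIXES.has(suffix)) {
      // Keep the suffix plus the one label directly in front of it.
      return labels.slice(i - 1).join(".");
    }
  }
  return null; // no known suffix, or the host is itself a bare suffix
}

console.log(registrableDomain("subdomain.google.com"));
console.log(registrableDomain("subdomain.google.co.uk"));
```

Trying longer suffixes first is what keeps subdomain.google.co.uk from collapsing to co.uk.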
Don't use regex, use the .split() method and work from there.
If your use case is fairly narrow you could then check the TLDs as needed, and then return the last 2 or 3 segments as appropriate:
It'll make your eyes bleed less than any regex solution.
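A rough sketch of the split-based approach, assuming a narrow use case where a short hand-maintained list of two-part suffixes is enough. The `TWO_PART_TLDS` set and the `rootDomain` name are illustrative assumptions, not the answerer's code.

```javascript
// Hypothetical list of the two-part suffixes this narrow use case cares about.
const TWO_PART_TLDS = new Set(["co.uk", "com.au", "co.jp"]);

function rootDomain(url) {
  // Drop an optional scheme, then keep only the host part before any path.
  const host = url.replace(/^\w+:\/\//, "").split("/")[0].toLowerCase();
  const parts = host.split(".");
  const lastTwo = parts.slice(-2).join(".");
  // Keep three segments when the last two form a known two-part suffix,
  // otherwise keep two.
  const keep = TWO_PART_TLDS.has(lastTwo) ? 3 : 2;
  return parts.slice(-keep).join(".");
}

console.log(rootDomain("sub.domain.google.com/no/thanks"));
console.log(rootDomain("sub.domain.google.co.uk"));
```

Any suffix missing from the set falls back to the last two segments, so the list has to cover every multi-part suffix the data can contain.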
I know this is an older post, but this regex works well to match:
Here's an example of it working:
https://regex101.com/r/2F9pEt/1
I've not done a lot of testing on this, but if I understand what you're asking for, this should be a decent starting point...
EDIT:
To clarify, it's looking for:
one or more alpha-numeric characters or dashes, followed by a literal dot
and then one of three things...
and at the end of that, a word boundary (\b) meaning the end of the string, a space, or a non-word character (in regex word characters are typically alpha-numerics, and underscore).
As I say, I didn't do much testing, but it seemed a reasonable jumping off point. You'd likely need to try it and tune it some, and even then, it's unlikely that you'll get 100% for all test cases. There are considerations like Unicode domain names and all sorts of technically-valid-but-you'll-likely-not-encounter-in-the-wild things that'll trip up a simple regex like this, but this'll probably get you 90%+ of the way there.
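A pattern with the shape described above might look like the sketch below. The three alternatives used here (a second-level label plus country code such as co.uk, a short list of generic TLDs, and a bare two-letter country code) are assumptions chosen for illustration, as are the names `ROOT_DOMAIN_RE` and `matchRootDomain`.

```javascript
// Label, literal dot, one of three assumed TLD shapes, then a word boundary.
const ROOT_DOMAIN_RE =
  /[a-z0-9-]+\.(?:(?:co|com|org)\.[a-z]{2}|com|net|org|[a-z]{2})\b/i;

function matchRootDomain(url) {
  const m = ROOT_DOMAIN_RE.exec(url);
  // exec returns the leftmost match, or null when nothing matches.
  return m ? m[0].toLowerCase() : null;
}

console.log(matchRootDomain("sub.domains.are.cool.google.com"));
console.log(matchRootDomain("sub.domain.google.com/no/thanks"));
```

Because the leftmost match wins, a host whose subdomain labels themselves look like a TLD (my.com.example.org, say) would match too early with this sketch.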
If you have a limited subset of data, I suggest keeping the regex simple, e.g.
This will match:
In my case, I know that all relevant URLs will be matched using this regex.
Collect a sample dataset and test it against your regex. While prototyping, you can do that using a tool such as https://regex101.com/r/aG9uT0/1. In development, automate it using a test script.
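A test script along those lines could be as small as the sketch below. The pattern `SIMPLE_RE` and the sample table are hypothetical stand-ins (they assume every relevant host ends in .com); substitute your own regex and data.

```javascript
// Hypothetical simple pattern for a dataset where every host ends in .com.
const SIMPLE_RE = /[a-z0-9-]+\.com\b/i;

// Sample dataset: [input, expected root domain].
const samples = [
  ["www.google.com", "google.com"],
  ["sub.domains.are.cool.google.com", "google.com"],
  ["sub.domain.google.com/no/thanks", "google.com"],
];

// Runs every case and collects the ones whose match differs from the expectation.
function runSamples(re, cases) {
  const failures = [];
  for (const [input, expected] of cases) {
    const m = re.exec(input);
    const actual = m ? m[0] : null;
    if (actual !== expected) failures.push({ input, expected, actual });
  }
  return failures;
}

const failures = runSamples(SIMPLE_RE, samples);
console.log(failures.length === 0 ? "all samples passed" : failures);
```

Returning the failing cases rather than stopping at the first one makes it easier to see which kinds of URL the pattern misses.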
This is an improvement upon theracoonbear's answer.
I did a quick bit of testing and noticed that if you give it a domain where the subdomain has a subdomain, it will fail. I also wanted to point out that the "90%" was definitely not generous. It will be a lot closer to 100% than you think. It works on all subdomains of the top 50 most visited websites, which account for a huge chunk of worldwide internet activity. The only time it would fail is potentially with Unicode domains, etc.
My solution starts off working the same way that theracoonbear's does. Instead of checking for a word boundary, it uses a negative lookahead to check if there is not something that could be a TLD at the end (just copied the TLD checking part over into a negative lookahead).
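One way to sketch the lookahead idea is shown below. The TLD alternatives are illustrative assumptions, and the lookahead here is stricter than copying the TLD pattern: it rejects a candidate whenever more label characters or another dot-separated label follow. The names `ROOT_RE` and `findRootDomain` are hypothetical.

```javascript
// Same label + dot + TLD shape, but the match is only accepted when it is
// not followed by further hostname characters or by ".label".
const ROOT_RE =
  /[a-z0-9-]+\.(?:(?:co|com|org)\.[a-z]{2}|com|net|org|[a-z]{2})(?![a-z0-9-]|\.[a-z0-9-])/i;

function findRootDomain(url) {
  const m = ROOT_RE.exec(url);
  return m ? m[0].toLowerCase() : null;
}

console.log(findRootDomain("my.com.example.org"));
console.log(findRootDomain("sub.domain.google.com/no/thanks"));
```

The lookahead consumes nothing, so a trailing path such as /no/thanks still ends the host cleanly while a TLD-like subdomain in the middle of the host is skipped.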
Without testing the validity of the top-level domain, I'm using an adaptation of stormsweeper's solution:
EDIT: Be careful with three-part domains like domain.co.uk, where the suffix (co.uk) itself has two parts.