How can I make this regex greedy?

Posted on 2025-01-17 04:14:56


I'm trying to extract the domain + subdomain from any URL (without the full URL suffix or http and www prefix).

I have the following inputs, each shown with the output I expect:

p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com

I'm using the following regex to extract domain + subdomain:

[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?

The issue is that it splits several domains into two matches, such as d.amazon.ca -> d.ama + zon.ca, and it also matches some non-domain text, such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.

How can I force the regex to be greedy in the sense that it matches the full domain as a single match?

I'm using Java.
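
The split can be reproduced with a short sketch, assuming the regex is applied with repeated `Matcher.find()` calls (the question does not say how it is run, and the class and method names here are illustrative only). The quantifiers in the pattern are already greedy; the split appears to come from `[a-zA-Z]{1,3}`, which takes at most three letters after the first dot, with nothing forcing the match to end at a label boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
    // The regex from the question, unchanged (backslashes doubled for a Java string literal).
    static final Pattern QUESTION_REGEX = Pattern.compile(
        "[^w.\\:\\/]+[a-zA-Z\\.]?\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?(\\.[a-zA-Z]{0,3})?"
        + "|[w]{1,2}[^w.]+\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?");

    // Collects every non-overlapping match, in the order find() returns them.
    static List<String> allMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUESTION_REGEX.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "d." is followed by at most three letters ("ama"), so the first match ends there
        // and the rest of the input starts a second match.
        System.out.println(allMatches("d.amazon.ca")); // prints [d.ama, zon.ca]
    }
}
```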


未蓝澄海的烟 2025-01-24 04:14:56

I'd use the standard URI class instead of a regular expression to parse out the domain:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class Demo {
    private static Optional<String> getHostname(String domain) {
        try {
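            // Add a scheme if missing, so that URI parses the input as a URL with a host
            if (!domain.contains("://")) {
                domain = "https://" + domain;
            }
            URI uri = new URI(domain);
            // getHost() returns null when no host could be parsed; otherwise drop a leading "www."
            return Optional.ofNullable(uri.getHost()).map(s -> s.startsWith("www.") ? s.substring(4) : s);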
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] domains = new String[] {
            "p.io",
            "amazon.com",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
            "smile.amazon.com"
        };
        for (String domain : domains) {
            System.out.println(getHostname(domain).orElse("hostname not found"));
        }
    }
}

outputs

p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
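
If a regular expression is still preferred, one possible sketch is to anchor the pattern at the start of the input, skip an optional scheme and an optional `www.` prefix, and then capture a run of dot-separated labels in a single group. The pattern, the `HostRegexDemo` class, and the `extractHost` helper are assumptions made for illustration, not part of the answer above. Like the `URI` version, this keeps `regex101.comdddd` whole; trimming it to `regex101.com` would need a list of known TLDs, which is outside this sketch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostRegexDemo {
    // Optional scheme ("https://"), optional "www.", then two or more dot-separated labels.
    // The capture group stops at the first character that is not a letter, digit, '-' or '.',
    // for example the '/' that begins a path.
    private static final Pattern HOST = Pattern.compile(
        "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:www\\.)?([a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+)");

    // Returns the captured host, or an empty Optional when the input does not match.
    static Optional<String> extractHost(String input) {
        Matcher m = HOST.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "d.amazon.ca",
            "www.wix.com.co",
            "https://regex101.com/",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String input : inputs) {
            System.out.println(extractHost(input).orElse("hostname not found"));
        }
    }
}
```

Because the whole host is captured by one group anchored at the start, a name such as d.amazon.ca cannot be split into two matches, and path segments are never considered.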
