How can I make this regex greedy?

Posted 2025-01-17 04:14:56

I'm trying to extract the domain + subdomain from any URL (without the full URL suffix or http and www prefix).

I have the following list of inputs, each with the output I expect:

p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com

I'm using the following regex to extract domain + subdomain:

[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?

The issue is that it splits several domains in two, for example d.amazon.ca -> d.ama + zon.ca, and it also matches non-domain text such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.

How can I force the regex to be greedy in the sense that it matches the full domain as a single match?

I'm using Java.

Comments (1)

未蓝澄海的烟 2025-01-24 04:14:56

I'd use the standard URI class instead of a regular expression to parse out the domain:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class Demo {
    private static Optional<String> getHostname(String domain) {
        try {
            // Add a scheme if missing
            if (!domain.contains("://")) {
                domain = "https://" + domain;
            }
            URI uri = new URI(domain);
            return Optional.ofNullable(uri.getHost()).map(s -> s.startsWith("www.") ? s.substring(4) : s);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] domains = new String[] {
            "p.io",
            "amazon.com",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.regex101.comdddd", // .comdddd is (potentially) a valid TLD; not sure why your output removes the d's
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions",
            "smile.amazon.com"
        };
        for (String domain : domains) {
            System.out.println(getHostname(domain).orElse("hostname not found"));
        }
    }
}

outputs

p.io
amazon.com
d.amazon.ca
domain.amazon.co.uk
regex101.com
regex101.comdddd
wix.com.co
stackoverflow.com
smile.amazon.com
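
If a regular expression is still wanted, note that the split is not really a greediness problem: the quantifiers in the question's pattern are already greedy. The part after the first dot is [a-zA-Z]{1,3}, which accepts at most three letters, so the match on d.amazon.ca has to stop at d.ama and the next search starts again at zon.ca. The following is only a rough sketch, assuming each input string holds exactly one URL or bare hostname. The class name, the HOST pattern, and the helper methods are illustrative names of mine, not something from the question, and like the URI version it keeps regex101.comdddd as is rather than trimming it to regex101.com.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HostRegexDemo {
    // The pattern from the question, copied unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "[^w.\\:\\/]+[a-zA-Z\\.]?\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?(\\.[a-zA-Z]{0,3})?"
        + "|[w]{1,2}[^w.]+\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?");

    // Hypothetical alternative (an assumption, not the asker's pattern):
    // anchor at the start, skip an optional scheme and an optional "www.",
    // then capture everything up to the first '/', ':', '?', '#' or whitespace.
    static final Pattern HOST = Pattern.compile(
        "^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?(?:www\\.)?([^/:?#\\s]+)");

    // Collects every match that repeated find() calls produce.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Returns the captured host, or empty if the input has no host part.
    static Optional<String> extractHost(String input) {
        Matcher m = HOST.matcher(input.trim());
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // {1,3} caps the label after the first dot, hence two matches.
        System.out.println(findAll(ORIGINAL, "d.amazon.ca")); // [d.ama, zon.ca]

        String[] inputs = {
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String input : inputs) {
            System.out.println(extractHost(input).orElse("hostname not found"));
        }
    }
}
```

Anchoring with ^ is what prevents the path segment from being picked up as a second match. The sketch does not handle user info (user@host) or IPv6 literals, which is one more reason to prefer URI when the inputs are not under your control.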