How can I make this regex greedy?
I'm trying to extract the domain + subdomain from any URL (without the rest of the URL after the domain, and without the http and www prefixes).
I have the following list of inputs and expected outputs:
p.io -> p.io
amazon.com -> amazon.com
d.amazon.ca -> d.amazon.ca
domain.amazon.co.uk -> domain.amazon.co.uk
https://regex101.com/ -> regex101.com
www.regex101.comdddd -> regex101.com
www.wix.com.co -> wix.com.co
https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions -> stackoverflow.com
smile.amazon.com -> smile.amazon.com
I'm using the following regex to extract the domain + subdomain:
[^w.\:\/]+[a-zA-Z\.]?\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?(\.[a-zA-Z]{0,3})?|[w]{1,2}[^w.]+\.[a-zA-Z]{1,3}(\.[a-zA-Z]{1,3})?
The issue is that it splits several domains in two, for example d.amazon.ca -> d.ama + zon.ca, and it also matches some non-domain text, such as what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions.
How can I force the regex to be greedy in the sense that it matches the full domain as a single match?
I'm using Java.
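The split described above can be reproduced with a short Java check. This is only a sketch for illustration: the class name SplitDemo and the helper allMatches are made up here, and only the pattern itself comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDemo {
    // The regex from the question, unchanged; backslashes are doubled
    // because this is a Java string literal.
    static final Pattern QUESTION_REGEX = Pattern.compile(
        "[^w.\\:\\/]+[a-zA-Z\\.]?\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?(\\.[a-zA-Z]{0,3})?"
        + "|[w]{1,2}[^w.]+\\.[a-zA-Z]{1,3}(\\.[a-zA-Z]{1,3})?");

    // Hypothetical helper: collects every match that find() reports, in order.
    static List<String> allMatches(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = QUESTION_REGEX.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [d.ama, zon.ca]
        System.out.println(allMatches("d.amazon.ca"));
    }
}
```

The quantifiers here are already greedy. After the first dot, [a-zA-Z]{1,3} can consume at most three letters, and nothing in the pattern requires the match to end at a dot, a slash, or the end of the input, so the first match stops at d.ama and the next search starts at zon.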
Comments (1)
I'd use the standard URI class (java.net.URI) instead of a regular expression to parse out the domain.
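A minimal sketch of what that approach could look like, under a few assumptions that are not from the original answer: inputs without a scheme get http:// prepended so that java.net.URI can find a host, and a leading www. is stripped afterwards. The names DomainExtractor and extractDomain are hypothetical.

```java
import java.net.URI;

class DomainExtractor {
    // Hypothetical helper: returns the host of the input without a leading "www.".
    static String extractDomain(String input) {
        // java.net.URI only reports a host when the input has a scheme and
        // a "//" authority part, so bare inputs such as "d.amazon.ca" get
        // "http://" prepended first (an assumption of this sketch).
        String withScheme = input.contains("://") ? input : "http://" + input;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host found in: " + input);
        }
        // Drop the "www." prefix, as the question asks.
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "p.io",
            "d.amazon.ca",
            "domain.amazon.co.uk",
            "https://regex101.com/",
            "www.wix.com.co",
            "https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions"
        };
        for (String in : inputs) {
            System.out.println(in + " -> " + extractDomain(in));
        }
    }
}
```

This sketch does not cover the www.regex101.comdddd -> regex101.com case from the question: the parser treats comdddd as an ordinary host label, and trimming it would need a list of known suffixes, which is outside what URI provides.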