Properly matching IDN URLs

Posted 2024-08-16 06:30:05

I need help building a regular expression that can properly match a URL inside free text.

- scheme
  - one of the following: ftp, http, https (is ftps a protocol?)
- optional user (and optional pass)
- host (with support for IDNs)
  - support for www and sub-domain(s) (with support for IDNs)
  - basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
- optional port number
- path (optional, with support for Unicode chars)
- query (optional, with support for Unicode chars)
- fragment (optional, with support for Unicode chars)
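
As a rough illustration of the component list above, here is a Python sketch with one named group per component. The group names, the `find_urls` helper and the sample text are made up for this example, and the TLD filter is the `[a-zA-Z]{2,6}` guess from the list, so longer or non-ASCII TLDs will not match. It relies on `\w` matching Unicode letters in Python 3 and is a starting point, not a complete URL grammar.

```python
import re

# Hypothetical pattern: one named group per component in the list above.
URL_RE = re.compile(
    r"""
    (?P<scheme>https?|ftp)://                       # ftp, http or https
    (?:(?P<user>[^\s:@/]+)                          # optional user ...
       (?::(?P<password>[^\s@/]*))?@)?              # ... and optional password
    (?P<host>
        (?:[^\W_](?:(?:[^\W_]|-)*[^\W_])?\.)+       # Unicode labels, dot-separated
        [a-zA-Z]{2,6}                               # TLD filter taken from the list
    )
    (?::(?P<port>\d{1,5}))?                         # optional port
    (?P<path>/[^\s?\#]*)?                           # optional path
    (?:\?(?P<query>[^\s\#]*))?                      # optional query
    (?:\#(?P<fragment>\S*))?                        # optional fragment
    """,
    re.VERBOSE,
)


def find_urls(text):
    """Return one dict of named components per URL found in text."""
    return [m.groupdict() for m in URL_RE.finditer(text)]


sample = "see http://user:pw@münchen.example.org:8080/päth?q=ü#frag now"
for parts in find_urls(sample):
    print(parts)
```

Components that are absent from a URL come back as `None` in the dict, which makes it easy to tell "no port" apart from an empty value.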

Here is what I could find out about sub-domains:

A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters.
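
The limits quoted above are awkward to express in a single regular expression, so one option is to check them in code after a candidate host has been matched. The sketch below is only an illustration: `to_ascii`, `is_valid_hostname` and `MAX_NAME_LENGTH` are names invented here, and the value 253 assumes that the 255 figure counts the length octets of the DNS wire format rather than the dotted text. It uses Python's built-in `idna` codec, which follows the older IDNA 2003 rules, to convert internationalized labels to their ASCII form before measuring them.

```python
import re

# One ASCII label: 1 to 63 letters, digits or hyphens, no hyphen at either end.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")
MAX_NAME_LENGTH = 253  # assumed text-form equivalent of the 255-octet limit


def to_ascii(hostname):
    """Convert a possibly internationalized hostname to its ASCII form."""
    return hostname.encode("idna").decode("ascii")


def is_valid_hostname(hostname):
    """Check label and total length limits on the ASCII form of hostname."""
    if hostname.endswith("."):
        hostname = hostname[:-1]  # allow a single trailing dot (the root)
    try:
        ascii_name = to_ascii(hostname)
    except UnicodeError:  # the codec rejects empty or over-long labels
        return False
    if not ascii_name or len(ascii_name) > MAX_NAME_LENGTH:
        return False
    return all(LABEL_RE.fullmatch(label) for label in ascii_name.split("."))


print(to_ascii("münchen.example.org"))
print(is_valid_hostname("en.wikipedia.org"))
```

Note that the check accepts single-character labels such as the `x` in `x.org`, since the limits above set a maximum label length but no minimum of three.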

Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:

[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}

Can someone help me out with this regular expression or point me to a good direction?

纵情客 2024-08-23 06:30:05

John Gruber, of Daring Fireball fame, recently published a post detailing his quest for a good URL-recognizing regex string. What he came up with is this:

\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))

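This apparently does fine with Unicode-containing URLs as well. You'd need to modify it slightly to get the rest of what you're looking for: the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex being so write-once-have-no-clue-how-to-read-again!).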
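
If you want to try the pattern in Python, one catch is that the `re` module has no POSIX classes such as `[:punct:]`. The sketch below is an adaptation, not Gruber's original: it substitutes the ASCII characters from `string.punctuation`, which is an assumption of mine and narrower than what a Unicode-aware `[:punct:]` would cover. `GRUBER_RE` and `extract_urls` are names made up for the example.

```python
import re
import string

# Assumption: ASCII punctuation stands in for the POSIX [:punct:] class.
PUNCT = re.escape(string.punctuation)

GRUBER_RE = re.compile(
    r"\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^" + PUNCT + r"\s]|/)))"
)


def extract_urls(text):
    """Return the full text of every match, ignoring the inner groups."""
    return [m.group(0) for m in GRUBER_RE.finditer(text)]


text = (
    "See http://en.wikipedia.org/wiki/Rome_(disambiguation) "
    "and www.bücher.example/katalog, then stop."
)
print(extract_urls(text))
```

The last alternation is what keeps a trailing comma or full stop out of the match while still allowing a closing parenthesis that belongs to the URL.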

折戟 2024-08-23 06:30:05

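If you require the protocol and aren't too worried about false positives, by far the easiest thing to do is match all non-whitespace characters around ://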
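
A minimal Python sketch of that idea, with `LOOSE_URL_RE` and `find_loose_urls` as made-up names:

```python
import re

# Any run of non-whitespace characters with "://" somewhere inside it.
LOOSE_URL_RE = re.compile(r"\S+://\S+")


def find_loose_urls(text):
    """Return every whitespace-delimited token that contains '://'."""
    return LOOSE_URL_RE.findall(text)


print(find_loose_urls("docs at https://例え.jp/パス?q=1 and (ftp://host/file)."))
```

Unicode hosts and paths come through untouched, but so does any punctuation stuck to the token, such as the parentheses and full stop around the second URL here. That is the false-positive trade-off mentioned above.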

花辞树 2024-08-23 06:30:05

This will get you most of the way there. If you need it more refined, please provide test data.

(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?
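
For anyone who wants to try it, here is a small Python harness around the pattern exactly as written. `ANSWER_RE`, `match_urls` and the sample strings are made up for the example. Python 3's `\w` matches Unicode letters, so IDN hosts get through without changes.

```python
import re

# The pattern from this answer, unchanged.
ANSWER_RE = re.compile(r"(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?")


def match_urls(text):
    """Return whole matches; findall would return tuples of groups instead."""
    return [m.group(0) for m in ANSWER_RE.finditer(text)]


text = (
    "Get ftp://files.example.com:21/pub/readme.txt "
    "or https://bücher.example/katalog?seite=2 today"
)
print(match_urls(text))
print(match_urls("http://example.com/a-b"))
```

The second call shows one place where refinement would be needed: the path class has no hyphen in it, so the match stops at `http://example.com/a`. User info and fragments are not covered either.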
