Need a regular expression to validate URLs, with support for %20 and ()

Posted 2024-08-18 06:58:03

I'm currently using the following regular expression to validate URLs:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?#Username:Password)(?:\w+:\w+@)?  (?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|edu|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:\/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

I borrowed this from somewhere on the web (don't remember where) to improve upon this:

^((https?|file|ftp|gopher|news|nntp):\/\/)([a-z]([a-z0-9\-]*\.)+([a-z]{2}|aero|arpa|biz|com|coop|edu|gov|info|int|jobs|mil|museum|name|nato|net|org|pro|travel)|(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))(\/[a-z0-9_\-\.~]+)*(\/([a-z0-9_\-\.]*)(\?[a-z0-9+_\-\.%=&]*)?)?(#[a-z][a-z0-9_]*)?$

However, neither of these is capable of validating this URL (which should be valid):

http://somedomain.com/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg

The problem is the %20 and the round brackets (). Try as I might, I couldn't get either of the regexes above to validate that URL without breaking something else. I'm not experienced with writing fancy regular expressions, so that doesn't help either. All the other expressions I've found on the web fail on silly things such as this:

http://www.test..com

Help would be appreciated.

独自←快乐 2024-08-25 06:58:03

You're validating two things with the same regular expression:

Well formed -- Is it syntactically correct?
Plausible -- Are the protocol and top-level domain plausible?

Separating these validations may be fruitful. To handle well-formedness, start with this regular expression from RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, Appendix B (p. 50):

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

The RFC gives it as a way to break a well-formed URI reference into its components. Every group is optional, so it will match almost any string; what it buys you is a reliable split, not a strict syntax check. The match groups give you the various pieces, which are:

scheme    = $2
authority = $4
path      = $5
query     = $7
fragment  = $9

Let's see what comes out of the sample URI you gave:

2 (scheme)   : "http"
4 (authority): "somedomain.com"
5 (path)     : "/users/1234/images/Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg"
7 (query)    : nil
9 (fragment) : nil
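
The breakdown above can be reproduced with a short Python sketch. This is only an illustration, assuming Python's standard `re` module; the names `URI_RE` and `split_uri` are made up for this example and are not from the RFC. Absent components come back as `None` (the `nil` shown above).

```python
import re

# Regular expression from RFC 3986, Appendix B (copied unchanged).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five components.

    Hypothetical helper for illustration. Components that are absent
    are returned as None.
    """
    # Every group in the pattern is optional, so match() always succeeds.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


sample = ("http://somedomain.com/users/1234/images/"
          "Staff%20Photos%202008/FirstName%20LastName_1%20(Small).jpg")

for name, value in split_uri(sample).items():
    print(f"{name:9}: {value!r}")
```

Note that `split_uri("http://www.test..com")` also succeeds and reports the authority as `www.test..com`. The split alone does not reject that address, which is why the individual pieces still need their own checks.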

Now that you've got the individual pieces, you can check each one for plausibility. For example, to get the TLD from the authority, apply this regular expression to the authority: