Can a URL contain a semicolon and still be valid?
I am using a regular expression to convert plain-text URLs into clickable links.
@(https?://([-\w\.]+)+(:\d+)?(/([\w/_\.-]*(\?\S+)?)?) ?)@
However, sometimes in the body of the text, URLs are listed one per line with a semicolon at the end. The real URLs do not contain any ";".
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;
http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124
Are semicolons (;) allowed in URLs, or can a semicolon be treated as a marker for the end of a URL? How would that fit into my regular expression?
A semicolon is reserved and should only be used for its special purpose (which depends on the scheme).
Section 2.2:
The W3C encourages CGI programs to accept ";" as well as "&" in query strings (i.e. treat ?name=fred&age=50 and ?name=fred;age=50 the same way). This is supposed to be because "&" has to be encoded as "&amp;" in HTML, whereas ";" doesn't.
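The recommendation to treat ";" and "&" alike in query strings can be illustrated with a short Python sketch. The function name parse_query is hypothetical and chosen only for this example; it splits on either separator by hand rather than relying on any particular library default.

```python
import re
from typing import Dict, List
from urllib.parse import unquote_plus


def parse_query(query: str) -> Dict[str, List[str]]:
    """Split a query string on either '&' or ';' and decode each pair.

    Hypothetical helper for illustration; not taken from the thread.
    """
    result: Dict[str, List[str]] = {}
    for pair in re.split(r"[&;]", query.lstrip("?")):
        if not pair:
            continue  # skip empty pieces, e.g. after a trailing separator
        name, _, value = pair.partition("=")
        result.setdefault(unquote_plus(name), []).append(unquote_plus(value))
    return result


print(parse_query("?name=fred&age=50"))
print(parse_query("?name=fred;age=50"))
```

Both calls produce the same mapping, which is the behaviour the recommendation asks for. Note that a parser written this way would also silently drop a stray trailing ";" like the one in the question.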
The semi-colon is a legal URI character; it belongs to the sub-delimiter category: http://www.ietf.org/rfc/rfc3986.txt
However, the specification states that whether the semi-colon is legitimate for a specific URI depends on the scheme or producer of that URI. So, if the site using those links doesn't allow semi-colons, then they're not valid for that particular case.
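As a rough illustration of that point, the sketch below lists the reserved character sets from RFC 3986, section 2.2, and shows that a generic parser (here Python's urllib.parse.urlsplit) keeps a trailing semicolon as part of the query. The helper name is_sub_delim is an assumption made for this example.

```python
from urllib.parse import urlsplit

# Reserved characters as listed in RFC 3986, section 2.2.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def is_sub_delim(ch: str) -> bool:
    """Return True if ch is one of the RFC 3986 sub-delimiters (hypothetical helper)."""
    return ch in SUB_DELIMS


url = "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;"
parts = urlsplit(url)

print(is_sub_delim(";"))  # the semicolon is a sub-delimiter
print(parts.path)         # path component
print(parts.query)        # query component, trailing ';' included
```

Since the generic syntax accepts the semicolon, the parser cannot tell whether it is data or punctuation; only the site that produced the URL (or the surrounding prose) can settle that.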
Technically, a semicolon is a legal sub-delimiter in a URL string; plenty of source material is quoted above including http://www.ietf.org/rfc/rfc3986.txt.
And some do use it for legitimate purposes, though its use is likely site-specific (i.e., only for use with that site) because its usage has to be defined by the site using it.
In the real world however, the primary use for semicolons in URLs is to hide a virus or phishing URL behind a legitimate URL.
For example, sending someone an email with this link:
http://www.yahoo.com/junk/nonsense;0200.0xfe.0x37.0xbf/malicious_file/
will result in the Yahoo! link (www.yahoo.com/junk/nonsense) being ignored because, even though it is legitimate (i.e., properly formed), no such page exists. But the second link (0200.0xfe.0x37.0xbf/malicious_file/) presumably exists* and the user will be directed to the malicious_file page, whereupon one's corporate IT manager will get a report and one will likely get a pink slip.
And before all the nay-sayers get their dander up, this is exactly how the new Facebook phishing problem works. The names have been changed to protect the guilty as usual.
*No such page actually exists to my knowledge. The link shown is for purposes of this discussion only.
Yes, semicolons are valid in URLs. However, if you're plucking them from relatively unstructured prose, it's probably safe to assume a semicolon at the end of a URL is meant as sentence punctuation. The same goes for other sentence-punctuation characters like periods, question marks, quotes, etc.
If you're only interested in URLs with an explicit http[s] protocol, and your regex flavor supports lookbehinds, this regex should suffice:
https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])
After the protocol, it simply matches one or more characters that may be valid in a URL, without worrying about structure at all. But then it backs off as many positions as necessary until the final character is not something that might be sentence punctuation.
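Here is a minimal Python sketch of that regex applied to the URLs from the question. Python's re module supports this kind of single-character lookbehind. The names URL_RE, extract_urls and linkify are made up for this example, and the linkify helper is only an assumption about how the matches might be turned into anchors; it escapes the URL but leaves the surrounding text untouched.

```python
import re
from html import escape

# The pattern from the answer above, unchanged. A raw triple-quoted string
# is used because the pattern contains both kinds of quote characters.
URL_RE = re.compile(
    r"""https?://[\w!#$%&'()*+,./:;=?@\[\]-]+(?<![!,.?;:"'()-])"""
)


def extract_urls(text: str) -> list:
    """Return every URL found in text, without trailing sentence punctuation."""
    return URL_RE.findall(text)


def linkify(text: str) -> str:
    """Wrap each matched URL in an <a> tag (hypothetical helper)."""
    def to_anchor(match):
        url = escape(match.group(0), quote=True)
        return '<a href="{0}">{0}</a>'.format(url)
    return URL_RE.sub(to_anchor, text)


sample = (
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=275;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=123;\n"
    "http://www.aaa.org/pressdetail.asp?PRESS_REL_ID=124\n"
)
for url in extract_urls(sample):
    print(url)
```

A semicolon in the middle of a URL is kept, since only the final character is checked by the lookbehind. The trade-off is that a URL which genuinely ends in one of the excluded characters, such as a closing parenthesis, loses that last character.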
http://www.ietf.org/rfc/rfc3986.txt covers URLs and what characters may appear in unencoded form. Given that URLs containing semicolons work properly in browsers, your code should support them.
Quoting RFCs is not all that helpful in answering this question, because you will encounter URLs with semicolons (and commas for that matter). We had a Regex that did not handle semicolons and commas, and some of our users at NutshellMail complained because URLs containing them do in fact exist in the wild. Try building a dummy URL in Facebook or Twitter that contains a ';' or ',' and you will see that those two services encode the full URL properly.
I replaced the Regex we were using with the following pattern (and have tested that it works):
This Regex came from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/ (with a slight modification)