Which characters make a URL invalid?
Are these valid URLs?
- example.com/file[/].html
- http://example.com/file[/].html
In general, URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters: the ASCII letters A-Z and a-z, the digits 0-9, and the symbols -._~:/?#[]@!$&'()*+,;=
Note that this list doesn't state where in the URI these characters may occur.
Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions on which characters must be represented by a percent-encoded sequence.
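As a rough illustration of the percent-encoding rule above, the sketch below uses Python's standard urllib.parse.quote to encode a single path segment. Python and the helper name encode_segment are assumptions made for this example; they are not part of the answer.

```python
from urllib.parse import quote


def encode_segment(segment: str) -> str:
    """Percent-encode one path segment, keeping only unreserved characters."""
    # safe="" means that even "/" is encoded, which is what we want when the
    # slash is data (part of a file name) rather than a path separator.
    return quote(segment, safe="")


print(encode_segment("file[/].html"))  # file%5B%2F%5D.html
```

Encoding segment by segment matters here: running quote over a whole URL with its default safe="/" would leave the slash inside the brackets untouched.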
The '[' and ']' in this example are "unwise" characters but still legal. If the '/' inside the brackets is meant to be part of the file name, then the URL is invalid, since '/' is reserved and should be properly encoded: http://example.com/file%5B%2F%5D.html
To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.
Some characters are disallowed and should never appear in a URL/URI, some are reserved (described below), and others may cause problems in some cases and are marked as "unwise" or "unsafe". The reasons for these restrictions are spelled out in RFC 1738 (URLs) and RFC 2396 (URIs). Note that the newer RFC 3986 (which updates RFC 1738 and obsoletes RFC 2396) defines which characters are allowed in a given context, but the older spec offers a simpler and more general description of which characters are not allowed, with the following rules.
Excluded US-ASCII characters disallowed within the URI syntax: the control characters (00-1F and 7F hexadecimal), the space (20 hexadecimal), and the delimiters < > # % "
The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, "#" and "%" are reserved characters that must be used in a specific context.
Unwise characters, which are allowed but may cause problems: { } | \ ^ [ ] `
Characters that are reserved within a query component and/or have special meaning within a URI/URL: ; / ? : @ & = + $ ,
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The authority component, for example, can contain an optional username, so it could be something like ftp://user@hostname/ where the '@' character has special meaning.
Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:
Some of the character restrictions for URIs and URLs are programming language-dependent. For example, the '|' (0x7C) character, although only marked as "unwise" in the URI spec, will throw a URISyntaxException in the Java java.net.URI constructor, so a URL like http://api.google.com/q?exp=a|b is not allowed and must instead be encoded as http://api.google.com/q?exp=a%7Cb if using Java with a URI object instance.
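The encoded form of the '|' example can be produced programmatically instead of by hand. The sketch below uses Python (an assumption; the answer itself talks about Java) and the standard urllib.parse.urlencode; the helper name build_query_url is made up for this illustration.

```python
from urllib.parse import urlencode


def build_query_url(base: str, params: dict) -> str:
    """Append a percent-encoded query string to a base URL that has none yet."""
    # urlencode percent-encodes characters such as "|" in keys and values.
    return base + "?" + urlencode(params)


url = build_query_url("http://api.google.com/q", {"exp": "a|b"})
print(url)  # http://api.google.com/q?exp=a%7Cb
```

Building the query from key/value pairs keeps the encoding decision in one place, so a strict parser on the receiving side never sees the raw '|'.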
Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses that contain non-ASCII characters.
First, a digression into terminology. What are these addresses? Are they valid URLs?
Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since URLs are a type of URI). Per the terminology of the 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.
Per the modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously have been called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.
What characters are allowed under the WHATWG Living Standard?
Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are URL code points and percent-encoded bytes.
What are "URL code points"?
The URL code points are ASCII alphanumerics, the characters !$&'()*+,-./:;=?@_~, and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.
(Note that the list of "URL code points" doesn't include %, but that % characters are allowed in "URL units" if they're part of a percent-encoding sequence.)
The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.
What characters were allowed under the old RFCs?
For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.
First of all, we have two types of RFC 3986 reserved characters:
- :/?#[]@ , which are part of the generic syntax for a URI defined in RFC 3986
- !$&'()*+,;= , which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).
Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)
RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~
Finally, the % character itself is allowed for percent-encodings.
That leaves only the following ASCII characters that are forbidden from appearing in a URL: the control characters (00-1F and 7F hexadecimal), the space, and
"<>\^`{|}
Every other character from ASCII can legally feature in a URL.
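The forbidden set can be derived mechanically from the reserved and unreserved sets quoted above. The following Python sketch (the language and all names in it are assumptions for illustration) subtracts the allowed characters from printable ASCII. Note that, computed this way, the space and the backslash also fall in the forbidden set.

```python
import string

# Character classes as quoted from RFC 3986 in the text above.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="

# "%" is allowed too, but only as the start of a percent-encoding.
ALLOWED = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%")


def forbidden_printable_ascii() -> str:
    """Return the printable ASCII characters (0x20-0x7E) that are not allowed."""
    return "".join(chr(c) for c in range(0x20, 0x7F) if chr(c) not in ALLOWED)


print(repr(forbidden_printable_ascii()))
```

The control characters (0x00-0x1F and 0x7F) lie outside the printable range scanned here and are forbidden as well.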
Then RFC 3987 extends that set of unreserved characters with a set of Unicode character ranges (the ucschar production).
These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.
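For reference, here is a small Python sketch of a membership test for the RFC 3987 ranges. The range table is written down from the ucschar production as I understand it, so treat it as an assumption and check it against the RFC before relying on it; the function name is_ucschar is likewise made up for this example.

```python
# ucschar ranges (assumed from RFC 3987): three BMP ranges, planes 1-13
# up to xFFFD each, and a final range in plane 14.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
UCSCHAR_RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in UCSCHAR_RANGES)


print(is_ucschar("ü"), is_ucschar("a"))
```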
Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.
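To make the "legal only in particular parts" point concrete, here is a minimal Python sketch that checks a path against the RFC 3986 pchar rule (unreserved characters, sub-delims, ':' and '@', or a percent-encoding). It covers the path component only, and the names PCHAR, PATH_RE and is_valid_path are assumptions for this example.

```python
import re

# One path character: unreserved / sub-delims / ":" / "@" / pct-encoded.
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# A path after an authority is zero or more "/segment" parts.
PATH_RE = re.compile("(?:/" + PCHAR + "*)*")


def is_valid_path(path: str) -> bool:
    """Return True if path uses only characters RFC 3986 allows in a path."""
    return PATH_RE.fullmatch(path) is not None


print(is_valid_path("/file[/].html"))        # brackets are not pchars
print(is_valid_path("/file%5B%2F%5D.html"))  # the encoded form passes
```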
In your supplementary question you asked if www.example.com/file[/].html is a valid URL.
That URL isn't valid because a URL is a type of URI and a valid URI must have a scheme like http: (see RFC 3986).
If you meant to ask if http://www.example.com/file[/].html is a valid URL then the answer is still no because the square bracket characters aren't valid there.
The square bracket characters are reserved for URLs in this format: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name).
It's worth reading RFC 3986 carefully if you want to understand the issue fully.
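Both points (the missing scheme and the bracketed IPv6 host) can be observed with Python's standard urllib.parse.urlsplit. This is only a sketch: urlsplit is a lenient splitter, not a validator, and it does not reject brackets in the path. The helper name has_scheme is an assumption for this example.

```python
from urllib.parse import urlsplit


def has_scheme(url: str) -> bool:
    """Return True if the string starts with a URI scheme such as "http:"."""
    return urlsplit(url).scheme != ""


print(has_scheme("www.example.com/file[/].html"))         # no scheme found
print(has_scheme("http://www.example.com/file[/].html"))  # scheme is "http"

# In the IPv6 form, the brackets delimit the host and are stripped from it.
ipv6_host = urlsplit("http://[2001:db8:85a3::8a2e:370:7334]/foo/bar").hostname
print(ipv6_host)
```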
All valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.
All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing each invalid character with a specific code (usually in the form of the percent symbol (%) followed by a hexadecimal number).
This link, HTML URL Encoding Reference, contains a list of the encodings for invalid characters.
Several Unicode character ranges are valid in HTML5, although it might still not be a good idea to use them.
E.g., the href docs at http://www.w3.org/TR/html5/links.html#attr-hyperlink-href say:
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which says it aims to:
That document defines URL code points as:
The term "URL code points" is then used in the statement:
in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.
Also, the validator http://validator.w3.org/ passes for URLs like "你好", and does not pass for URLs with characters like spaces, e.g. "a b".
Of course, as mentioned by Stephen C, it is not just about characters but also about context: you have to understand the entire algorithm. But since the class "URL code points" is used at key points of the algorithm, it gives a good idea of what you can use or not.
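As a rough way to experiment with the "URL code points" class, the Python sketch below encodes the definition as I understand it from the WHATWG URL Standard: ASCII alphanumerics, a fixed set of ASCII punctuation, and U+00A0 to U+10FFFD minus surrogates and noncharacters. Treat the details as an assumption to verify against the standard; the function name is made up for this example.

```python
import string

# ASCII punctuation that counts as URL code points (assumed from the standard).
ASCII_URL_PUNCTUATION = "!$&'()*+,-./:;=?@_~"


def is_url_code_point(ch: str) -> bool:
    """Return True if the single character ch is a URL code point."""
    cp = ord(ch)
    if ch in string.ascii_letters or ch in string.digits:
        return True
    if ch in ASCII_URL_PUNCTUATION:
        return True
    if not 0xA0 <= cp <= 0x10FFFD:
        return False
    if 0xD800 <= cp <= 0xDFFF:  # surrogates
        return False
    # Noncharacters: U+FDD0..U+FDEF and every code point ending in FFFE/FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False
    return True


print(all(is_url_code_point(c) for c in "你好"))
print(is_url_code_point(" "))
```

This matches the validator behaviour described above: the characters of "你好" are URL code points, while the space in "a b" is not.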
See also: Unicode characters in URLs
I needed to select characters to split URLs in a string, so I decided to create a list of characters which could not be found in a URL by myself:
So, the possible choices are the newline, tab, space, backslash and "<>{}^| . I guess I'll go with the space or newline. :)
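A minimal Python sketch of that choice: since space, tab and newline cannot appear unencoded in a URL, splitting on whitespace separates the URLs. The function name split_urls and the sample hosts are assumptions for this example.

```python
def split_urls(text: str) -> list:
    """Split a string of whitespace-separated URLs into a list."""
    # str.split() with no argument splits on any run of spaces, tabs or
    # newlines and drops empty pieces.
    return text.split()


print(split_urls("http://a.example/x http://b.example/y\nhttp://c.example/z"))
```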
I am implementing an old HTTP (0.9, 1.0, 1.1) request and response reader/writer. The request URI is the most problematic place.
You can't just use RFC 1738, 2396 or 3986 as is. There are many old HTTP clients and servers that allow more characters. So I did some research based on accidentally published web server access logs: "GET URI HTTP/1.0" 200.
I found that the following non-standard characters are often used in URIs:
These characters were described in RFC 1738 as unsafe.
If you want to be compatible with all old HTTP clients and servers, you have to allow these characters in the request URI.
Please read more information about this research in oghttp-request-collector.
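One way to tolerate such characters when reading a request line is to avoid validating the URI at all and split only on the first and the last space. The Python sketch below is a hypothetical illustration of that idea, not the code from oghttp-request-collector; parse_request_line is a made-up name.

```python
def parse_request_line(line: str):
    """Split "METHOD URI HTTP/x.y" leniently; returns (method, uri, version)."""
    line = line.rstrip("\r\n")
    method, _, rest = line.partition(" ")
    uri, sep, version = rest.rpartition(" ")
    if sep and version.startswith("HTTP/"):
        # Everything between the method and the version is the URI,
        # even if it contains spaces or other unsafe characters.
        return method, uri, version
    # No version token: treat it as an HTTP/0.9 style "GET /path" request.
    return method, rest, None


print(parse_request_line("GET /a b|c{d} HTTP/1.0\r\n"))
```

The trade-off is that a versionless request whose URI happens to end in " HTTP/..." would be misread, so a real reader would need a stricter check on the version token.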
This is not really an answer to your question, but validating URLs is really a serious p.i.t.a. You're probably just better off validating the domain name and leaving the query part of the URL alone. That is my experience.
You could also resort to pinging the URL and seeing if it results in a valid response, but that might be too much for such a simple task.
Regular expressions to detect URLs are abundant, google it :)
From the source (emphasis added when needed):
I can't comment on the above answers, but wanted to emphasize the point (in another answer) that allowed characters aren't allowed everywhere. For example, domain names can't have underscores, so http://test_url.com is invalid.
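A small Python sketch of that host-specific rule, assuming the classic letters-digits-hyphen convention for host name labels (1 to 63 characters, no leading or trailing hyphen). It only covers DNS-style names, not IP literals, and the names LABEL_RE and has_valid_hostname are assumptions for this example.

```python
import re
from urllib.parse import urlsplit

# One host name label: letters, digits and hyphens, not starting or
# ending with a hyphen, at most 63 characters long.
LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def has_valid_hostname(url: str) -> bool:
    """Return True if every label of the URL's host name passes LABEL_RE."""
    host = urlsplit(url).hostname
    if not host:
        return False
    labels = host.rstrip(".").split(".")
    return all(LABEL_RE.fullmatch(label) for label in labels)


print(has_valid_hostname("http://test_url.com"))  # underscore is rejected
print(has_valid_hostname("http://test-url.com"))
```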
If you need a broader validation that includes emojis (which are sporadically used in URLs nowadays), for example:
http://factmyth.com/factoids/you-????-can-????-put-????-emojis-????-in-????-urls-????/
And even in domain names like: ????.tld
Then this is a useful regex:
PS: It is not valid for all regex "flavors" used in programming languages. It will be valid for Python, Rust, Golang and modern JavaScript, but not for PHP, for example. Check by selecting a flavor on the left and looking for error messages: https://regex101.com/
I came up with a couple of regular expressions for PHP that will convert URLs in text to anchor tags. (First it converts all www. URLs to http://, and then it converts all URLs with https?:// to <a href=...> HTML links.)
$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>', preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string) );
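For readers who don't use PHP, here is a rough Python port of the same two-step substitution. It is a sketch that reuses the character class from the PHP snippet unchanged; the names URL_CHARS and linkify are assumptions. Like the original, it neither HTML-escapes its output nor rewrites a www. URL at the very start of the string (the first pattern requires preceding whitespace).

```python
import re

# Same character class as in the PHP patterns above.
URL_CHARS = r"[!#$&-;=?\-\[\]_a-z~%]+"


def linkify(text: str) -> str:
    """Turn www. and http(s):// URLs in text into HTML anchor tags."""
    # Step 1: prefix bare "www." URLs (preceded by whitespace) with http://
    text = re.sub(r"(\s)(www\." + URL_CHARS + ")", r"\1http://\2",
                  text, flags=re.IGNORECASE)
    # Step 2: wrap every http(s):// URL in an anchor tag.
    return re.sub(r"(https?://)(" + URL_CHARS + ")", r'<a href="\1\2">\2</a>',
                  text, flags=re.IGNORECASE)


print(linkify("see www.example.com/a now"))
```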