Which characters make a URL invalid?

内心荒芜 2024-08-14 02:49:50

In general URIs as defined by RFC 3986 (see Section 2: Characters) may contain any of the following 84 characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=

Note that this list doesn't state where in the URI these characters may occur.

Any other character needs to be encoded with percent-encoding (%hh). Each part of the URI has further restrictions on which characters must be written in percent-encoded form.
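As a rough illustration of the list above, the following Python sketch rebuilds the 84-character set from the RFC 3986 classes (unreserved, gen-delims, sub-delims) and reports which characters of a string fall outside it. The function name `characters_outside_uri_set` is made up for this example, and the check deliberately ignores where in the URI a character occurs.

```python
import string

# RFC 3986 character classes
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_CHARACTERS = frozenset(UNRESERVED + GEN_DELIMS + SUB_DELIMS)


def characters_outside_uri_set(text):
    """Return the sorted distinct characters of text that are not in the set.

    Hypothetical helper for illustration only: it does not check whether
    a character is legal at the position where it appears.
    """
    return sorted(set(text) - URI_CHARACTERS)


print(len(URI_CHARACTERS))
print(characters_outside_uri_set("http://example.com/a b|c"))
```

Note that "%" is not one of the 84 characters, so this helper also reports the percent sign of an already encoded string such as a%20b.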

锦欢 2024-08-14 02:49:50

The '[' and ']' in this example are "unwise" characters but still legal. If the '/' in the []'s is meant to be part of file name then it is invalid since '/' is reserved and should be properly encoded:

http://example.com/file[/].html

To add some clarification and directly address the question above, there are several classes of characters that cause problems for URLs and URIs.

Some characters are disallowed and should never appear in a URL/URI, some are reserved (described below), and others may cause problems in some cases and are marked as "unwise" or "unsafe". The reasons why these characters are restricted are spelled out in RFC 1738 (URLs) and RFC 2396 (URIs). Note that the newer RFC 3986 (which obsoletes RFC 2396 and updates RFC 1738) defines which characters are allowed in a given context, but the older spec offers a simpler and more general description of which characters are not allowed, with the following rules.

Excluded US-ASCII Characters disallowed within the URI syntax:

   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
   space       = <US-ASCII coded character 20 hexadecimal>
   delims      = "<" | ">" | "#" | "%" | <">

The character "#" is excluded because it is used to delimit a URI from a fragment identifier. The percent character "%" is excluded because it is used for the encoding of escaped characters. In other words, the "#" and "%" are reserved characters that must be used in a specific context.

Unwise characters are allowed but may cause problems:

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Characters that are reserved within a query component and/or have special meaning within a URI/URL:

  reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax. Characters in the "reserved" set are not reserved in all contexts. The hostname, for example, can contain an optional username so it could be something like ftp://user@hostname/ where the '@' character has special meaning.

Here is an example of a URL that has invalid and unwise characters (e.g. '$', '[', ']') and should be properly encoded:

http://mw1.google.com/mw-earth-vectordb/kml-samples/gp/seattle/gigapxl/$[level]/r$[y]_c$[x].jpg

Some character restrictions for URIs and URLs depend on the programming language. For example, the '|' (0x7C) character, although only marked as "unwise" in the URI spec, will throw a URISyntaxException in the Java java.net.URI constructor, so a URL like http://api.google.com/q?exp=a|b is not allowed and must instead be encoded as http://api.google.com/q?exp=a%7Cb when using a Java URI object.
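The character classes quoted above can be turned into a small lookup. This is only a Python sketch of the RFC 2396 terminology used in this answer; the names `classify` and `flagged_characters` are invented here. The last line shows one way to let the standard library encode the '|' from the query example instead of writing %7C by hand.

```python
from urllib.parse import urlencode

# Character classes as listed above (RFC 2396 terminology).
DELIMS = set('<>#%"')
UNWISE = set('{}|\\^[]`')
RESERVED = set(';/?:@&=+$,')


def classify(ch):
    """Return the class name for a single character, or "other"."""
    code = ord(ch)
    if code <= 0x1F or code == 0x7F:
        return "control"
    if ch == " ":
        return "space"
    if ch in DELIMS:
        return "delims"
    if ch in UNWISE:
        return "unwise"
    if ch in RESERVED:
        return "reserved"
    return "other"


def flagged_characters(url):
    """Map each excluded or unwise character found in url to its class."""
    problem_classes = {"control", "space", "delims", "unwise"}
    return {ch: classify(ch) for ch in url if classify(ch) in problem_classes}


print(flagged_characters("http://example.com/file[/].html"))

# Letting the library build the query string encodes '|' as %7C.
query = urlencode({"exp": "a|b"})
print("http://api.google.com/q?" + query)
```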

魔 2024-08-14 02:49:50

Most of the existing answers here are impractical because they totally ignore the real-world usage of addresses that contain non-ASCII characters.

First, a digression into terminology. What are these addresses? Are they valid URLs?

Historically, the answer was "no". According to RFC 3986, from 2005, such addresses are not URIs (and therefore not URLs, since URLs are a type of URIs). Per the terminology of 2005 IETF standards, we should properly call them IRIs (Internationalized Resource Identifiers), as defined in RFC 3987, which are technically not URIs but can be converted to URIs simply by percent-encoding all non-ASCII characters in the IRI.

Per modern spec, the answer is "yes". The WHATWG Living Standard simply classifies everything that would previously be called "URIs" or "IRIs" as "URLs". This aligns the specced terminology with how normal people who haven't read the spec use the word "URL", which was one of the spec's goals.

What characters are allowed under the WHATWG Living Standard?

Per this newer meaning of "URL", what characters are allowed? In many parts of the URL, such as the query string and path, we're allowed to use arbitrary "URL units", which are

URL code points and percent-encoded bytes.

What are "URL code points"?

The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.

(Note that the list of "URL code points" doesn't include %, but that %s are allowed in "URL units" if they're part of a percent-encoding sequence.)

The only place I can spot where the spec permits the use of any character that's not in this set is in the host, where IPv6 addresses are enclosed in [ and ] characters. Everywhere else in the URL, either URL units are allowed or some even more restrictive set of characters.
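To make the definition quoted above concrete, here is a minimal Python sketch that tests whether a character is a URL code point and whether a string consists only of URL units. The function names are invented for this example, and it covers only the character-level rule, not the full WHATWG parsing algorithm.

```python
import re
import string

_ASCII_ALNUM = set(string.ascii_letters + string.digits)
_ASCII_URL_PUNCT = set("!$&'()*+,-./:;=?@_~")
_PERCENT_ENCODED_BYTE = re.compile(r"%[0-9A-Fa-f]{2}")


def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF


def is_url_code_point(ch):
    """True if the single character ch is a URL code point as quoted above."""
    if ch in _ASCII_ALNUM or ch in _ASCII_URL_PUNCT:
        return True
    cp = ord(ch)
    return (0xA0 <= cp <= 0x10FFFD
            and not is_surrogate(cp)
            and not is_noncharacter(cp))


def consists_of_url_units(text):
    """True if text is made only of URL code points and percent-encoded bytes."""
    i = 0
    while i < len(text):
        if text[i] == "%":
            if not _PERCENT_ENCODED_BYTE.match(text, i):
                return False
            i += 3
        elif is_url_code_point(text[i]):
            i += 1
        else:
            return False
    return True


print(consists_of_url_units("M\u00f6bius_strip"))
print(consists_of_url_units("a b"))
```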

What characters were allowed under the old RFCs?

For the sake of history, and since it's not explored fully elsewhere in the answers here, let's examine what was allowed under the older pair of specs.

First of all, we have two types of RFC 3986 reserved characters:

:/?#[]@, which are part of the generic syntax for a URI defined in RFC 3986
!$&'()*+,;=, which aren't part of the RFC's generic syntax, but are reserved for use as syntactic components of particular URI schemes. For instance, semicolons and commas are used as part of the syntax of data URIs, and & and = are used as part of the ubiquitous ?foo=bar&qux=baz format in query strings (which isn't specified by RFC 3986).

Any of the reserved characters above can be legally used in a URI without encoding, either to serve their syntactic purpose or just as literal characters in data in some places where such use could not be misinterpreted as the character serving its syntactic purpose. (For example, although / has syntactic meaning in a URL, you can use it unencoded in a query string, because it doesn't have meaning in a query string.)

RFC 3986 also specifies some unreserved characters, which can always be used simply to represent data without any encoding:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~

Finally, the % character itself is allowed for percent-encodings.

That leaves only the following ASCII characters that are forbidden from appearing in a URL:

The control characters (chars 0-1F and 7F), including new line, tab, and carriage return.
The space character and "<>\^`{|}

Every other character from ASCII can legally feature in a URL.

Then RFC 3987 extends that set of unreserved characters with the following unicode character ranges:

  %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
/ %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
/ %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
/ %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
/ %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
/ %xD0000-DFFFD / %xE1000-EFFFD

These block choices from the old spec seem bizarre and arbitrary given the latest Unicode block definitions; this is probably because the blocks have been added to in the decade since RFC 3987 was written.

Finally, it's perhaps worth noting that simply knowing which characters can legally appear in a URL isn't sufficient to recognise whether some given string is a legal URL or not, since some characters are only legal in particular parts of the URL. For example, the reserved characters [ and ] are legal as part of an IPv6 literal host in a URL like http://[1080::8:800:200C:417A]/foo but aren't legal in any other context, so the OP's example of http://example.com/file[/].html is illegal.

情绪 2024-08-14 02:49:50

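In your supplementary question you asked whether www.example.com/file[/].html is a valid URL.

That URL is not valid, because a URL is a type of URI and a valid URI must have a scheme such as http: (see RFC 3986).

If you meant to ask whether http://www.example.com/file[/].html is a valid URL, the answer is still no, because the square bracket characters are not valid there.

The square bracket characters are reserved for URLs of this form: http://[2001:db8:85a3::8a2e:370:7334]/foo/bar (i.e. an IPv6 literal instead of a host name).

It is worth reading RFC 3986 carefully if you want to understand the issue fully.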
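The two checks described in this answer can be sketched in Python with `urllib.parse.urlsplit`. The helper names are made up for illustration, and `urlsplit` itself is lenient: it does not reject square brackets in the path, which is why the second helper looks for them explicitly.

```python
from urllib.parse import urlsplit


def has_scheme(url):
    """True if the string starts with a scheme such as http:."""
    return bool(urlsplit(url).scheme)


def brackets_only_in_host(url):
    """True if '[' and ']' do not occur in the path, query or fragment."""
    parts = urlsplit(url)
    rest = parts.path + parts.query + parts.fragment
    return "[" not in rest and "]" not in rest


print(has_scheme("www.example.com/file[/].html"))
print(brackets_only_in_host("http://www.example.com/file[/].html"))
print(urlsplit("http://[2001:db8:85a3::8a2e:370:7334]/foo/bar").hostname)
```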

∞觅青森が 2024-08-14 02:49:50

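All the valid characters that can be used in a URI (a URL is a type of URI) are defined in RFC 3986.

All other characters can be used in a URL provided that they are "URL encoded" first. This involves replacing the invalid character with a specific "code" (usually a percent sign (%) followed by a hexadecimal number).

The HTML URL Encoding Reference contains a list of the encodings for invalid characters.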
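As a small illustration of what "URL encoding" means, this Python sketch encodes one character by hand (each UTF-8 byte becomes % followed by two hexadecimal digits) and compares the result with the standard library. `percent_encode_char` is a name invented for this example.

```python
from urllib.parse import quote, unquote


def percent_encode_char(ch):
    """Encode every UTF-8 byte of ch as % followed by two hex digits."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


print(percent_encode_char(" "))       # a space
print(percent_encode_char("|"))       # the pipe character
print(quote("a b|\u00e9", safe=""))   # library equivalent for a whole string
print(unquote("a%20b%7C%C3%A9"))      # and the reverse direction
```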

木有鱼丸 2024-08-14 02:49:50

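Several Unicode character ranges are valid in HTML5, although using them may still not be a good idea.

For example, the href documentation at http://www.w3.org/TR/html5/links.html#attr-hyperlink-href says:

The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.

The definition of "valid URL" then points to http://url.spec.whatwg.org/, which says it aims to:

Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.

That document defines URL code points as:

ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

The term "URL code points" is then used in the statement:

If c is not a URL code point and not "%", parse error.

This statement appears in several parts of the parsing algorithm, including the scheme, authority, relative path, query and fragment states: so basically the entire URL.

Also, the validator at http://validator.w3.org/ passes URLs like "你好" and does not pass URLs containing characters such as spaces, like "a b".

Of course, as Stephen C mentioned, it is not just about characters but also about context: you have to understand the entire algorithm. But since the "URL code points" class is used at key points of the algorithm, it gives a good idea of what you can and cannot use.

See also: Unicode characters in URLs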

半城柳色半声笛 2024-08-14 02:49:50

I needed to select characters to split URLs in a string, so I decided to create a list of characters which could not be found in the URL by myself:

>>> from string import printable
>>> allowed = "-_.~!*'();:@&=+$,/?%#[]ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
>>> # printable ASCII characters that are not in the allowed list, in code point order
>>> ''.join(sorted(set(printable) - set(allowed)))
'\t\n\x0b\x0c\r "<>\\^`{|}'

So, the possible choices are the newline, tab, space, backslash and "<>{}^|. I guess I'll go with the space or newline. :)

孤芳又自赏 2024-08-14 02:49:50

I am implementing an old HTTP (0.9, 1.0, 1.1) request and response reader/writer. The request URI is the most problematic place.

You can't just use RFC 1738, 2396 or 3986 as they are. Many old HTTP clients and servers allow more characters. So I did some research based on accidentally published web server access logs: "GET URI HTTP/1.0" 200.

I've found that the following non-standard characters are often used in URIs:

\ { } < > | ` ^ "

These characters were described in RFC 1738 as unsafe.

If you want to be compatible with all old HTTP clients and servers, you have to allow these characters in the request URI.

Please read more information about this research in oghttp-request-collector.
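For illustration only, a lenient request-line parser along these lines might look like the Python sketch below. It is not taken from oghttp-request-collector; the pattern and the name `parse_request_line` are assumptions. It accepts any request target without whitespace or control characters, so the characters listed above pass through, and it treats a missing version as HTTP/0.9.

```python
import re

# Method, then a target with no whitespace or control characters,
# then an optional HTTP version (absent in HTTP/0.9 requests).
REQUEST_LINE = re.compile(r"([A-Z]+) ([^\s\x00-\x1f\x7f]+)(?: HTTP/(\d\.\d))?")


def parse_request_line(line):
    """Return (method, target, version) or None if the line does not parse."""
    match = REQUEST_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    method, target, version = match.groups()
    return method, target, version or "0.9"


print(parse_request_line('GET /a|b{c}"<d> HTTP/1.0\r\n'))
print(parse_request_line("GET /index.html"))
print(parse_request_line("GET /a b HTTP/1.0"))
```

A space inside the target is still rejected here, because it would make the boundary between target and version ambiguous.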

铜锣湾横着走 2024-08-14 02:49:50

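This is not really an answer to your question, but validating URLs is a serious pain. You are probably better off validating just the domain name and leaving the query part of the URL alone. That is my experience.

You could also ping the URL and see whether it produces a valid response, but that may be too much for such a simple task.

Regular expressions for detecting URLs are abundant; google it :)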

仅此而已 2024-08-14 02:49:50

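From the source (emphasis added where needed):

Unsafe:

Characters can be unsafe for a number of reasons. The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.

The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text; the quote mark (""") is used to delimit URLs in some systems. The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it. The character "%" is unsafe because it is used for encodings of other characters. Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs even in systems that do not normally deal with fragment or anchor identifiers, so that if the URL is copied into another system that does use them, it will not be necessary to change the URL encoding.

Source: RFC 1738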

憧憬巴黎街头的黎明 2024-08-14 02:49:50

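I can't comment on the answers above, but I want to emphasize the point (made in another answer) that allowed characters are not allowed everywhere. For example, domain names cannot contain underscores, so http://test_url.com is invalid.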
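A host name check of this kind can be sketched in Python as follows. It assumes the usual letters-digits-hyphen rule for host name labels (at most 63 characters per label, no leading or trailing hyphen); `is_ldh_hostname` is a name made up for this example, and it does not handle IP literals or internationalized names.

```python
import re
from urllib.parse import urlsplit

# One label: letters, digits and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_hostname(host):
    """True if every dot-separated label follows the letters-digits-hyphen rule."""
    if not host or len(host) > 253:
        return False
    labels = host.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) is not None for label in labels)


print(is_ldh_hostname(urlsplit("http://test_url.com").hostname))
print(is_ldh_hostname(urlsplit("http://example.com/a_b").hostname))
```

The second call passes because the underscore sits in the path, where it is allowed.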

演出会有结束 2024-08-14 02:49:50

If you need broader validation that includes emojis (which are sporadically used in URLs nowadays), for example:

http://factmyth.com/factoids/you-????-can-????-put-????-emojis-????-in-????-urls-????/

And even in domain names like : ????.tld

Then this is a useful regex :

[-a-zA-Z0-9\u1F60-\uFFFF@:%_\+.~#?&//=!'(),;*\$\[\]]*

PS: It is not valid for all regex "flavors" used in programming languages. It will be valid for Python, Rust, Golang and modern JavaScript, but not for PHP, for example. Check here by selecting "flavors" on the left and looking for error messages: https://regex101.com/
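One thing worth checking in your own regex flavor: most emoji have code points above U+FFFF, so a range that ends at \uFFFF may not cover them. The Python sketch below compares the character class from this answer with a hypothetical variant whose upper bound is raised to U+10FFFF; the variant and the sample URL are assumptions made for this example.

```python
import re

# The character class from the answer above, unchanged (upper bound \uFFFF).
BMP_ONLY = re.compile(r"[-a-zA-Z0-9\u1F60-\uFFFF@:%_\+.~#?&//=!'(),;*\$\[\]]*")

# Hypothetical variant: same class, but the range runs up to U+10FFFF.
WITH_ASTRAL = re.compile(r"[-a-zA-Z0-9\u1F60-\U0010FFFF@:%_\+.~#?&//=!'(),;*\$\[\]]*")

# U+1F600 (a smiley emoji) written as an escape to keep the source ASCII.
sample = "http://example.com/you-\U0001F600-can/"

print(BMP_ONLY.fullmatch(sample) is not None)
print(WITH_ASTRAL.fullmatch(sample) is not None)
```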

来日方长 2024-08-14 02:49:50

I came up with a couple of regular expressions for PHP that convert URLs in text to anchor tags. (First it converts all www. URLs to http://, then it converts all URLs starting with https?:// to <a href=...> HTML links.)

$string = preg_replace('/(https?:\/\/)([!#$&-;=?\-\[\]_a-z~%]+)/sim', '<a href="$1$2">$2</a>', preg_replace('/(\s)((www\.)([!#$&-;=?\-\[\]_a-z~%]+))/sim', '$1http://$2', $string));