Background (question further down)
I've been Googling this back and forth, reading RFCs and SO questions trying to crack this, but I still haven't got jack.
So I guess we just vote for the "best" answer and that's it, or?
Basically it boils down to this.
3.4. Query Component
The query component is a string of information to be interpreted by the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.
The first thing that boggles me is that *uric is defined like this
uric = reserved | unreserved | escaped
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
This is however somewhat clarified by paragraphs such as
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax; they are used as delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.
This last excerpt feels somewhat backwards, but it clearly states that the reserved character set depends on context. Yet 3.4 states that all the reserved characters are reserved within a query component. However, the only thing that would change the semantics here is escaping the question mark (?), as URIs do not define the concept of a query string.
At this point I've given up on the RFCs entirely but found RFC 1738 particularly interesting.
An HTTP URL takes the form:
http://<host>:<port>/<path>?<searchpart>
Within the <path> and <searchpart> components, "/", ";", "?" are reserved. The "/" character may be used within HTTP to designate a hierarchical structure.
I interpret this as saying that, at least with regard to HTTP URLs, RFC 1738 supersedes RFC 2396. Because the URI query has no notion of a query string, the interpretation of reserved also doesn't really allow me to define query strings the way I'm used to doing by now.
Question
This all started when I wanted to pass a list of numbers together with the request of another resource. I didn't think much of it, and just passed it as comma-separated values. To my surprise, though, the comma was escaped. The query page.html?q=1,2,3
was encoded into page.html?q=1%2C2%2C3
It works, but it's ugly and I didn't expect it. That's when I started going through RFCs.
My first question is simply, is encoding commas really necessary?
My answer, according to RFC 2396: yes, according to RFC 1738: no
Later I found related posts regarding the passing of lists between requests, where the csv approach was portrayed as bad. This showed up instead (I haven't seen this before).
page.html?q=1;q=2;q=3
My second question, is this a valid URL?
My answer, according to RFC 2396: no, according to RFC 1738: no (; is reserved)
I don't have any issues with passing csv as long as it's numbers, but yes you do run into the risk of having to encode and decode values back and forth if the comma suddenly is needed for something else. Anyway I tried the semi-colon query string thing with ASP.NET and the result was not what I expected.
Default.aspx?a=1;a=2&b=1&a=3
Request.QueryString["a"] = "1;a=2,3"
Request.QueryString["b"] = "1"
I fail to see how this greatly differs from a csv approach as when I ask for "a" I get a string with commas in it. ASP.NET certainly is not a reference implementation but it hasn't let me down yet.
But most importantly -- my third question -- where is the specification for this? And what would you do, or for that matter not do?
Comments (7)
That a character is reserved within a generic URL component doesn't mean it must be escaped when it appears within the component or within data in the component. The character must also be defined as a delimiter within the generic or scheme-specific syntax and the appearance of the character must be within data.
The current standard for generic URIs is RFC 3986, which has this to say:
Thus commas are explicitly allowed within query strings and only need to be escaped in data if a specific scheme defines them as a delimiter. The HTTP scheme doesn't use the comma or semi-colon as a delimiter in query strings, so they don't need to be escaped. Whether browsers follow this standard is another matter.
Using CSV should work fine for string data; you just have to follow standard CSV conventions and either quote data or escape the commas with backslashes.
As for RFC 2396, it also allows for unescaped commas in HTTP query strings:
Since commas don't have a reserved purpose under the HTTP scheme, they don't have to be escaped in data. The note from § 2.3 about reserved characters being those that change semantics when percent-encoded applies only generally; characters may be percent-encoded without changing semantics for specific schemes and yet still be reserved.
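As a rough illustration of the point above, the following sketch uses the standard URL API found in modern browsers and Node.js to show that an unescaped comma survives parsing, and that the escaped and unescaped forms decode to the same data. The host example.com is a placeholder and not something from the question; how a given server framework behaves is a separate matter.

```javascript
// Parse the same query in both spellings (example.com is a placeholder host).
const plain = new URL("http://example.com/page.html?q=1,2,3");
const escaped = new URL("http://example.com/page.html?q=1%2C2%2C3");

// The URL parser leaves the literal commas in the query untouched.
console.log(plain.search);

// Both spellings decode to the same parameter value.
const plainValue = plain.searchParams.get("q");
const escapedValue = escaped.searchParams.get("q");
console.log(plainValue === escapedValue);

// encodeURI keeps reserved characters such as ",", while
// encodeURIComponent percent-encodes them.
const viaEncodeURI = encodeURI("page.html?q=1,2,3");
const viaEncodeURIComponent = "page.html?q=" + encodeURIComponent("1,2,3");
console.log(viaEncodeURI, viaEncodeURIComponent);
```

Which of the two encoders a page or framework uses is one likely reason the same list shows up with or without %2C.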
I think the real question is: "What characters should be encoded in a query string?" And that depends mainly on two things: The validity and the meaning of a character.
Validity according to the RFC standard
In RFC3986 we can find which special characters are valid and which are not inside a query string:
Deviations from the standard
Browsers and web frameworks do not always strictly follow the RFC standard. Below are some examples:
[ and ] are not valid, but Chrome and Firefox do not encode these characters inside a query string. The reasoning given by Chrome devs is simply: "If other browsers and an RFC disagree, we will generally match other browsers." QueryHelpers.AddQueryString from ASP.NET Core, on the other hand, will encode these characters.
Other invalid characters that are not encoded by Chrome and Firefox are:
' is a valid character inside a query string but will be encoded by Chrome, Firefox and QueryHelpers nevertheless. The explanation given by Firefox devs is that they knew that they don't have to encode it according to the RFC standard, but did it to reduce vulnerabilities.
Special meaning
Some characters are valid and also don't get encoded by browsers, but should still be encoded in certain cases.
+: Spaces are normally encoded as %20 but alternatively they can be encoded as +. So + inside a query string means it's an encoded space. If you want to include a character that's actually supposed to literally mean plus, then you have to use the encoded version of +, which is %2B.
~: Some old Unix systems interpreted URI parts that started with ~ as a path to a home directory. So it's a good idea to encode ~ if it's not meant to denote the start of a Unix home directory path for an old system (so nowadays probably always encode).
=, &: Usually (although the RFC doesn't specify that this is required) query strings contain parameters in the format "key1=value1&key2=value2". If that's the case and =s or &s should be part of the parameter key or the parameter value instead of giving them the role of separating the key and value or separating the parameters, then you have to encode those =s and &s. So if a parameter value should for some reason consist of the string "=&" then it has to be encoded as %3D%26, which can then be used for the full key and value: "weirdparam=%3D%26".
%: Usually web frameworks figure out that %s that are not followed by two hex characters simply mean the % itself, but it's still a good idea to always encode % when it's supposed to only mean % and not indicate the start of an encoded character (e.g. %7C), because RFC3986 specifies that % is only valid when followed by two hex characters. So don't use "percentageparam=%"; use "percentageparam=%25" instead.
Encoding guidelines
Encode every character that is otherwise invalid* according to RFC3986 and every character that can have special meaning but should only be interpreted in a literal way without giving it a special meaning. You can also encode things that aren't required to be encoded, like '. Why? Because it doesn't hurt to encode more than necessary. Servers and web frameworks, when parsing a query string, will decode every encoded character, no matter whether it was really necessary to encode that character or not.
The only characters of a query string that shouldn't be encoded are those that can have a special meaning and shouldn't lose that special meaning, e.g. don't encode the = of "key1=value1". To achieve that, don't apply an encoding method to the whole query string (and also not to the whole URI) but apply it only and separately to the query parameter keys and values. For example, with JS:
Note that encodeURIComponent encodes a lot more characters than necessary, meaning characters that are valid in a query string and don't have special meaning there, e.g. /, ?, ...
The reason is that encodeURIComponent wasn't created for query strings alone but instead encodes characters that have special meaning outside of the query string as well, e.g. / for the path URI component. QueryHelpers.AddQueryString works in a similar manner. Under the hood it uses System.Text.Encodings.Web.DefaultUrlEncoder, which is not just meant for query strings but also for isegment, ipath-noscheme and ifragment.
* You could probably get away with only regarding those characters as invalid that are both not allowed by the RFC and that are also always encoded by Chrome, for instance. This would be Space " < >. But it's probably better to be on the safer side and encode at least everything that RFC3986 considers invalid.
OP's questions
My first question is simply, is encoding commas really necessary? -> No, it's not necessary, but it doesn't hurt (except for ugliness) and will happen with default encoding methods, e.g. encodeURIComponent, and decoding and query string parsing should work nevertheless.
My second question, is this a valid URL (page.html?q=1;q=2;q=3)? -> It's RFC valid, but your server / web framework might have a hard time parsing the query string when it expects the typical "key1=value1&key2=value2" format for query strings.
Where is specification for this? -> There isn't a single specification that covers everything because some things are implementation specific. For instance, there are different ways of specifying arrays inside of query strings.
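The guideline above, encoding each key and value separately rather than the whole query string, can be sketched roughly as follows. The helper name buildQuery and the sample parameters are made up for illustration; encodeURIComponent is the standard JavaScript function mentioned in this answer.

```javascript
// Hypothetical helper: encode every key and value on its own, then join them
// with literal "=" and "&" so the delimiters keep their special meaning.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, value]) =>
      encodeURIComponent(key) + "=" + encodeURIComponent(String(value)))
    .join("&");
}

const query = buildQuery({
  weirdparam: "=&",     // delimiters used as plain data
  percentageparam: "%", // a literal percent sign
  sum: "1+1",           // a literal plus, not an encoded space
  list: "1,2,3",        // commas get encoded too, which is harmless
});
console.log(query);

// A standard parser recovers the original values.
const decoded = new URLSearchParams(query);
console.log(decoded.get("weirdparam"), decoded.get("sum"));
```

Encoding the assembled string in one go would also escape the = and & that separate the parameters, which is why the helper works pair by pair.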
Just use ?q=1+2+3
I am answering a fourth question here :) that wasn't asked but that all of this started with: how do I pass a list of numbers a la comma-separated values? It seems to me the best approach is just to pass them space-separated, where the spaces will get URL-form-encoded to +. This works great as long as you know the values in the list contain no spaces (something numbers tend not to).
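A minimal sketch of this space-separated approach, assuming the standard URLSearchParams API (browsers and Node.js), which follows application/x-www-form-urlencoded rules. The variable names are illustrative only.

```javascript
// Serializing: a space inside a value is written as "+".
const numbers = [1, 2, 3];
const outgoing = new URLSearchParams({ q: numbers.join(" ") }).toString();
console.log(outgoing);

// Parsing: "+" is decoded back to a space before the value is split.
const received = new URLSearchParams("q=1+2+3").get("q");
const parsed = received.split(" ").map(Number);
console.log(parsed);
```

A server-side parser that does not follow form-encoding rules may treat + as a literal plus, so this relies on both ends agreeing on the convention.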
Yes. The ; is reserved, but not by an RFC. The context that defines this component is the definition of the application/x-www-form-urlencoded media type, which is part of the HTML standard (section 17.13.4.1). In particular the sneaky note hidden away in section B.2.2:
Unfortunately many popular server-side scripting frameworks including ASP.NET do not support this usage.
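Where a framework does not treat ; as a separator, the query can be split by hand. The sketch below is only an illustration of that idea: parseQuery is a hypothetical helper, not part of any framework, and it assumes that both & and ; separate key/value pairs and that + stands for a space.

```javascript
// Hypothetical parser that accepts both "&" and ";" as pair separators.
function parseQuery(query) {
  const result = Object.create(null); // no inherited keys to collide with
  for (const pair of query.split(/[&;]/)) {
    if (pair === "") continue; // skip empty segments such as a trailing "&"
    const eq = pair.indexOf("=");
    const rawKey = eq === -1 ? pair : pair.slice(0, eq);
    const rawValue = eq === -1 ? "" : pair.slice(eq + 1);
    // Form encoding writes spaces as "+", so undo that before percent-decoding.
    const key = decodeURIComponent(rawKey.replace(/\+/g, " "));
    const value = decodeURIComponent(rawValue.replace(/\+/g, " "));
    if (!(key in result)) result[key] = [];
    result[key].push(value);
  }
  return result;
}

const parsedQuery = parseQuery("a=1;a=2&b=1&a=3");
console.log(JSON.stringify(parsedQuery));
```

With this splitting rule, a literal ; inside a value has to be sent as %3B, in the same way that a literal & has to be sent as %26.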
I would like to note that page.html?q=1&q=2&q=3 is a valid URL as well. This is a completely legitimate way of expressing an array in a query string. Your server technology will determine how exactly that is presented.
In Classic ASP, you check Request.QueryString("q").Count and then use Request.QueryString("q")(1) (and (2) and (3); the collection is 1-based).
Note that you saw this in your ASP.NET, too (I think it was not intended, but look):
Notice that the semicolon is ignored, so you have a defined twice, and you got its value twice, separated by a comma. Using all ampersands, Default.aspx?a=1&a=2&b=1&a=3 will yield a as "1,2,3". But I am sure there is a method to get each individual element, in case the elements themselves contain commas. It is simply the default property of the non-indexed QueryString that concatenates the sub-values together with comma separators.
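For comparison, here is roughly how the same repeated-key idea looks with the standard URLSearchParams API in JavaScript. This is a sketch of the general behaviour, not of ASP.NET itself; there, the GetValues method of the NameValueCollection behind Request.QueryString plays a similar role.

```javascript
// Repeated keys: getAll returns every value separately, so values that
// themselves contain commas stay unambiguous.
const repeated = new URLSearchParams("q=1&q=2&q=3");
const allQ = repeated.getAll("q");
console.log(allQ);

// URLSearchParams splits only on "&", so ";" stays inside the first value,
// much like the "1;a=2,3" result shown in the question.
const mixed = new URLSearchParams("a=1;a=2&b=1&a=3");
const allA = mixed.getAll("a");
console.log(allA, mixed.get("b"));
```

Joining allA with a comma gives the same "1;a=2,3" string that the question reports for Request.QueryString["a"].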
I had the same issue. The URL that was hyperlinked was a third-party URL that expected a list of parameters in the format page.html?q=1,2,3 ONLY, and the URL page.html?q=1%2C2%2C3 did not work. I was able to get it working using JavaScript. It may not be the best approach, but you can check out the solution here if it helps anyone.
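The linked solution is not reproduced in this post, so the following is not the poster's code. It is just one possible sketch of building such a link in JavaScript: encode the value as usual, then put the commas back. The helper name encodeKeepingCommas is made up for illustration.

```javascript
// Hypothetical helper: percent-encode a value, then restore the commas that
// the receiving site expects to see unescaped.
function encodeKeepingCommas(value) {
  // encodeURIComponent emits uppercase hex, so a comma is always "%2C".
  return encodeURIComponent(value).replace(/%2C/g, ",");
}

const link = "page.html?q=" + encodeKeepingCommas("1,2,3");
console.log(link);

// Everything other than the comma is still encoded.
const withSpace = encodeKeepingCommas("a b,c");
console.log(withSpace);
```

This only makes sense when the receiving site really does treat the comma as a list separator, since a comma that is part of a value can no longer be told apart from one.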
If you are sending the ENCODED characters to a Flash/SWF file, then you should ENCODE the characters twice!! (because of the Flash parser)
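Double encoding in JavaScript simply means running the encoder twice, as in this small sketch. Whether a particular Flash setup actually needs this is the poster's claim and is not verified here.

```javascript
// Encoding twice turns the "%" of the first pass into "%25".
const once = encodeURIComponent(",");
const twice = encodeURIComponent(once);
console.log(once, twice);

// A receiver that decodes once is left with the singly encoded form.
const afterOneDecode = decodeURIComponent(twice);
console.log(afterOneDecode);
```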