Why should I use urlencode?

Posted 2024-10-11 14:16:35

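I am writing a web application and learning how to urlencode HTML links...

All the urlencode questions here (see the tags below) are "How do I...?" questions.

My question is not "how?" but "why?".

Even the Wikipedia article only covers the mechanics:
http://en.wikipedia.org/wiki/Urlencode but says nothing about why I should use urlencode in my application.

What are the security implications of using (or not using) urlencode?

How can a failure to urlencode be exploited?

What kinds of bugs or failures show up with unencoded URLs?

I'm asking because even without urlencode, links on my application's development site, such as the following, work as expected: http://myapp/my%20test/ée/ràé

Why should I use urlencode?

Or, put another way:

When should I use urlencode? In what kinds of situations?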

知足的幸福 2024-10-18 14:16:35

Update: There is an even better explanation (imo) further above:

A URI is represented as a sequence of characters, not as a sequence
of octets. That is because URI might be "transported" by means that
are not through a computer network, e.g., printed on paper, read over
the radio, etc.

and

For original character sequences that contain non-ASCII characters,
however, the situation is more difficult. Internet protocols that
transmit octet sequences intended to represent character sequences
are expected to provide some way of identifying the charset used, if
there might be more than one [RFC2277]. However, there is currently
no provision within the generic URI syntax to accomplish this
identification. An individual URI scheme may require a single
charset, define a default charset, or provide a way to indicate the
charset used.

Because it is stated in the RFC:

2.4. Escape Sequences
Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below.

and

2.4.2. When to Escape and Unescape
A URI is always in an "escaped" form, since escaping or unescaping a
completed URI might change its semantics. Normally, the only time
escape encodings can safely be made is when the URI is being created
from its component parts; each component may have its own set of
characters that are reserved, so only the mechanism responsible for
generating or interpreting that component can determine whether or
not escaping a character will change its semantics. Likewise, a URI
must be separated into its components before the escaped characters
within those components can be safely decoded.
In some cases, data that could be represented by an unreserved
character may appear escaped; for example, some of the unreserved
"mark" characters are automatically escaped by some systems. If the
given URI scheme defines a canonicalization algorithm, then
unreserved characters may be unescaped according to that algorithm.
For example, "%7e" is sometimes used instead of "~" in an http URL
path, but the two are equivalent for an http URL.
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI. Implementers should be careful not to
escape or unescape the same string more than once, since unescaping
an already unescaped string might lead to misinterpreting a percent
data character as another escaped character, or vice versa in the
case of escaping an already escaped string.
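
To make the warning about escaping or unescaping more than once concrete, here is a minimal Python sketch using the standard library's urllib.parse. The sample string "%41" is purely illustrative and does not come from the RFC.

```python
from urllib.parse import quote, unquote

# Illustrative data: a literal percent sign followed by "41".
data = "%41"

escaped_once = quote(data, safe="")           # "%" becomes "%25"
escaped_twice = quote(escaped_once, safe="")  # escaping again changes the data

unescaped_once = unquote(escaped_once)        # recovers the original data
unescaped_twice = unquote(unescaped_once)     # "%41" is now misread as an escape

print(escaped_once)     # %2541
print(escaped_twice)    # %252541
print(unescaped_once)   # %41
print(unescaped_twice)  # A

# "%7e" and "~" decode to the same character, as the RFC text notes.
print(unquote("%7e") == "~")  # True
```

The second unquote call turns the user's literal "%41" into "A", which is the misinterpretation the RFC describes.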

挖个坑埋了你 2024-10-18 14:16:35

The main reason is that it escapes characters so they can safely be included in the URL of your webpage.

Suppose a user enters "&joe" in a form field and we want to redirect to a page whose URL contains that name. With URL encoding, the URL would be, for example:

localhost/index.php?name=%26joe //note how the ampersand is escaped

If you didn't use URL encoding, you would end up with:

localhost/index.php?name=&joe

and that ampersand would be read as a parameter separator rather than as part of the name, which causes all sorts of unpredictable behavior.
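
As a rough illustration of the "&joe" case, the following Python sketch parses both URLs with urllib.parse. The http:// prefix is added here only so the URL can be parsed; it is an assumption, not part of the answer's example.

```python
from urllib.parse import parse_qs, quote, urlsplit

name = "&joe"  # the user input from the example above

# Assumed base URL; the scheme is added only for parsing.
base = "http://localhost/index.php?name="

unsafe_url = base + name                # name=&joe
safe_url = base + quote(name, safe="")  # name=%26joe

# keep_blank_values=True shows the parameters that end up empty.
unsafe_params = parse_qs(urlsplit(unsafe_url).query, keep_blank_values=True)
safe_params = parse_qs(urlsplit(safe_url).query, keep_blank_values=True)

print(unsafe_params)  # {'name': [''], 'joe': ['']}
print(safe_params)    # {'name': ['&joe']}
```

Without encoding, the server sees an empty name and an unexpected extra parameter called joe; with encoding, it receives the value the user typed.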


悸初 2024-10-18 14:16:35

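There are two reasons to use URL encoding:

When you need to pass characters that are invalid in a URL, such as " < > # % \ | ^ [ ] ` and the space character. For example, a space is not a valid URL character because, if URLs could contain spaces, it would be ambiguous where a URL embedded in a piece of text ends.
When you need to pass characters that are reserved in a URL, such as ! # $ % & ' ( ) * + , / : ; = ? @ [ ]. For example, ? is reserved to mark the start of the query parameters, so leaving a ? unencoded inside a path or a query parameter can break the syntax.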
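
The effect of an unencoded reserved character in a path can be sketched in Python as follows. The host myapp is borrowed from the question, and the segment name is a made-up example.

```python
from urllib.parse import quote, urlsplit

# Hypothetical path segment that contains the reserved characters "?" and "/".
segment = "what?/why"

raw_url = "http://myapp/" + segment
encoded_url = "http://myapp/" + quote(segment, safe="")

raw_parts = urlsplit(raw_url)
encoded_parts = urlsplit(encoded_url)

# Unencoded: the "?" ends the path and starts a query string.
print(raw_parts.path, raw_parts.query)          # /what /why

# Encoded: the whole segment stays inside the path.
print(encoded_parts.path, encoded_parts.query)  # /what%3F%2Fwhy
```

Passing safe="" matters here: by default quote leaves "/" unescaped, which would split the segment into two path levels.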


无名指的心愿 2024-10-18 14:16:35

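There are RFCs that define the URL format, and browser and web server developers rely on them as the standard for interpreting the data. If you don't comply, the results can be unpredictable.

HTTP URLs have their own specification, and it states that almost all non-Latin characters must be encoded.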


痴者 2024-10-18 14:16:35

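Two reasons I can think of:

It really depends on how you parse the query on the server side. For example, when parameters are passed in an HTTP GET request, a character such as & inside a parameter value will cause problems.
It lets you handle non-ANSI characters the way you want (you specify the encoding). Otherwise the browser may send them in some arbitrary encoding (I don't think this is really defined in any standard; correct me if I'm wrong).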
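
A small Python sketch of the encoding point, using the character é from the question's URL. UTF-8 and Latin-1 are chosen here only as two plausible encodings for illustration.

```python
from urllib.parse import quote, unquote

text = "é"

# The same character produces different escape sequences per encoding.
utf8_form = quote(text, encoding="utf-8")      # %C3%A9
latin1_form = quote(text, encoding="latin-1")  # %E9

# Decoding with the matching encoding round-trips cleanly.
roundtrip = unquote(utf8_form, encoding="utf-8")

# Decoding with the wrong encoding yields U+FFFD, the replacement character.
mismatch = unquote(latin1_form, encoding="utf-8")

print(utf8_form, latin1_form)
print(roundtrip == text)       # True
print(mismatch == "\ufffd")    # True
```

Encoding explicitly on the sending side means the receiving side does not have to guess which byte sequence the characters were turned into.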


如梦亦如幻 2024-10-18 14:16:35

How would you distinguish between two paths like these?

http://myapp/my%20test/

and

http://myapp/my test/

Note that in one the space is part of the URL, and in the other the literal text %20 is.
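
One way to see how consistent encoding keeps the two paths apart is the Python sketch below. The two directory names are assumed to be the raw, unencoded names on the server.

```python
from urllib.parse import quote, unquote

# Assumed raw directory names: one contains a space, the other a literal "%20".
dir_with_space = "my test"
dir_with_percent = "my%20test"

encoded_space = quote(dir_with_space, safe="")
encoded_percent = quote(dir_with_percent, safe="")

print(encoded_space)    # my%20test
print(encoded_percent)  # my%2520test

# Each encoded form decodes back to exactly one raw name.
print(unquote(encoded_space) == dir_with_space)      # True
print(unquote(encoded_percent) == dir_with_percent)  # True
```

Once every name is encoded, the two directories map to different URLs, so the ambiguity disappears.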


许一世地老天荒 2024-10-18 14:16:35

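URL encoding is the process of converting a string into a valid URL format. A valid URL format means that the URL contains only what are termed "alpha | digit | safe | extra | escape" characters.

URL encoding is normally applied to data passed via HTML forms, because such data may contain special characters, such as "/", ".", "#" and so on, which could: a) have special meanings; or b) not be valid characters for a URL; or c) be altered during transfer. For instance, the "#" character needs to be encoded because it has a special meaning as an HTML anchor. That character also needs to be encoded because it is not allowed in a valid URL format. Also, some characters, such as "~", might not transport properly across the Internet.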
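
The "#" case can be sketched in Python like this. The search URL and the lang parameter are hypothetical names chosen for illustration.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Hypothetical form value containing "#".
value = "C#"

raw_url = "http://myapp/search?lang=" + value
encoded_url = "http://myapp/search?lang=" + quote(value, safe="")

# Unencoded: "#" starts the fragment, so the query value is cut short.
raw_query = parse_qs(urlsplit(raw_url).query)

# Encoded: "%23" is decoded back to "#" inside the value.
encoded_query = parse_qs(urlsplit(encoded_url).query)

print(raw_query)      # {'lang': ['C']}
print(encoded_query)  # {'lang': ['C#']}
```

Browsers also do not send the fragment to the server at all, so an unencoded "#" in form data silently truncates what the application receives.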
