Should I use accented characters in URLs?
When you create web content in languages other than English, the problem of search-engine-optimized, user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs, at the risk that some words have completely different meanings with and without certain accents, or whether it is better to keep the non-English characters where appropriate, sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters can appear anywhere: in document titles, in tags, in user names, etc., so they are not always under the complete control of the website's maintainer.
One possible approach, of course, is to also set up alternate, unaccented URLs that point to the original destination, but I would like to hear your opinions on using accented URLs as the primary document identifiers.
Comments (5)
There's no ambiguity here: RFC 3986 says no. A URI cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: Punycode strings are encoded and decoded by the browser on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the unencoded version won't work, because the underlying definition of URLs simply doesn't support it. For a URL to work consistently, you need to percent-encode it.
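To make the distinction between the displayed form and the form on the wire concrete, here is a minimal Python sketch. The path `/café` is a made-up example, and the host is the café.com example from the answer; the built-in `idna` codec follows the older IDNA 2003 rules, which is an assumption about what a given browser does.

```python
from urllib.parse import quote, unquote

# Hypothetical path and host containing U+00E9 (e with acute accent).
path = "/caf\u00e9"
host = "caf\u00e9.com"

# Percent-encode the UTF-8 bytes of the path: this is the ASCII-only form
# that is actually sent to the server.
encoded_path = quote(path)

# Convert the host name to its ASCII (Punycode) form with the built-in
# "idna" codec.
ascii_host = host.encode("idna").decode("ascii")

print(encoded_path)  # /caf%C3%A9
print(ascii_host)    # xn--caf-dma.com

# A browser that shows the accent is simply reversing the encoding for display.
print(unquote(encoded_path))
```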
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
And a rewriting+character translating function allows this reference
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
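The character-translating step could look roughly like the sketch below. This is not the answerer's actual function, which is not shown; `deaccent_slug` is a hypothetical helper that strips accents via Unicode decomposition and keeps only 0-9, A-Z, a-z and hyphens.

```python
import re
import unicodedata


def deaccent_slug(title: str) -> str:
    """Reduce a title to 0-9, A-Z, a-z and hyphens (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Dropping everything outside ASCII removes the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^0-9A-Za-z]+", "-", ascii_only).strip("-")


# Accented and unaccented spellings map to the same identifier.
print(deaccent_slug("Cr\u00e8me br\u00fbl\u00e9e"))  # Creme-brulee
print(deaccent_slug("Creme brulee"))                 # Creme-brulee
```

Letters that have no decomposition, such as "ß" or "ø", are simply dropped by this approach, so a real site would probably need an explicit translation table for them.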
Considering that URLs with accents often end up looking like this:
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar). So it is probably not supported in IE 6, at least, and there are still far too many people using that one :-(
URLs without accents may not look their best, but people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may type into the browser manually. They are fine in embedded links that the server pre-encodes.
We found that browsers can encode a URL in different ways, and it is very hard to figure out which encoding was used. See my question on this issue:
Handling Character Encoding in URI on Tomcat
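One common workaround for this ambiguity, sketched here in Python rather than for Tomcat, is to try UTF-8 first and fall back to a legacy code page only when the bytes are not valid UTF-8. The helper name `decode_path` and the ISO-8859-1 fallback are assumptions for illustration, not something taken from the linked question.

```python
from urllib.parse import unquote_to_bytes


def decode_path(raw_path: str, fallback: str = "iso-8859-1") -> str:
    """Decode a percent-encoded path whose byte encoding is unknown."""
    # Undo the %XX escapes without assuming any character encoding yet.
    raw_bytes = unquote_to_bytes(raw_path)
    try:
        # Valid UTF-8 is very unlikely to occur by accident, so try it first.
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the client used the legacy single-byte code page.
        return raw_bytes.decode(fallback)


print(decode_path("/caf%C3%A9"))  # bytes sent as UTF-8
print(decode_path("/caf%E9"))     # bytes sent as ISO-8859-1
```

Both calls yield the same string, but this is a heuristic: a client using some other code page would still be decoded incorrectly.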
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Names) rules and can contain most Unicode characters.
The path (after the first /), the user name and the password can again contain anything. They are escaped (as %XX), but the escapes only represent bytes. Which encoding those bytes are in is difficult to know (that is up to the HTTP server to interpret).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that application interprets the bytes is another story.
It is recommended that the path, user, password and arguments be UTF-8, but this is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with it can be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
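The per-component rules above can be seen by taking a URL apart, as in the following Python sketch. The URL is invented for illustration, and decoding the path and query as UTF-8 is the recommended convention mentioned above, not a guarantee about any particular server.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Hypothetical URL in its on-the-wire, ASCII-only form.
url = "http://user:secret@xn--caf-dma.com/men%C3%BC/caf%C3%A9?q=na%C3%AFve"
parts = urlsplit(url)

# The protocol (scheme) is plain ASCII.
scheme = parts.scheme

# The host follows IDN rules: Punycode on the wire, Unicode for display.
host = parts.hostname.encode("ascii").decode("idna")

# The path is percent-escaped bytes; here we assume they are UTF-8.
path = unquote(parts.path, encoding="utf-8")

# The query string is decoded separately, again assuming UTF-8.
query = parse_qs(parts.query, encoding="utf-8")

print(scheme, host, path, query)
```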