Normalizing (WebDAV) Unicode paths
I'm working on a WebDAV implementation for PHP. To make it easier for Windows and other operating systems to work together, I need to jump through some character encoding hoops.
Windows uses ISO-8859-1 in its HTTP requests, while most other clients encode anything beyond ASCII as UTF-8.
My first approach was to ignore this altogether, but I quickly ran into issues when returning URLs. I then figured it's probably best to normalize all URLs.
Using ü as an example, OS X will send this over the wire as
u%CC%88 (a literal "u" followed by codepoint U+0308, the combining diaeresis)
Windows sends this as:
%FC (latin1)
But doing a utf8_encode on %FC, I get:
%C3%BC (this is the precomposed codepoint U+00FC)
Should I treat %C3%BC and u%CC%88 as the same thing? If so, how? Not touching it seems to work OK for Windows: it somehow understands that it's a Unicode character, but updating the same file then throws an error (for no apparent reason).
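For reference, the two wire forms really are canonically equivalent; a small sketch (assuming the intl extension's Normalizer class is available) shows that normalizing both to NFC makes them compare equal:

```php
<?php
// The two wire forms of "ü" from the question:
$nfc = "\xC3\xBC";    // %C3%BC  -- precomposed U+00FC (what utf8_encode produces)
$nfd = "u\xCC\x88";   // u%CC%88 -- "u" + combining diaeresis U+0308 (what OS X sends)

// As raw bytes they differ, but after Unicode normalization (here to
// Form C, the composed form) they are the same string.
$same = Normalizer::normalize($nfc, Normalizer::FORM_C)
     === Normalizer::normalize($nfd, Normalizer::FORM_C);
var_dump($same); // bool(true)
```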
I'd be happy to provide more information.
Mac stores Unicode chars as "decomposed", that is, "u" + ¨ (diaeresis) instead of "ü". Normalizer can take care of that. If you don't have Normalizer, try
iconv('UTF8-MAC', 'UTF8', $str)
I hate answering my own questions, but here goes.
I ended up not bothering. I did extensive research on how the various operating systems encode paths and handle encodings. It turns out that in most cases other OSes handle paths in other normalization forms just fine. Windows handled it a bit poorly, but it works.
Whenever I receive a path that isn't UTF-8 at all, I try to detect the encoding and convert it to UTF-8.
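The detect-and-convert step might look like this. The helper name is mine, not from the answer, and the check is a heuristic (a few ISO-8859-1 byte sequences also happen to validate as UTF-8); it assumes ext-mbstring is available:

```php
<?php
// Hypothetical helper: if the raw path bytes already validate as UTF-8,
// keep them as-is; otherwise assume the client sent ISO-8859-1 (as
// Windows does) and convert. Requires the mbstring extension.
function pathToUtf8(string $path): string
{
    if (mb_check_encoding($path, 'UTF-8')) {
        return $path;
    }
    return mb_convert_encoding($path, 'UTF-8', 'ISO-8859-1');
}

var_dump(bin2hex(pathToUtf8("\xFC")));     // latin1 ü is converted  -> "c3bc"
var_dump(bin2hex(pathToUtf8("\xC3\xBC"))); // valid UTF-8, unchanged -> "c3bc"
```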