How will a browser interpret Unicode if I use it on an ISO-8859-1 site?

Posted 2024-09-03 06:25:34

I have a site that uses ISO-8859-1 encoding, and I can't change that. I want to be sure that the content I enter into the web app on the site gets parsed correctly. The parser works character by character. I can't change the parser either; I am just writing files for it to handle. The file content that I am telling the app to display after parsing contains Unicode characters (or at least I assume so, even though they were produced by Windows Alt codes mapped to CP437). Using entities is not an option because of the parser's character-by-character operation: the only characters it escapes on output are the markup-sensitive ones, such as the ampersand, less-than, and greater-than symbols. I would just put this through to see what it looks like, but the output can only be seen after publishing, which takes a couple of days to get approved, and that is asking too much for a single test case.

So, long story short: if I have the site output ▼ÇÑ¥☺☻ on a page whose meta tag says it uses ISO-8859-1, will a browser auto-detect the Unicode and display it, or will it interpret the bytes literally as ISO-8859-1 and show a different set of characters?

UPDATE: I made a temporary test site at http://doorstop.csh.rit.edu/home/testing. I created the test file in Notepad++ as UTF-8 with no BOM, but used a meta tag that sets the encoding to ISO-8859-1.

Comments (2)

水染的天色ゝ 2024-09-10 06:25:34

If you send UTF-8 to something told to expect ISO-8859-1, then yes, you'll be getting Mojibake :(

Consider that a multi-byte UTF-8 sequence is introduced simply by a byte with the high bit set (i.e. a value > 127), and that every such byte is also a valid character in a simple 8-bit encoding. How is something expecting a simple 8-bit character encoding going to decide that a particular sequence should be interpreted as UTF-8 and not as the encoding it was told to use?
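To make that concrete, here is a minimal Python sketch of what a consumer does when it is handed UTF-8 bytes but told they are ISO-8859-1. The helper name `misread_as_latin1` is made up for illustration, and the sketch assumes the file was saved as UTF-8, as described in the question's update.

```python
# The sample text from the question, saved by an editor set to UTF-8.
text = "▼ÇÑ¥☺☻"
utf8_bytes = text.encode("utf-8")


def misread_as_latin1(data: bytes) -> str:
    """Decode bytes as ISO-8859-1 (hypothetical helper).

    This can never raise: every byte value 0-255 maps to a code point in
    ISO-8859-1, so the decoder gets no signal that the data was really UTF-8.
    """
    return data.decode("iso-8859-1")


mojibake = misread_as_latin1(utf8_bytes)

# 6 characters became 15 bytes, and each byte is read as its own character.
print(len(text), len(utf8_bytes), len(mojibake))  # 6 15 15

# The original bytes are still intact, so the damage is reversible as long as
# nothing rewrites or drops bytes along the way.
recovered = mojibake.encode("iso-8859-1").decode("utf-8")
print(recovered == text)  # True
```

Every byte in `utf8_bytes` has the high bit set, yet the ISO-8859-1 decode succeeds silently, which is why nothing downstream is forced to notice the mismatch.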

感受沵的脚步 2024-09-10 06:25:34

The only characters that the parser escapes upon output are markup sensitive ones like ampersand, less than, and greater than symbols.

Anything outside ISO-8859-1 is likely to cause problems. HTML encoded as ISO-8859-1 can display characters like ▼☺☻, but only by escaping them as numeric character references such as &#x25BC;&#x263A;&#x263B;. Otherwise, they're simply outside the range of the encoding.

The characters ÇÑ¥ are supported by ISO-8859-1 and should not cause a problem in a correctly implemented system.
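As a rough illustration of the two points above, the Python sketch below checks which of the sample characters fit in ISO-8859-1 and escapes the rest as numeric character references. The function names `fits_latin1` and `to_latin1_html` are invented for this example. Note that it only helps if the resulting `&` reaches the browser unescaped, which, by the question's description of the parser, may not be the case here.

```python
def fits_latin1(ch: str) -> bool:
    """Return True if the character can be encoded in ISO-8859-1 (hypothetical helper)."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def to_latin1_html(text: str) -> bytes:
    """Encode as ISO-8859-1, replacing unencodable characters with &#NNNN; references."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sample = "▼ÇÑ¥☺☻"

# Only the characters that have a slot in ISO-8859-1 survive as single bytes.
in_range = [ch for ch in sample if fits_latin1(ch)]
print(in_range)  # ['Ç', 'Ñ', '¥']

# The others become decimal character references (9660 == 0x25BC, and so on).
escaped = to_latin1_html(sample)
print(escaped)  # b'&#9660;\xc7\xd1\xa5&#9786;&#9787;'
```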

Whether the parser could be used to parse the file correctly prior to display depends on its implementation and whether it and its web container respect any encoding metadata you might be able to send it.

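Unicode is a character set supported by multiple encodings. For example, U+263A ☺ encoded as UTF-8 becomes the bytes e2 98 ba, which would be displayed as â˜º if treated as ISO-8859-1. (Browsers in practice decode pages labelled ISO-8859-1 as windows-1252; under strict ISO-8859-1 the middle byte 0x98 is an invisible C1 control character rather than ˜.)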
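The byte-level mapping in that example can be checked with a few lines of Python. This is only a sketch: it uses Python's `iso-8859-1` and `cp1252` codecs as stand-ins for what a browser does, which is an assumption about the consumer rather than something stated in the thread.

```python
smiley = "\u263a"  # WHITE SMILING FACE
raw = smiley.encode("utf-8")
print(raw.hex())  # e298ba

# Strict ISO-8859-1: 0xE2 -> 'â', 0x98 -> C1 control U+0098, 0xBA -> 'º'.
as_latin1 = raw.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in as_latin1])  # ['U+00E2', 'U+0098', 'U+00BA']

# windows-1252 differs only in the middle byte: 0x98 -> '˜' (U+02DC).
as_cp1252 = raw.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in as_cp1252])  # ['U+00E2', 'U+02DC', 'U+00BA']
```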
