Problems displaying Russian letters in the browser even with UTF-8 encoding set

Posted 2024-11-30 18:18:49

I am aware that similar questions have been asked. However, after reading the answers and googling the topic, I am still struggling to display Russian letters in the browser. I have them stored in a .csv file (encoded as UTF-8 without BOM). In my PHP file that reads the .csv (the PHP file is also encoded as UTF-8 without BOM) I declared the charset:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

To open and iterate through the .csv file I am using the following code:

  if(($handle = fopen($path, "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
      ...
    }
  }

and either nothing is displayed or something like this:

 -ам-Зее

instead of

 Целль-ам-Зее

Any ideas what else I can try?

UPDATE:

After setting the browser encoding to UTF-8 I get correct Russian letters. However, some of the text is still not displayed at all. I suspect that I am doing something incorrectly while reading the .csv file; the simplified version is:

     if (($handle = fopen($path, "r")) !== FALSE) {
       while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
         echo $data[1];
       }
     }

(I omit the first column and display the content of the second one, which is always filled.)

Comments (2)

手心的海 2024-12-07 18:18:49

Check Your Server Config

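Do you have Apache configured to honor the <meta> charset override? When AddDefaultCharset is set (commonly to ISO-8859-1), Apache sends that charset in the Content-Type response header, and browsers give the header precedence over any <meta> declaration in the pages it serves.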

Solution #1 of 3

For example, you can put this in your .htaccess file for an enclosing directory, and now your web pages will have their <meta> overrides honored:

AddDefaultCharset Off
AddCharset UTF-8 .html

The Apache documentation states:

This directive specifies a default value for the media type charset parameter (the name of a character encoding) to be added to a response if and only if the response's content-type is either text/plain or text/html. This should override any charset specified in the body of the response via a META element, though the exact behavior is often dependent on the user's client configuration. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables a default charset of iso-8859-1. Any other value is assumed to be the charset to be used, which should be one of the IANA registered charset values for use in MIME media types. For example:

   AddDefaultCharset utf-8     

AddDefaultCharset should only be used when all of the text resources to which it applies are known to be in that character encoding and it is too inconvenient to label their charset individually. One such example is to add the charset parameter to resources containing generated content, such as legacy CGI scripts, that might be vulnerable to cross‐site scripting attacks due to user‐provided data being included in the output. Note, however, that a better solution is to just fix (or delete) those scripts, since setting a default charset does not protect users that have enabled the “auto‐detect character encoding” feature on their browser.

Until I turned off AddDefaultCharset, I could not get my <meta> tags to work. It was quite mysterious and frustrating. Once I did, though, everything worked smoothly.
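
To find out which charset the server is actually announcing, it can help to look at the Content-Type response header, for instance in the browser's network tools. The sketch below is a hypothetical helper, charset_from_content_type(), which is not part of PHP or Apache; it only pulls the charset parameter out of a header line. The header strings are made-up samples.

```php
<?php
// Hypothetical helper (not a PHP built-in): extract the charset parameter
// from a Content-Type header line, or return null when none is present.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $header, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Sample header lines (illustrative values, not taken from a real server).
var_dump(charset_from_content_type('Content-Type: text/html; charset=ISO-8859-1'));
var_dump(charset_from_content_type('Content-Type: text/html'));
```

If this reports iso-8859-1 for a page whose bytes are UTF-8, the header rather than the <meta> tag is the thing to fix.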

Solution #2 of 3

If you have write access to Apache’s configuration files, then you can change the server itself. However, you have to make sure nothing else relies on the old unoverridable setting. This is another reason to use .htaccess.


When All Else Fails: Solution #3 of 3

If you can neither change the overall server configuration itself nor create a .htaccess whose own settings will be respected for anything underneath it, then your only option is to use numeric entities for all code points over 127. For example, instead of

Целль-ам-Зее

you must instead use

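&#1062;&#1077;&#1083;&#1083;&#1100;-&#1072;&#1084;-&#1047;&#1077;&#1077;

or

&#x426;&#x435;&#x43B;&#x43B;&#x44C;-&#x430;&#x43C;-&#x417;&#x435;&#x435;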

The advantage of that is that it no longer requires a <meta> override or fiddling with the server or with .htaccess files. The disadvantage is that it takes an extra translation pass, which interferes with being able to directly edit the file with an editor that understands literal UTF‑8.
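
The extra translation pass could look roughly like the following. This is only a sketch: utf8_codepoint() and to_numeric_entities() are hypothetical names, the input is assumed to be valid UTF-8, and the code avoids the mbstring extension on the assumption that it may not be installed.

```php
<?php
// Hypothetical helper: decode one UTF-8 encoded character to its code point.
function utf8_codepoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // The lead byte of an n-byte sequence carries (7 - n) payload bits.
    $cp = $bytes[0] & (0xFF >> ($n + 1));
    for ($i = 1; $i < $n; $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F); // 6 payload bits per continuation byte
    }
    return $cp;
}

// Hypothetical helper: replace every code point above 127 with a decimal
// numeric entity. Invalid UTF-8 input is returned unchanged.
function to_numeric_entities(string $text): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8_codepoint($m[0]) . ';';
        },
        $text
    );
    return $out ?? $text;
}

echo to_numeric_entities('Целль-ам-Зее'), "\n";
```

Because the result is plain ASCII, it renders the same whichever charset the server announces.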

Entities Ignore Encodings

The reason it works is because all HTML is always in Unicode, so character number 1062 is always CYRILLIC CAPITAL LETTER TSE, etc. Entity numbers always represent Unicode code point numbers; they are never the numbers from the document encoding. Only encoded bytes count as being in the server or page encoding, not unencoded code point numbers which are always Unicode.

That’s why we can use something like &#233; and it always means LATIN SMALL LETTER E WITH ACUTE, because code point 233 is always that character, even if the web page itself is in some other encoding (where the byte might be 142, as in MacRoman, or 221, as in NextStep).

The numbers of characters are always Unicode numbers, and pay no attention to the encoding. That’s because markup languages like HTML, XHTML, and XML always use logical Unicode code point numbers, just like programming languages like Perl and Go do. (PHP is really just bytes with some UTF‑8 APIs on top of it, but as you have yourself learned, one still has issues with it. This is partly because of its internal model and partly because of web servers and even web clients, all of which make everything more complicated in PHP than in most other languages.)

Even if you had encoded your web page in ISO-8859-5 for Cyrillic, where a literal 0xC6 byte encodes Unicode U+0426, CYRILLIC CAPITAL LETTER TSE, as a character entity you would use &#1062; or &#x426;, and not &#xC6;, which would be wrong since U+00C6 is LATIN CAPITAL LETTER AE.

Similarly, if you were using the MacCyrillic encoding, the literal 0x96 byte would be a CYRILLIC CAPITAL LETTER TSE, but because the numeric entity is always in Unicode, you must use &#1062; or &#x426;, and not &#x96;.
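
The difference between byte values and entity numbers can be checked with a few lines of PHP. This is a small sketch that assumes the iconv extension is available and uses ISO-8859-5 as the legacy Cyrillic encoding; the variable names are only illustrative.

```php
<?php
$flags = ENT_QUOTES | ENT_HTML5;

// Byte 0xC6 interpreted as ISO-8859-5 and converted to UTF-8.
$fromLegacyByte = iconv('ISO-8859-5', 'UTF-8', "\xC6");

// Numeric entities are resolved as Unicode code points, whatever the page encoding.
$fromDecimalEntity = html_entity_decode('&#1062;', $flags, 'UTF-8');
$fromHexEntity     = html_entity_decode('&#x426;', $flags, 'UTF-8');

// Using the byte value as the entity number picks a different character.
$fromByteValueEntity = html_entity_decode('&#xC6;', $flags, 'UTF-8');

var_dump($fromLegacyByte === $fromDecimalEntity);      // same letter
var_dump($fromDecimalEntity === $fromHexEntity);       // same letter
var_dump($fromByteValueEntity === $fromDecimalEntity); // different letter
```

The first two comparisons hold because all three values are the letter Ц; the last one does not, because entity number 0xC6 is the Latin letter Æ.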

I prefer using only UTF‑8 for all web pages. Well, for new ones, that is. I do recognize that legacy non‐Unicode pages exist. Those I just leave as is.

乱世争霸 2024-12-07 18:18:49

You need to set the correct locale on your server.

if(!setlocale(LC_ALL, 'ru_RU.utf8')) 
    setlocale(LC_ALL, 'en_US.utf8');

Then you can check whether your server has accepted the requested locale:

if(setlocale(LC_ALL, 0) == 'C')
    echo 'Error setting locale';
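
Locale names differ between systems, so it may be worth trying several spellings in one call. setlocale() accepts an array of candidates and applies the first one that is installed. The sketch below wraps that in a hypothetical helper; the candidate names are assumptions and have to be adjusted to what the server provides (for example, the output of `locale -a`).

```php
<?php
// Hypothetical helper: try each candidate in order and return the locale
// that was applied, or false if none of them is installed.
function set_first_available_locale(array $candidates)
{
    return setlocale(LC_ALL, $candidates);
}

// Candidate names are assumptions; installed locales vary by system.
$chosen = set_first_available_locale(
    ['ru_RU.utf8', 'ru_RU.UTF-8', 'en_US.utf8', 'en_US.UTF-8', 'C.UTF-8']
);

if ($chosen === false) {
    echo "None of the candidate locales is installed\n";
} else {
    echo "Locale set to $chosen\n";
}
```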

The problem is in the fgetcsv function, which depends on the locale and is running with an incorrect one. If you cannot change the locale, you could replace fgetcsv with your own function built on explode.
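
A replacement built on explode might look like the sketch below. read_simple_csv() is a hypothetical name, and the approach is deliberately naive: it assumes fields are never quoted and never contain the delimiter or a line break, so it is not a general CSV parser. Since explode works on bytes and the delimiter is ASCII, UTF-8 text passes through untouched regardless of the locale.

```php
<?php
// Hypothetical helper: naive, locale-independent stand-in for fgetcsv().
// Assumption: no quoted fields, no delimiter or newline inside a field.
function read_simple_csv(string $path, string $delimiter = ';'): array
{
    $rows = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $rows;
    }
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue; // skip empty lines
        }
        $rows[] = explode($delimiter, $line);
    }
    fclose($handle);
    return $rows;
}

// Demo with a temporary file holding made-up sample data.
$demoPath = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($demoPath, "1;Целль-ам-Зее\n2;Москва\n");
foreach (read_simple_csv($demoPath, ';') as $data) {
    echo $data[1], "\n"; // second column, as in the question
}
unlink($demoPath);
```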
