UTF-8 encoding problem with XSLT through PHP

Posted on 2024-09-18 04:51:14

I'm facing a nasty encoding issue when transforming XML via XSLT through PHP.

The problem can be summarised/dumbed down as follows: when I copy a (UTF-8 encoded) XHTML file with an XSLT stylesheet, some characters are displayed wrong. When I just show the same XHTML file, all characters come out correctly.

The following files illustrate the problem:

XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>encoding test</title>
    </head>
    <body>
        <p>This is how we d&#239;&#223;&#960;&#955;&#509; &#145;special characters&#146;</p>
    </body>
</html>

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="xml" encoding="UTF-8"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

PHP

<?php
  $xml_file = 'encoding_test.xml';
  $xsl_file = 'encoding_test.xsl';

  $xml_doc = new DOMDocument('1.0', 'utf-8');
  $xml_doc->load($xml_file);

  $xsl_doc = new DOMDocument('1.0', 'utf-8');
  $xsl_doc->load($xsl_file);

  $xp = new XsltProcessor();
  $xp->importStylesheet($xsl_doc);

  // allow bypassing the XSLT transformation with the bypass=true request parameter
  if (!empty($_GET['bypass'])) {
    echo file_get_contents($xml_file);
  }
  else {
    echo $xp->transformToXML($xml_doc);
  }
?>

When this script is invoked as such (via e.g. http://localhost/encoding_test/encoding_test.php), all characters in the transformed XHTML document come out ok, except for the &#145; and &#146; character entities (they're opening and closing single quotation marks). I'm not a Unicode expert, but two things strike me:

  1. all other character entities are interpreted correctly (which could imply something about the UTF-8-ness of &#145; and &#146;)
  2. yet, when the XHTML file is displayed unmediated (via e.g. http://localhost/encoding_test/encoding_test.php?bypass=true), all characters are displayed properly.

I think I've declared UTF-8 encoding for the output everywhere I could. Does anyone see what's wrong and how it can be fixed?

Thanks in advance!

Ron Van den Branden

Comments (1)

耳根太软 2024-09-25 04:51:20

&#145; and &#146; are not visible Unicode characters.

They are old HTML character references [1] for single quotes, but when you process them using an XSLT processor, the processor doesn't see single quotes but the Unicode characters with decimal codes 145 and 146, i.e. U+0091 and U+0092.

These characters are private use (i.e. the usage is not defined by the Unicode consortium) C1 control codes.
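To make this concrete, here is a minimal sketch (assuming PHP's dom extension is available, as in the question's script) that parses the two references and dumps the bytes the parser produces. The sample markup is made up for illustration.

```php
<?php
// Parse a tiny document containing the two problematic references.
$doc = new DOMDocument();
$doc->loadXML('<p>&#145;x&#146;</p>');

// DOM strings are always UTF-8 in PHP, whatever the source encoding was.
$text = $doc->documentElement->textContent;

// U+0091 is encoded as C2 91 and U+0092 as C2 92 in UTF-8:
// control characters, not the quotation marks the author intended.
echo bin2hex($text), "\n"; // c29178c292
```

The bytes `c2 91` and `c2 92` are what ends up in the transformed output, which is why a browser reading that output as UTF-8 shows nothing useful for them.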

The solution is to use the correct Unicode characters &#8216; and &#8217;.
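If the source files cannot easily be corrected by hand, one possible approach is to rewrite the offending references before loading the document. The sketch below is only an illustration: `fixCp1252References` and `CP1252_TO_UNICODE` are hypothetical names, the table is the standard Windows-1252 assignment for bytes 0x80 to 0x9F, and only decimal references are handled.

```php
<?php
// Windows-1252 bytes 0x80-0x9F mapped to the Unicode code points they stand for.
// The five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are left out on purpose.
const CP1252_TO_UNICODE = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

// Hypothetical helper: rewrite decimal references such as &#145; to &#8216;.
function fixCp1252References(string $xml): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            $code = (int) $m[1];
            return array_key_exists($code, CP1252_TO_UNICODE)
                ? '&#' . CP1252_TO_UNICODE[$code] . ';'
                : $m[0]; // leave every other reference untouched
        },
        $xml
    );
}

echo fixCp1252References('&#145;special characters&#146;'), "\n";
// &#8216;special characters&#8217;
```

In the question's script this could be applied as `$xml_doc->loadXML(fixCp1252References(file_get_contents($xml_file)))` instead of `$xml_doc->load($xml_file)`, although fixing the source file itself is the cleaner option.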

[1] In fact, these are codes that map to Windows-1252 encoding. They are displayed by browsers but they are actually not valid in HTML:

NOTE -- the above SGML declaration, like that of HTML 2.0,
specifies the character numbers 128 to 159 (80 to 9F hex)
as UNUSED. This means that numeric character references
within that range (e.g. &#146;) are illegal in HTML. Neither ISO 8859-1 nor ISO 10646 contain characters in that
range, which is reserved for control characters.
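To check whether a document contains references in that range before transforming it, something along these lines may help. It is a rough sketch: `findC1References` is a hypothetical helper, and it simply pattern-matches decimal and hexadecimal references without parsing the markup.

```php
<?php
// Hypothetical helper: list numeric character references that fall in the
// 128-159 (0x80-0x9F) range, in decimal or hexadecimal notation.
function findC1References(string $markup): array
{
    preg_match_all('/&#(?:(\d+)|[xX]([0-9A-Fa-f]+));/', $markup, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        // Group 2 is only present (and non-empty) for hexadecimal references.
        $code = isset($m[2]) && $m[2] !== '' ? hexdec($m[2]) : (int) $m[1];
        if ($code >= 128 && $code <= 159) {
            $found[] = $m[0];
        }
    }
    return $found;
}

echo implode(' ', findC1References('d&#239;&#223; &#145;quoted&#146; &#x85;')), "\n";
// &#145; &#146; &#x85;
```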
