UTF-8 encoding problem when using XSLT through PHP

Posted on 2024-09-18 04:51:14

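I'm facing a nasty encoding issue when transforming XML via XSLT through PHP.

The problem can be summarised as follows: when I copy a (UTF-8 encoded) XHTML file with an XSLT stylesheet, some characters are displayed incorrectly. When I just show the same XHTML file, all characters come out correctly.

The following files illustrate the problem: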

XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>encoding test</title>
    </head>
    <body>
        <p>This is how we d&#239;&#223;&#960;&#955;&#509; &#145;special characters&#146;</p>
    </body>
</html>

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="xml" encoding="UTF-8"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

PHP

<?php
  $xml_file = 'encoding_test.xml';
  $xsl_file = 'encoding_test.xsl';

  $xml_doc = new DOMDocument('1.0', 'utf-8');
  $xml_doc->load($xml_file);

  $xsl_doc = new DOMDocument('1.0', 'utf-8');
  $xsl_doc->load($xsl_file);

  $xp = new XSLTProcessor();
  $xp->importStylesheet($xsl_doc);

  // allow bypassing the XSLT transformation with the bypass=true request parameter
  // (!empty() avoids an undefined-index notice when the parameter is absent)
  if (!empty($_GET['bypass'])) {
    echo file_get_contents($xml_file);
  }
  else {
    echo $xp->transformToXML($xml_doc);
  }
?>

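When this script is invoked as such (e.g. via http://localhost/encoding_test/encoding_test.php), all characters in the transformed XHTML document come out fine, except for the &#145; and &#146; character references (they are meant to be opening and closing single quotation marks). I'm not a Unicode expert, but two things strike me:

all other character references are interpreted correctly (which could imply something about the UTF-8-ness of &#145; and &#146;)
yet, when the XHTML file is displayed unmediated (e.g. via http://localhost/encoding_test/encoding_test.php?bypass=true), all characters are displayed properly.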
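One way to probe the observation about &#145; and &#146; is to check which code points those references denote. In XML a numeric character reference names a Unicode code point, so &#145; and &#146; are U+0091 and U+0092, which are C1 control characters. The curly single quotes are U+2018 and U+2019 (&#8216; and &#8217;); only in Windows-1252 do the bytes 0x91 and 0x92 stand for them. The sketch below uses a hypothetical helper, utf8_hex_of_reference, which is not part of the original script, to print the UTF-8 bytes the DOM parser produces for each reference.

```php
<?php
// Hypothetical helper (not in the original script): parse a one-element XML
// snippet containing the given character reference and return the UTF-8
// bytes of the resulting text as a hex string.
function utf8_hex_of_reference(string $reference): string
{
    $doc = new DOMDocument();
    $doc->loadXML('<p>' . $reference . '</p>');
    // DOM strings in PHP are UTF-8, so bin2hex shows the encoded bytes.
    return bin2hex($doc->documentElement->textContent);
}

// Compare the references used in the test file with the ones for U+2018/U+2019.
foreach (['&#145;', '&#146;', '&#8216;', '&#8217;'] as $reference) {
    echo $reference, ' => ', utf8_hex_of_reference($reference), PHP_EOL;
}
```

If the first two lines show c291 and c292, the transformed output carries the control characters themselves rather than quotation marks. A plausible reason the bypass path still looks right, not verified here, is that the browser's HTML parser treats an unparsed &#145; or &#146; leniently as the Windows-1252 quote. Writing &#8216; and &#8217; in the source file would sidestep the difference.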

I think I've declared UTF-8 encoding for the output everywhere I could. Can anyone see what's wrong and how to fix it?
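One place where the script does not declare an encoding is the HTTP response header, which takes precedence over the in-document declarations when a browser decides how to decode a page. A minimal sketch follows, assuming the response should be served as application/xhtml+xml (an assumption; the post does not say which media type is wanted) and using a hypothetical helper, content_type_header:

```php
<?php
// Hypothetical helper (not in the original script): build a Content-Type
// header line with an explicit charset parameter.
function content_type_header(string $mediaType, string $charset = 'UTF-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mediaType, $charset);
}

// Headers must be sent before any output; headers_sent() guards against
// the case where output has already been flushed.
if (!headers_sent()) {
    // Assumed media type; text/html would be the other obvious choice.
    header(content_type_header('application/xhtml+xml'));
}
```

This would have to run before either echo in the script. It only declares the encoding of the response; it does not change which code points the character references resolve to.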

Thanks in advance!

Ron Van den Branden

