UTF-8 encoding issue with XSLT through PHP
I'm facing a nasty encoding issue when transforming XML via XSLT through PHP.
The problem can be summarised/dumbed down as follows: when I copy a (UTF-8 encoded) XHTML file with an XSLT stylesheet, some characters are displayed wrong. When I just show the same XHTML file, all characters come out correctly.
The following files illustrate the problem:
XHTML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>encoding test</title>
</head>
<body>
<p>This is how we dïßπλǽ &#145;special characters&#146;</p>
</body>
</html>
XSLT
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="xml" encoding="UTF-8"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
PHP
<?php
$xml_file = 'encoding_test.xml';
$xsl_file = 'encoding_test.xsl';
$xml_doc = new DOMDocument('1.0', 'utf-8');
$xml_doc->load($xml_file);
$xsl_doc = new DOMDocument('1.0', 'utf-8');
$xsl_doc->load($xsl_file);
$xp = new XsltProcessor();
$xp->importStylesheet($xsl_doc);
// allow bypassing the XSLT transformation with the bypass=true request parameter
if (!empty($_GET['bypass'])) {
echo file_get_contents($xml_file);
}
else {
echo $xp->transformToXML($xml_doc);
}
?>
When this script is invoked as such (e.g. via http://localhost/encoding_test/encoding_test.php), all characters in the transformed XHTML document come out OK, except for the &#145; and &#146; character entities (they're opening and closing single quotation marks). I'm not a Unicode expert, but two things strike me:
- all other character entities are interpreted correctly (which could imply something about the UTF-8-ness of &#145; and &#146;)
- yet, when the XHTML file is displayed unmediated (e.g. via http://localhost/encoding_test/encoding_test.php?bypass=true), all characters are displayed properly.
I think I've declared UTF-8 encoding for the output everywhere I could. Does anyone see what's wrong and how it can be fixed?
Thanks in advance!
Ron Van den Branden
Comments (1)
&#145; and &#146; are not visible Unicode characters. They are old HTML character references [1] for single quotes, but when you process them with an XSLT processor, the processor doesn't see single quotes; it sees the Unicode characters with decimal codes 145 and 146, i.e. U+0091 and U+0092.
These characters are private-use C1 control codes (i.e. their usage is not defined by the Unicode Consortium).
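To check what the parser actually sees, a small diagnostic along these lines may help. It is only a sketch: the inline `<p>` document and the variable names are made up for illustration, and it assumes PHP's DOM extension is available. PHP's DOM always hands back UTF-8 strings, so dumping the bytes shows which code points the references resolved to.

```php
<?php
// Parse a tiny, made-up document that uses the same numeric references.
$doc = new DOMDocument();
$doc->loadXML('<p>&#145;quoted&#146;</p>');

// Text obtained from the DOM is UTF-8, so inspect the raw bytes.
$text  = $doc->documentElement->textContent;
$first = substr($text, 0, 2);  // bytes of the first character
$last  = substr($text, -2);    // bytes of the last character

echo bin2hex($first), "\n";       // c291   = U+0091, a C1 control code
echo bin2hex($last), "\n";        // c292   = U+0092, a C1 control code
echo bin2hex("\u{2018}"), "\n";   // e28098 = U+2018, the real opening quote
```

The first two lines of output are two-byte sequences for control characters, not the three-byte sequences of the typographic quotes, which is why the transformed page shows nothing useful in their place.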
The solution is to use the correct Unicode characters, &#8216; and &#8217;.
[1] In fact, these are codes that map to the Windows-1252 encoding. They are displayed by browsers, but they are actually not valid in HTML:
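If editing the source file is not an option, the same replacement could be applied to the raw XML before it is loaded. The following is a rough sketch, not a complete solution: `fixCp1252QuoteReferences` is a hypothetical helper, the inline sample string stands in for the contents of encoding_test.xml, and only the two references discussed in this answer are handled (hex forms such as `&#x91;` and other Windows-1252 code points are not).

```php
<?php
// Hypothetical helper: swap the Windows-1252-style references for the
// proper Unicode ones before the XML parser ever sees them.
function fixCp1252QuoteReferences(string $xml): string
{
    return strtr($xml, [
        '&#145;' => '&#8216;', // U+2018 LEFT SINGLE QUOTATION MARK
        '&#146;' => '&#8217;', // U+2019 RIGHT SINGLE QUOTATION MARK
    ]);
}

// Made-up sample input standing in for the real file contents.
$source = '<p>&#145;special characters&#146;</p>';

$doc = new DOMDocument();
$doc->loadXML(fixCp1252QuoteReferences($source));

// The text node now holds real quotation marks, encoded as UTF-8.
echo $doc->documentElement->textContent, "\n";
```

In the script from the question this would mean reading the file with `file_get_contents()`, passing the string through the helper, and calling `loadXML()` instead of `load()`. Note that a plain string replacement also touches references inside comments or CDATA sections, so fixing the source document itself remains the cleaner route.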