PHP DOM UTF-8 problem
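First of all, my database uses Windows-1250 as its native charset. I output the data as UTF-8. I use the iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings, and it works perfectly.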
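For readers who want to confirm that the iconv() step itself is behaving, here is a minimal sketch that compares raw bytes before and after conversion. The sample byte is an assumption chosen for illustration (0xE8 is "č" in Windows-1250); it is not taken from the database in question.

```php
<?php
// "\xE8" is the single Windows-1250 byte for "č" (U+010D). Illustrative sample only.
$cp1250 = "\xE8";
$utf8 = iconv('Windows-1250', 'UTF-8', $cp1250);

// Comparing hex dumps avoids confusion caused by the terminal or browser encoding.
echo bin2hex($cp1250), ' -> ', bin2hex($utf8), "\n"; // prints "e8 -> c48d"
```

If the hex dump shows the expected two-byte sequence, the conversion is fine and the corruption happens later, for example inside the DOM parsing step.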
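The problem arises when I use PHP DOM to parse some HTML stored in the database (the HTML is output from a WYSIWYG editor and is not valid: it has no html, head, or body tags, etc.).
The HTML might look like this, for example: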
<p>Hello</p>
Here is the method I use to parse some HTML from the database:
private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters
    $doc = new DOMDocument('1.0', 'UTF-8');
    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;
    foreach($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }
    $html = trim($doc->saveHTML());
    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}
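The output of the method above is garbage: all special characters are replaced with weird stuff like ¡
One more thing. It does work on my development server.
It does not work on the production server, though.
Any suggestions?
PHP version on the production server: PHP Version 5.2.0RC4-dev
PHP version on the development server: PHP Version 5.2.13
UPDATE:
I am working on a solution myself. I was inspired by this PHP bug report (not really a bug, though): http://bugs.php.net/bug.php?id=32547
This is my proposed solution. I will try it tomorrow and let you know whether it works: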
private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters
    $doc = new DOMDocument('1.0', 'UTF-8');
    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;
    // this might work
    // it basically just adds head and meta tags to the document
    $html = $doc->getElementsByTagName('html')->item(0);
    $head = $doc->createElement('head', '');
    $meta = $doc->createElement('meta', '');
    $meta->setAttribute('http-equiv', 'Content-Type');
    $meta->setAttribute('content', 'text/html; charset=utf-8');
    $head->appendChild($meta);
    $body = $doc->getElementsByTagName('body')->item(0);
    $html->removeChild($body);
    $html->appendChild($head);
    $html->appendChild($body);
    foreach($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }
    $html = trim($doc->saveHTML());
    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}
Comments (3)
Your "hack" doesn't make sense.
You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work, because of how the DOM extension handles HTML files.
I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.
EDIT: That suggestion is not very good, because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8.
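The meta approach from this answer can be sketched roughly as follows. This is a reconstruction under assumptions, not the answerer's original snippet: the helper name loadUtf8Fragment and the sample fragment are hypothetical, and it relies on libxml's HTML parser picking up the charset from a meta http-equiv Content-Type declaration, which is the commonly reported behaviour.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment into a DOMDocument.
function loadUtf8Fragment($fragment)
{
    $doc = new DOMDocument();
    // The meta declaration tells the HTML parser which charset the input bytes use.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    // @ silences parser warnings about the non-standard markup WYSIWYG editors emit.
    @$doc->loadHTML($meta . $fragment);
    return $doc;
}

// "čaj" written as UTF-8 bytes, i.e. what iconv('Windows-1250', 'UTF-8', ...) would return.
$doc = loadUtf8Fragment("<p>\xC4\x8Daj</p>");
$p = $doc->getElementsByTagName('p')->item(0);

// DOM string properties are exposed as UTF-8, so the bytes should round-trip unchanged.
echo bin2hex($p->textContent), "\n";
```

In the method from the question, this would mean passing the iconv() result together with the meta prefix to loadHTML() instead of the <?xml encoding="UTF-8"> prefix.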
Two solutions.
You can either set the encoding as a header, or you can set it as a META tag.
EDIT: If both of these are set correctly and you are confident that the correct header is being sent, then your best chance of finding the error is to start looking at the raw bytes. Identical bytes sent to an identical browser will yield the same result, so you need to find out why they are not identical. Fiddler or Wireshark will help with that.
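The snippets for this answer did not survive, so the following is only a sketch of what the two options and the raw-byte check might look like. The sample string is an assumption for illustration; on a real page you would dump the string that renders incorrectly.

```php
<?php
// Option 1: declare the charset in the HTTP response header (must run before any output).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Option 2: declare the charset in the markup itself.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
echo $meta, "\n";

// Raw-byte check: hex dumps are not affected by how a browser or terminal decodes text.
$sample = "\xC4\x8D"; // illustrative value: "č" correctly encoded as UTF-8
echo bin2hex($sample), "\n"; // prints "c48d"

// The same character after being misread as ISO-8859-1 and converted to UTF-8 again.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $sample);
echo bin2hex($doubleEncoded), "\n"; // prints "c384c28d"
```

Seeing a pattern like c384c28d where c48d was expected suggests that UTF-8 bytes were treated as a single-byte charset somewhere along the way, which matches the kind of garbage described in the question.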
I had the same problem. My fix was to use Notepad++ and set the encoding of the PHP document to "UTF-8 without BOM". Hope this helps others.
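If you would rather check for a BOM from code than from an editor, a small sketch along these lines might help. The function name hasUtf8Bom is hypothetical; the three bytes EF BB BF are the standard UTF-8 byte order mark.

```php
<?php
// Hypothetical helper: report whether a string starts with the UTF-8 byte order mark.
function hasUtf8Bom($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

var_dump(hasUtf8Bom("\xEF\xBB\xBF<p>Hello</p>")); // bool(true)
var_dump(hasUtf8Bom("<p>Hello</p>"));             // bool(false)
```

To test a source file, you could pass it the result of file_get_contents() for that file.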