当前位置：文江博客话题详情

PHP DOM UTF-8 问题

发布于 2024-09-15 18:58:14 字数 5126 浏览 8 评论 0 原文

首先，我的数据库使用 Windows-1250 作为本机字符集。我将数据输出为 UTF-8。我在我的网站上到处使用 iconv() 函数将 Windows-1250 字符串转换为 UTF-8 字符串，效果非常好。

问题是当我使用 PHP DOM 解析数据库中存储的一些 HTML 时（HTML 是 WYSIWYG 编辑器的输出，并且无效，它没有 html、head、body 标签等）。

HTML 可能看起来像这样，例如：

<p>Hello</p>

这是我用来解析数据库中的某个 HTML 的方法：

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

上面方法的输出是垃圾，所有特殊字符都被替换为奇怪的东西，如 ¡

还有一件事。它在我的开发服务器上确实有效。

但它不适用于生产服务器。

有什么建议吗？

生产服务器的 PHP 版本：PHP 版本 5.2.0RC4-dev

开发服务器的 PHP 版本：PHP 版本 5.2.13

更新：

我自己正在研究解决方案。我从这个 PHP 错误报告中得到了灵感（虽然不是真正的错误）： http:// /bugs.php.net/bug.php?id=32547

这是我建议的解决方案。我明天会尝试一下，让你知道它是否有效：

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

原文

First of all, my database uses Windows-1250 as native charset. I am outputting the data as UTF-8. I'm using iconv() function all over my website to convert Windows-1250 strings to UTF-8 strings and it works perfect.

The problem is when I'm using PHP DOM to parse some HTML stored in the database (the HTML is an output from a WYSIWYG editor and is not valid, it has no html, head, body tags etc).

The HTML could look something like this, for example:

<p>Hello</p>

Here is a method I use to parse a certain HTML from the database:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

The output from the method above is a garbage with all special characters replaced with weird stuff like ÃšÄ�.

One more thing. It does work on my development server.

It does not work on the production server though.

Any suggestions?

PHP version of the production server: PHP Version 5.2.0RC4-dev

PHP version of the development server: PHP Version 5.2.13

UPDATE:

I'm working on a solution myself. I have an inspiration from this PHP bug report (not really a bug though): http://bugs.php.net/bug.php?id=32547

This is my proposed solution. I will try it tomorrow and let you know if it works:

 private function ParseSlideContent($slideContent)
 {
        var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

  $doc = new DOMDocument('1.0', 'UTF-8');

  // hack to preserve UTF-8 characters
  $html = iconv('Windows-1250', 'UTF-8', $slideContent);
  $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
  $doc->preserveWhiteSpace = false;

  // this might work
  // it basically just adds head and meta tags to the document
  $html = $doc->getElementsByTagName('html')->item(0);
  $head = $doc->createElement('head', '');
  $meta = $doc->createElement('meta', '');
  $meta->setAttribute('http-equiv', 'Content-Type');
  $meta->setAttribute('content', 'text/html; charset=utf-8');
  $head->appendChild($meta);
  $body = $doc->getElementsByTagName('body')->item(0);
  $html->removeChild($body);
  $html->appendChild($head);
  $html->appendChild($body);

  foreach($doc->getElementsByTagName('img') as $t) {
   $path = trim($t->getAttribute('src'));
   $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
  }
  foreach ($doc->getElementsByTagName('object') as $o) {
   foreach ($o->getElementsByTagName('param') as $p) {
    $path = trim($p->getAttribute('value'));
    $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   }
  }
  foreach ($doc->getElementsByTagName('embed') as $e) {
   if (true === $e->hasAttribute('pluginspage')) {
    $path = trim($e->getAttribute('src'));
    $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
   } else {
    $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
    $path = 'data/media/video/' . $path;
    $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
    $width = $e->getAttribute('width') . 'px';
    $height = $e->getAttribute('height') . 'px';
    $a = $doc->createElement('a', '');
    $a->setAttribute('href', $path);
    $a->setAttribute('style', "display:block;width:$width;height:$height;");
    $a->setAttribute('class', 'player');
    $e->parentNode->replaceChild($a, $e);
    $this->slideContainsVideo = true;
   }
  }

  $html = trim($doc->saveHTML());

  $html = explode('<body>', $html);
  $html = explode('</body>', $html[1]);
  return $html[0];
 }

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

纵山崖 2024-09-22 18:58:15

你的“黑客”没有意义。

您正在将 Windows-1250 HTML 文件转换为 UTF-8，然后在前面添加。这行不通。 HTML 文件的 DOM 扩展：

采用元 http-equiv 中为“内容类型”指定的字符集。
否则假设 ISO-8859-1

我建议您从 Windows-1250 转换为 ISO-8859-1 并且不添加任何内容。

编辑这个建议不是很好，因为 Windows-1250 包含 ISO-8859-1 中没有的字符。由于您处理的片段没有内容类型的 meta 元素，因此您可以添加自己的元素来强制解释为 UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

给出：

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

Takes the charset specified in a meta http-equiv for "content-type".
Otherwise assumes ISO-8859-1

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT The suggestion is not very good because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"

回复收藏 0 原文

顾北清歌寒 2024-09-22 18:58:15

两种解决方案。

您可以将编码设置为标头：

<?php header("Content-Type", "text/html; charset=utf-8"); ?>

或者您可以将其设置为 META 标记：

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

编辑：如果这两个设置正确，请执行以下操作：

创建一个小页面，其中包含其中包含 UTF-8 字符。
使用您已有的相同方法编写页面。
使用 Fiddler 或 Wireshark 检查在 DEV 和 PROD 环境中传输的原始字节。您还可以使用 Fiddler/Wireshark 仔细检查标头。

如果您确信正在发送正确的标头，那么找到错误的最佳机会就是开始查看原始字节。发送到相同浏览器的相同字节将产生相同的结果，因此您需要开始寻找它们不相同的原因。 Fiddler/Wireshark 将对此提供帮助。

Two solutions.

You can either set the encoding as a header:

<?php header("Content-Type", "text/html; charset=utf-8"); ?>

Or your can set it as a META tag:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

EDIT: in the event that both of these are set correctly, do the following:

Create a small page that has a UTF-8 character in it.
Write the page in the same method that you already have.
Use Fiddler or Wireshark to examine the raw bytes being transferred in your DEV and PROD environments. You can also double check the headers using Fiddler/Wireshark.

If you are confident that the correct header is being sent, then your best chance of finding the error is to start looking at raw bytes. Identical bytes sent to an identical browser will yield the same result, so you need to start looking for why they are not identical. Fiddler/Wireshark will help with that.

回复收藏 0 原文