How to convert HTML character numbers to plain characters in PHP?

Posted 2024-09-18 10:14:44

I have some HTML data (which I cannot control, only read) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" characters are stored as HTML numeric character references (&#230; = æ). I need to convert them to the corresponding actual characters in PHP (or JavaScript, but I guess PHP is the better fit here). It seems that html_entity_decode() only handles the "other" kind of entity, the named ones, where æ = &aelig;. The only solution I've come up with so far is to build a conversion table that maps each character number to a real character, but that isn't a particularly smart approach.
So, any ideas? ;)

Cheers,
Christofer

萌︼了一个春 2024-09-25 10:14:44
&#NUMBER; refers to the Unicode value of that character.

So you could use a regex like:

/&#(\d+);/g

to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its equivalent Unicode character.

Then simply replace each regex match with that character.

Edit: Actually, it looks like you can use this:

mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
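
In PHP, the regex idea above could look roughly like the following. This is a minimal sketch, assuming PHP 7.2 or later with the mbstring extension for mb_chr(); decodeDecimalRefs is a hypothetical helper name, not something from the answer. Two caveats: the trailing g in the pattern is JavaScript syntax and is dropped here, because preg_replace_callback() already replaces every match, and the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding() is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: decode decimal references such as &#230; to UTF-8.
function decodeDecimalRefs(string $html): string
{
    return preg_replace_callback(
        '/&#(\d{1,7});/',
        function (array $m): string {
            // mb_chr() returns false for an invalid code point;
            // in that case the original reference is left untouched.
            $char = mb_chr((int) $m[1], 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $html
    ) ?? $html; // preg_replace_callback() returns null on a regex error
}

echo decodeDecimalRefs('bl&#229;b&#230;rsyltet&#248;y'), "\n"; // prints "blåbærsyltetøy"
```

Named entities such as &amp; are deliberately left alone by this sketch; only decimal numeric references are touched.
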
绝對不後悔。 2024-09-25 10:14:44

I think html_entity_decode() should work just fine. What happens when you try:

echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
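
To check this suggestion quickly, something like the sketch below can be run. It assumes a PHP version where ENT_HTML5 exists (5.4 or later); the variable names are only illustrative. Passing 'UTF-8' explicitly matters mostly on old installations, since PHP versions before 5.4 defaulted to ISO-8859-1 for this function.

```php
<?php
// ENT_QUOTES also decodes quote entities; ENT_HTML5 selects the HTML5 entity table.
$flags = ENT_QUOTES | ENT_HTML5;

// Decimal and hexadecimal numeric references.
$numeric = html_entity_decode('&#230; &#xE6;', $flags, 'UTF-8');

// Named entities.
$named = html_entity_decode('&aelig; &oslash; &aring;', $flags, 'UTF-8');

echo $numeric, "\n"; // prints "æ æ"
echo $named, "\n";   // prints "æ ø å"
```
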
南街九尾狐 2024-09-25 10:14:44

The PHP manual page for html_entity_decode() gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:

  // Quoted as-is for historical reference: the /e modifier was removed in PHP 7.0,
  // and chr() only produces single bytes (values 0-255).
  $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
  $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);

As someone noted in the comments there, you should probably replace chr() with a unichr()-style helper to deal with non-ASCII characters (PHP has no built-in unichr(); mb_chr() has been available since PHP 7.2).

However, it looks like html_entity_decode() really should handle numeric as well as named entities. Are you specifying an appropriate charset (e.g., UTF-8)?
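
For current PHP versions, the two quoted preg_replace() calls could be approximated with a single callback, as sketched below. This is not the manual's code: decodeNumericRefs is a hypothetical name, and the sketch assumes PHP 7.2 or later with mbstring so that mb_chr() can stand in for the unichr() helper mentioned above.

```php
<?php
// Hypothetical helper: decode both &#xHH; and &#NNN; references to UTF-8.
function decodeNumericRefs(string $html): string
{
    return preg_replace_callback(
        // Digit counts are capped so the parsed value always fits in an int.
        '~&#(x[0-9a-f]{1,6}|[0-9]{1,7});~i',
        function (array $m): string {
            $ref = $m[1];
            // A leading "x" or "X" marks a hexadecimal reference.
            $codePoint = ($ref[0] === 'x' || $ref[0] === 'X')
                ? (int) hexdec(substr($ref, 1))
                : (int) $ref;
            // Keep the original text if the code point is not valid.
            $char = mb_chr($codePoint, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $html
    ) ?? $html; // preg_replace_callback() returns null on a regex error
}

echo decodeNumericRefs('&#xE6; &#230; &#x20AC;'), "\n"; // prints "æ æ €"
```

If named entities also need decoding, calling html_entity_decode() with an explicit 'UTF-8' charset is likely the simpler route.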

能怎样 2024-09-25 10:14:44

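If you don't have the luxury of the multibyte string (mbstring) functions being installed, you can use something like this:

<?php

    // Sample input containing a decimal numeric character reference.
    $string = 'Here is a special char &#230;';

    // Replace each &#NNN; reference with the character it encodes.
    // create_function() was removed in PHP 8.0, so a closure is used instead.
    $decoded = preg_replace_callback('/&#([0-9]+);/', function (array $matches) {
        $codePoint = (int) $matches[1];
        // chr() only covers 0-255; leave larger references unchanged.
        return $codePoint < 256 ? decode($codePoint) : $matches[0];
    }, $string);

    echo '<p>', $string, '</p>';
    echo '<p>', $decoded, '</p>';

    // Convert a Latin-1 code point (0-255) to its UTF-8 byte sequence.
    // utf8_encode() treats its input as ISO-8859-1 and is deprecated as of PHP 8.2.
    function decode($codePoint)
    {
        return utf8_encode(chr($codePoint));
    }

?>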
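
Because utf8_encode(chr(...)) can only reach code points 0 to 255, references such as &#8364; (the euro sign) need a different conversion. One option that still avoids mbstring is to build the UTF-8 bytes by hand, as in the sketch below; codePointToUtf8 and decodeRefsWithoutMbstring are hypothetical names introduced for illustration, not part of the answer.

```php
<?php
// Hypothetical helper: encode a Unicode code point as UTF-8 by hand.
// Returns null when the value is not a valid Unicode scalar value.
function codePointToUtf8(int $cp): ?string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return null;
    }
    if ($cp < 0x80) {          // 1 byte: plain ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {         // 2 bytes
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {       // 3 bytes
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))   // 4 bytes
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode decimal references without mbstring.
function decodeRefsWithoutMbstring(string $html): string
{
    return preg_replace_callback('/&#([0-9]{1,7});/', function (array $m): string {
        // Invalid code points keep their original reference text.
        return codePointToUtf8((int) $m[1]) ?? $m[0];
    }, $html) ?? $html;
}

echo decodeRefsWithoutMbstring('Here is a special char &#230; and &#8364;'), "\n";
// prints "Here is a special char æ and €"
```
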