How do I convert HTML character numbers to plain characters in PHP?
I have some HTML data (over which I have no control; I can only read it) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" characters are stored as HTML character numbers (æ = &#230;). I need to convert them to the corresponding actual characters in PHP (or JavaScript, but I guess PHP is better here...). It seems that html_entity_decode() only handles the "other" kind of entities, where æ = &aelig;. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart...
So, any ideas? ;)
Cheers,
Christofer
Comments (4)
The number in an entity such as &#230; refers to the Unicode value of that character.
So you could use a regex that matches the digits between "&#" and ";" to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode equivalent character.
Then simply replace each regex match with the character.
Edit: Actually, it looks like you can use this:
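The snippets from this answer are not preserved here. As a minimal sketch of the regex approach it describes, assuming PHP 7.2+ with the mbstring extension (for mb_chr()) and UTF-8 output, something like the following could work. The helper name decode_decimal_entities() is hypothetical, and only decimal entities are handled:

```php
<?php
// Hypothetical helper (not from the original answer): find decimal
// numeric entities such as &#230; with a regex and replace each one
// with the UTF-8 character for that code point.
function decode_decimal_entities(string $html): string
{
    $decoded = preg_replace_callback(
        '/&#(\d+);/',
        static function (array $m): string {
            $char = mb_chr((int) $m[1], 'UTF-8');
            // mb_chr() returns false for invalid code points;
            // in that case leave the entity untouched.
            return $char === false ? $m[0] : $char;
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error.
    return $decoded ?? $html;
}

echo decode_decimal_entities('bl&#229;b&#230;r'), "\n"; // blåbær
```

Named entities such as &aelig; pass through unchanged, since the pattern only matches the numeric form.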
I think html_entity_decode() should work just fine. What happens when you try:
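The snippet this answer asked about is missing here. A quick check along these lines, assuming the script and its output are UTF-8, would show whether html_entity_decode() treats the numeric and the named form the same way:

```php
<?php
// Decode the numeric and the named form of the same character,
// passing the target charset explicitly.
$numeric = html_entity_decode('&#230;', ENT_QUOTES, 'UTF-8');
$named   = html_entity_decode('&aelig;', ENT_QUOTES, 'UTF-8');

var_dump($numeric === $named); // true if both decode to the same character
echo $numeric, "\n";
```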
On the PHP manual page for html_entity_decode(), there is code for decoding numeric entities in versions of PHP prior to 4.3.0. As someone noted in the comments there, you should probably replace chr() with unichr() to deal with non-ASCII characters (unichr() is not a PHP built-in, so it would have to be defined separately).
However, it looks like html_entity_decode() really should deal with numeric as well as literal entities. Are you specifying an appropriate charset (e.g., UTF-8)?
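To see why the charset question matters, here is a small sketch that decodes the same entity into two target charsets and prints the resulting bytes. The choice of ISO-8859-1 as the second charset is only for illustration:

```php
<?php
// Same entity, two target charsets: the decoded bytes differ.
$utf8   = html_entity_decode('&#230;', ENT_QUOTES, 'UTF-8');
$latin1 = html_entity_decode('&#230;', ENT_QUOTES, 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a6 (two-byte UTF-8 sequence for æ)
echo bin2hex($latin1), "\n"; // e6   (single Latin-1 byte for æ)
```

If the page is served as UTF-8 but the decode targets a single-byte charset (or the other way round), the character will show up garbled even though the decoding itself succeeded.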
If you haven't got the luxury of having multibyte string functions installed, you can use something like this:
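The code from this answer is not preserved here. As a rough sketch of what such a fallback could look like using only core functions, assuming UTF-8 output is wanted, something like this might do. Both function names are hypothetical, and only decimal entities are handled:

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes
// without relying on the mbstring extension.
function codepoint_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return '';
    }
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace decimal numeric entities with UTF-8 text.
function decode_numeric_entities_no_mb(string $html): string
{
    $decoded = preg_replace_callback(
        '/&#(\d+);/',
        static function (array $m): string {
            $char = codepoint_to_utf8((int) $m[1]);
            // Leave the entity as-is when the number is not a valid code point.
            return $char === '' ? $m[0] : $char;
        },
        $html
    );

    return $decoded ?? $html;
}

echo decode_numeric_entities_no_mb('sm&#248;rbr&#248;d'), "\n"; // smørbrød
```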