How to convert HTML character references (&#1507;) to regular UTF-8?
I have some Hebrew websites that contain character references like: &#1504;&#1493;&#1507; (נוף)
I can only view these letters if I save the file as .html and view it in UTF-8 encoding.
If I try to open it as a regular text file, UTF-8 encoding does not show the proper output.
I noticed that if I open a text editor and write Hebrew in UTF-8, each character takes two bytes, not 4 bytes like in this example (&#1493;).
Any ideas whether this is UTF-16 or some other kind of UTF representation of letters?
How can I convert it to normal letters, if possible?
Using the latest PHP version.
Comments (2)
Those are character references that refer to characters in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation.
You can use html_entity_decode, which decodes such character references as well as the entity references for entities defined for HTML 4, so other references like &lt;, &gt;, and &amp; will also get decoded. If you want to decode only the numeric character references, you can replace them yourself with a callback that converts each code point to UTF-8.
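A minimal sketch of both options, assuming the input is a UTF-8 string, PHP 7.2 or later, and the mbstring extension. The helper name `decode_numeric_refs` is hypothetical and is not taken from the original answer; the sample string reuses the references from the question.

```php
<?php
// Hypothetical helper: decodes only numeric character references
// (decimal &#n; or hexadecimal &#xn;) and leaves named entities untouched.
function decode_numeric_refs(string $text): string
{
    // The digit limits keep every matched value well inside the integer range.
    $result = preg_replace_callback(
        '/&#(x[0-9a-f]{1,6}|[0-9]{1,7});/i',
        function (array $m): string {
            $ref = $m[1];
            // A leading "x" or "X" marks hexadecimal notation.
            $codepoint = ($ref[0] === 'x' || $ref[0] === 'X')
                ? hexdec(substr($ref, 1))
                : (int) $ref;
            // mb_chr() returns false for values that are not valid code points;
            // in that case the reference is left as it was written.
            $char = mb_chr($codepoint, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $text
    );
    return $result ?? $text;
}

$input = '&#1504;&#1493;&#1507; &lt;b&gt;';

// Decodes numeric references and named entities alike.
$all = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// Decodes the numeric references only.
$numericOnly = decode_numeric_refs($input);
```

With this input, `$all` holds the three Hebrew letters followed by `<b>`, while `$numericOnly` holds the same letters but keeps `&lt;b&gt;` as written.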
As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors ‘silently’ agreed on a mapping of character references that differs from the specification and is largely undocumented.
There seem to be some character references that would normally be mapped onto the Latin 1 supplement but that are actually mapped onto different characters. This is because the mapping follows Windows-1252 rather than ISO 8859-1, on which the Unicode character set is built. Jukka Korpela wrote an extensive article on this topic.
A decoder for numeric character references can be extended to handle this quirk by treating values in the range 128 to 159 as Windows-1252 bytes.
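One way such an extension might look, as a sketch rather than the original answer's code: values from 0x80 to 0x9F are passed through mbstring as Windows-1252 bytes, and everything else is treated as a plain code point. The function name `decode_numeric_refs_cp1252` is hypothetical, and the sketch assumes PHP 7.2 or later with mbstring. How bytes that Windows-1252 leaves undefined (such as 0x81) come out depends on mbstring and is not handled specially here.

```php
<?php
// Hypothetical helper: numeric references in the 0x80-0x9F range are
// interpreted as Windows-1252 bytes instead of C1 control characters.
function decode_numeric_refs_cp1252(string $text): string
{
    $result = preg_replace_callback(
        '/&#(x[0-9a-f]{1,6}|[0-9]{1,7});/i',
        function (array $m): string {
            $ref = $m[1];
            $codepoint = ($ref[0] === 'x' || $ref[0] === 'X')
                ? hexdec(substr($ref, 1))
                : (int) $ref;

            if ($codepoint >= 0x80 && $codepoint <= 0x9F) {
                // Reinterpret the number as a single Windows-1252 byte.
                $char = mb_convert_encoding(chr($codepoint), 'UTF-8', 'Windows-1252');
            } else {
                $char = mb_chr($codepoint, 'UTF-8');
            }
            // Leave the reference untouched if no character could be produced.
            return ($char === false || $char === '') ? $m[0] : $char;
        },
        $text
    );
    return $result ?? $text;
}

// &#128; is the euro sign and &#146; a right single quotation mark
// when read through Windows-1252.
$quirky = decode_numeric_refs_cp1252('&#128;5 it&#146;s');
```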
If anonymous functions are not available (they were introduced with PHP 5.3.0), you could also use create_function, although it was deprecated in PHP 7.2.0 and removed in PHP 8.0.0. Another variant of the function can try to comply with the behavior of HTML 5.
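The following sketch shows what an HTML 5 oriented variant could look like. It is not the original answer's function. It assumes the HTML 5 parsing rules for numeric references as understood here: zero, surrogates, and values above U+10FFFF become U+FFFD, and the 0x80 to 0x9F range is remapped following the Windows-1252 layout. Both function names are hypothetical, and PHP 7.2 or later with mbstring is assumed.

```php
<?php
// Hypothetical helper: turns one numeric reference value into UTF-8.
function html5_codepoint_to_utf8(int $codepoint): string
{
    // Replacement code points for the 0x80-0x9F range (Windows-1252 layout).
    static $c1 = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    // NUL, surrogates, and out-of-range values become U+FFFD.
    if ($codepoint === 0 || $codepoint > 0x10FFFF
        || ($codepoint >= 0xD800 && $codepoint <= 0xDFFF)) {
        return "\u{FFFD}";
    }
    if (isset($c1[$codepoint])) {
        $codepoint = $c1[$codepoint];
    }
    $char = mb_chr($codepoint, 'UTF-8');
    return $char === false ? "\u{FFFD}" : $char;
}

// Hypothetical helper: applies the conversion to every numeric reference.
function decode_numeric_refs_html5(string $text): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9a-f]+)|([0-9]+));/i',
        function (array $m): string {
            $isHex  = $m[1] !== '';
            $digits = ltrim($isHex ? $m[1] : $m[2], '0');
            // More than seven significant digits is beyond U+10FFFF in either base.
            if (strlen($digits) > 7) {
                return "\u{FFFD}";
            }
            $codepoint = $isHex ? hexdec($digits) : (int) $digits;
            return html5_codepoint_to_utf8($codepoint);
        },
        $text
    );
    return $result ?? $text;
}
```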
I’ve also noticed that in PHP 5.4.0 the html_entity_decode function gained another flag named ENT_HTML5 for HTML 5 behavior.
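The ENT_HTML5 flag mentioned above can be tried with a short call like the one below. This is only a sketch: the sample string is made up for illustration, and it assumes PHP 5.4.0 or later.

```php
<?php
// Made-up sample mixing a named entity, the &apos; entity, and a numeric reference.
$input = 'caf&eacute; &apos;quoted&apos; &#1507;';

// ENT_QUOTES decodes both quote styles; ENT_HTML5 selects the HTML 5 entity table.
$decoded = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

Here `$decoded` contains the accented é, plain apostrophes around the word, and the Hebrew letter ף.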
Those are XML character references. You want to decode them using html_entity_decode(). For more information, you can search Google for the entity in question.
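To tie this back to the byte counts raised in the question, a small sketch (assuming the mbstring extension for `mb_strlen`) compares the size of a reference with the size of the decoded letter. The variable names are chosen here for illustration.

```php
<?php
// The reference from the question: seven ASCII characters in the source text.
$reference = '&#1493;';

// After decoding, the same letter is a single UTF-8 character.
$decoded = html_entity_decode($reference, ENT_QUOTES, 'UTF-8');

$referenceBytes = strlen($reference);          // bytes used by the reference text
$decodedBytes   = strlen($decoded);            // bytes used by the UTF-8 letter
$decodedChars   = mb_strlen($decoded, 'UTF-8'); // characters after decoding
```

The reference is plain ASCII text that names a code point, so it is neither UTF-16 nor another UTF encoding. Once decoded, the letter takes two bytes in UTF-8, which matches what a text editor produces.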