Converting HTML character entities to "regular" letters... why does it only partially work?

Posted 2024-08-23 18:25:37

I'm using all of the below to take a field called 'code' from my database, get rid of all the HTML entities, and print it 'as usual' to the site:

   <?php $code = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $code);
   $code = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $code); 
   $code = html_entity_decode($code); ?>

However the exported code still looks like this:

progid:DXImageTransform.Microsoft.AlphaImageLoader(src=â€™img/the_image.pngâ€™);

See what's going on there? How many other things can I run on the string to turn them into darn regular characters?!

Thanks!

Jack

Comments (4)

未蓝澄海的烟 2024-08-30 18:25:37

â€™ is what you get when you read the UTF-8 encoded character ’ (RIGHT SINGLE QUOTATION MARK, U+2019) as if it were encoded as windows-1252. In other words, you have two problems: you're using the wrong encoding to read the wrong character.

HTML attribute values are supposed to be enclosed in ASCII apostrophes or quotation marks, not curly quotes. The numeric entities you're converting should be &#39; or &#x27; (apostrophe) or &#34; or &#x22; (quotation mark). Instead, you appear to have a reference to the curly quote U+2019, which can be written as &#8217;, &#x2019;, or &rsquo;.

As for the second problem, the resulting text seems to be encoded as UTF-8, but at some point it's being read as if it were windows-1252. In UTF-8, the character ’ is represented by the three-byte sequence E2 80 99, but windows-1252 converts each byte separately, to â, €, and ™. Wherever that's happening, it's not in the code you showed us.
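
The byte-level mix-up described above can be reproduced in a few lines. This is only a sketch, not code from the question: it assumes the mbstring extension is loaded, and the variable names are made up for illustration.

```php
<?php
// The UTF-8 bytes for U+2019 (RIGHT SINGLE QUOTATION MARK).
$utf8Quote = "\xE2\x80\x99";

// Simulate the misreading: treat each byte as a windows-1252 character
// and re-encode the result as UTF-8. Assumes mbstring is available.
$mojibake = mb_convert_encoding($utf8Quote, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8Quote), "\n"; // e28099
echo $mojibake, "\n";           // shows as â€™ on a UTF-8 terminal or page

// The damage is reversible as long as no byte was dropped along the way.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
```

Reversing the conversion like this is a diagnostic trick rather than a fix. The real remedy is to find the step that reads the text with the wrong encoding.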

The good news is that your preg_replace code seems to be working correctly. ;) But I think the others are right when they say you can use html_entity_decode() alone for that part.

颜漓半夏 2024-08-30 18:25:37

It could be that you are using a character encoding that differs from the one your page declares, for example ISO-8859-1 vs. UTF-8.

埋葬我深情 2024-08-30 18:25:37

chr() only ever returns a single byte, so it works for ASCII but your non-ASCII characters are getting messed up. Unless I'm misunderstanding what you're trying to do, you just need a single call to html_entity_decode() with the correct charset parameter, and you can get rid of the other two lines.
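
A minimal sketch of that single call, assuming the page is meant to be UTF-8. The sample string is a hypothetical stand-in for the 'code' field, not data from the question.

```php
<?php
// Hypothetical sample standing in for the 'code' field from the database.
$code = 'src=&#8217;img/the_image.png&#x2019; &rsquo;';

// One call handles decimal, hexadecimal, and named references
// and emits the result as UTF-8.
$decoded = html_entity_decode($code, ENT_QUOTES, 'UTF-8');

// For comparison: chr() returns exactly one byte, so 8217 wraps around to 25.
$wrapped = ord(chr(8217));

echo $decoded, "\n";
echo $wrapped, "\n";
```

The decoded string still has to be delivered as UTF-8, for instance by sending header('Content-Type: text/html; charset=utf-8') before any output. Otherwise the browser may misread the bytes again.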

深府石板幽径 2024-08-30 18:25:37

Although the name doesn’t reflect it, html_entity_decode does also convert numeric character references.

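// α (U+03B1) is encoded as the bytes 0xCE 0xB1 in UTF-8.
// The numeric character reference &#x3B1; decodes to those same bytes.
var_dump("\xCE\xB1" === html_entity_decode('&#x3B1;', ENT_COMPAT, 'UTF-8')); // bool(true)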