PHP SimpleXML returns values with strange characters in place of hyphens and apostrophes

Posted 2024-10-15 03:28:54 · 915 characters · 4 views · 0 comments

I have looked around and can't seem to find a solution so here it is.

I have the following code:

$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;

This is the simple version, but even with this, any hyphens or apostrophes are turned into â€™ (an "a" with a circumflex, a euro sign, and a trademark sign).

Things I have tried are:

echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes, I know this doesn't work with hyphens; I was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */

I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.

The situation is that I am getting the XML files from a feed, so I cannot change them, but they are pretty standard. The text with the hyphens etc. is wrapped in CDATA sections and the encoding is UTF-8. If I check the source, I see the hyphens and apostrophes there.

Just to see whether the encoding was off, mislabeled, or something else weird, I tried viewing the raw XML file, and sure enough it is displayed correctly.

I am sure that in my rush to find the answer I have overlooked something simple, and since this is really the first time I have ever used SimpleXML, I am probably missing a very simple solution. Just don't dock me for it; I really did try to find the answer on my own.

Thanks again.

Comments (4)

花之痕靓丽 2024-10-22 03:28:54

"This is the simple version, but even trying this any hyphens apostrophes are turned into: ^a (euro sign) trademark sign."

This is caused by incorrect charset guessing (and possibly recoding).

If a text contains a "curly apostrophe" = "Right single quotation mark" = U+2019 character, saving it in UTF-8 encoding results in the bytes 0xE2 0x80 0x99. If the same file is then read again on the assumption that its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as the characters â€™ (= small "a" with circumflex, euro sign, trademark sign). If this incorrectly interpreted text is then saved as UTF-8 again, the original single character ends up as the byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
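The byte arithmetic above can be reproduced in a few lines of PHP. This is only a sketch, and it assumes the mbstring extension is available; the variable names are made up for illustration and are not from the question.

```php
<?php
// U+2019 (right single quotation mark) written as its three UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Misread those bytes as windows-1252 and re-encode the result as UTF-8.
$mojibake = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n"; // e28099
echo bin2hex($mojibake), "\n";   // c3a2e282ace284a2

// The damage is reversible as long as no bytes were lost along the way.
$repaired = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($repaired === $apostrophe);
```

Reversing the conversion like this is a diagnostic aid rather than a fix; the cleaner remedy is to make sure every step agrees that the data is UTF-8.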

Summary: Your original data is UTF-8, and some part of the chain that reads or displays the data assumes it is windows-1252 (or ISO-8859-1, which in practice is usually treated as windows-1252). A probable reason for this charset assumption is that the default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1

PS. This is a very common problem. Just do a Google or Bing search with the query doesnâ€™t -doesn't and you'll see many pages with this same encoding error.

轻拂→两袖风尘 2024-10-22 03:28:54

Do you know the document's character set?

You could call header('Content-Type: text/html; charset=utf-8'); before any content is printed, if you haven't already.
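A minimal sketch of that suggestion follows. The inline XML is a hypothetical stand-in for the feed (the real file, adhddrugs.xml, is not shown in the question), and the sample text is invented for illustration.

```php
<?php
// Declare the response charset before any output is sent.
header('Content-Type: text/html; charset=utf-8');

// Hypothetical stand-in for the feed: UTF-8 text with U+2019 inside CDATA.
$source = '<?xml version="1.0" encoding="UTF-8"?>'
        . "<report><report_description><![CDATA[It\xE2\x80\x99s a test]]></report_description></report>";

$xml = simplexml_load_string($source);
$description = (string) $xml->report_description;

// Escape markup characters only; the UTF-8 bytes pass through untouched.
echo htmlspecialchars($description, ENT_QUOTES, 'UTF-8');
```

SimpleXML hands back UTF-8 strings, so once the browser is told the page is UTF-8 the apostrophe should display as intended without any extra conversion.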

唠甜嗑 2024-10-22 03:28:54

Make sure you have set up SimpleXML to use UTF-8 too.

Be sure that all the entities are encoded as numeric (hex) character references, not as named HTML entities.

Also maybe:

$string = html_entity_decode($string, ENT_QUOTES, "utf-8");

will help.
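For reference, this is roughly what that call does. The input string is invented for illustration; the sketch only applies if the feed text actually contains entity references, which the question does not indicate.

```php
<?php
// Named and numeric references for U+2019, as they might appear in a feed.
$encoded = 'doesn&rsquo;t and doesn&#x2019;t';

// Decode both forms into the real UTF-8 character.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

var_dump($decoded === "doesn\xE2\x80\x99t and doesn\xE2\x80\x99t");
```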

耶耶耶 2024-10-22 03:28:54

This is a symptom of declaring an incorrect character set in the <head> section of your page (or of not declaring one at all and falling back to a default character set that lacks accented and special characters).

This does the trick for Latin-script languages.

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For TOTAL NEWBIES: HTML pages have a basic layout with a HEAD section, which tells the browser some basic things about the page and can preload scripts that the page uses for its functionality.

<html>
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
 </head>
 <body>
  Hello world
 </body>
</html>

If the <head> section is omitted, the browser will use defaults (taking some things for granted, such as a North American character set, which does NOT include many accented letters; those then show up as "weird characters").
