PHP SimpleXML returns values with strange characters in place of hyphens and apostrophes
I have looked around and can't seem to find a solution so here it is.
I have the following code:
$file = "adhddrugs.xml";
$xmlstr = simplexml_load_file($file);
echo $xmlstr->report_description;
This is the simple version, but even with this, any hyphens or apostrophes are turned into: â€™ (a with circumflex, euro sign, trademark sign).
Things I have tried are:
echo (string)$xmlstr->report_description; /* did not work */
echo addslashes($xmlstr->report_description); /* yes, I know this doesn't work with hyphens; I was mainly trying to see if I could escape the apostrophes */
echo addslashes((string)$xmlstr->report_description); /* did not work */
I also tried htmlspecialchars (again, I know it does not work with hyphens), htmlentities, and a few other tricks.
The situation is that I am getting the XML files from a feed, so I cannot change them, but they are pretty standard. The text with the hyphens etc. is wrapped in a CDATA section and the encoding is UTF-8. If I check the source, the hyphens and apostrophes are shown correctly there.
Now just to see if the encoding was off or mislabeled or something else weird, I tried to view the raw XML file and sure enough it is displayed correctly.
I am sure that in my rush to find the answer I have overlooked something simple, and since this is really the first time I have ever used SimpleXML, I am probably missing a very simple solution. Just don't dock me for it; I really did try to find the answer on my own.
Thanks again.
Comments (4)
This is caused by incorrect charset guessing (and possibly recoding).
If a text contains a "curly apostrophe" (right single quotation mark, U+2019), saving it in UTF-8 encoding results in the bytes 0xE2 0x80 0x99. If the same file is then read again on the assumption that its charset is windows-1252, the byte stream of the apostrophe character (0xE2 0x80 0x99) is interpreted as the characters â€™ (small "a" with circumflex, euro sign, trademark sign). If this incorrectly interpreted text is then saved as UTF-8 again, the original character results in the byte stream 0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
Summary: Your original data is UTF-8, and some part of your code that reads the data assumes it is windows-1252 (or ISO-8859-1, which in practice is usually treated as windows-1252). A probable reason for this charset assumption is that the default charset for HTTP is ISO-8859-1. 'When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.' Source: RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1
PS. This is a very common problem. Just do a Google or Bing search with the query
doesnâ€™t -doesn't
and you'll see many pages with this same encoding error.
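The byte sequences described in this answer can be reproduced in a few lines of PHP. This is only a sketch to illustrate the mechanism, not a fix for the feed itself; it assumes the mbstring extension is available, and the variable names are made up for the illustration.

```php
<?php
// Sketch: reproduce the mojibake byte sequences (assumes the mbstring extension).

// U+2019 RIGHT SINGLE QUOTATION MARK, written out as its three UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// Wrong guess: treat those three bytes as Windows-1252 text and re-encode as UTF-8.
// Each byte becomes its own character: 0xE2 -> â, 0x80 -> €, 0x99 -> ™.
$garbled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($apostrophe)), "\n"; // E28099
echo strtoupper(bin2hex($garbled)), "\n";    // C3A2E282ACE284A2

// Converting back the other way recovers the original three bytes,
// because every character involved exists in Windows-1252.
$restored = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
var_dump($restored === $apostrophe);
```

If the garbled text shows up only in the browser while the raw bytes are still E2 80 99, the data itself is intact and only the declared charset of the page is wrong.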
Do you know the document's character set?
You could do
header('Content-Type: text/html; charset=utf-8');
before any content is printed, if you haven't already.
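A minimal sketch of how that header could be combined with the SimpleXML code from the question. The inline XML string and the function name describe_report are hypothetical stand-ins (the real data comes from the feed file), and the sketch assumes the simplexml extension is enabled.

```php
<?php
// Hypothetical stand-in for the feed document; the description holds a
// curly apostrophe (U+2019) inside a CDATA section, encoded as UTF-8.
$xmlSource = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<report><report_description><![CDATA[It doesn' . "\u{2019}" . 't work]]>'
    . '</report_description></report>';

// Hypothetical helper: parse the XML and return the description as HTML-safe UTF-8.
function describe_report(string $xmlSource): string
{
    $xml = simplexml_load_string($xmlSource);
    if ($xml === false) {
        throw new RuntimeException('Could not parse the XML document');
    }
    // The string cast returns the CDATA text; htmlspecialchars only escapes
    // HTML metacharacters and leaves the UTF-8 apostrophe bytes untouched.
    return htmlspecialchars((string) $xml->report_description, ENT_QUOTES, 'UTF-8');
}

// Declare the charset before any output so the browser does not have to guess.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo describe_report($xmlSource), "\n";
```

The header call has to come before the first echo; once output has started, PHP can no longer change the response headers.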
Make sure you have set up SimpleXML to use UTF-8 too.
Be sure that all the entities are encoded using hex notation, not HTML entities.
Also maybe:
will help.
This is a symptom of declaring an incorrect character set in the <head> section of your page (or of not declaring one at all and falling back to a default character set without accents and special characters).
This does the trick for Latin languages.
For total newbies: HTML pages have a basic layout with a <head> section, which tells the browser some basic things about the page and preloads some scripts that the page will use to achieve its functionality.
If the <head> section is omitted, the browser will use defaults (it takes some things for granted, like using the North American character set, which does NOT include many accented letters, so they show up as "weird characters").