How do I remove illegal characters from an XML file?
I am using the PHP SimpleXML way of working with XML files on my server. I only need to read the contents of the XML (I have no need to modify it) so I stuck to the simple and easy to use SimpleXML. But SimpleXML is having problems reading a certain XML file because it has some very strange characters. I get the following errors:
Warning: simplexml_load_file() [function.simplexml-load-file]: data/data.xml:348: parser error : PCDATA invalid Char value 3 in C:\xampp\htdocs\VMP\xintel\analyzer.php on line 54
Warning: simplexml_load_file() [function.simplexml-load-file]: Jardin al fte. Hall de recepcion, amplio living comedor. ocina comedor diario c in C:\xampp\htdocs\VMP\xintel\analyzer.php on line 54
I have no control over what goes into the XML file, so I can't stop these characters from being added to it, and I don't know how to solve this issue. The file is supposed to be encoded in UTF-8, so I tried things like converting from UTF-8 to ISO-8859-1 and the reverse, but that made no difference.
Can somebody help me out? Should I try to change the encoding? Should I try to remove those characters? Anything?
Edit: The strange characters are all box-drawing characters (see: http://en.wikipedia.org/wiki/Box-drawing_characters)
Comments (3)
I have an app that receives XML from untrusted sources, many of which send me unencoded ampersands. To solve the problem, I have an intermediate filter that does a single linear pass and gets rid of / encodes characters where necessary. I don't know if that is possible for you but I think it's a pretty reasonable solution.
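A minimal sketch of such a filter in PHP, assuming the file really is UTF-8 and that simply dropping the offending characters is acceptable. The helper name strip_invalid_xml_chars and the sample string are made up for illustration; the character ranges are the ones XML 1.0 permits.

```php
<?php
// Hypothetical helper (not from the answer above): drops every code point
// that XML 1.0 does not allow, such as the 0x03 control character reported
// in the parser warning.
function strip_invalid_xml_chars(string $xml): string
{
    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
    // #x10000-#x10FFFF. The /u modifier makes PCRE read the input as UTF-8;
    // preg_replace() returns null if the input is not valid UTF-8.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $xml
    );
    if ($clean === null) {
        throw new RuntimeException('Input is not valid UTF-8');
    }
    return $clean;
}

// Made-up sample containing the control character 0x03.
$raw = "<ad><text>Hall de recepcion\x03 amplio living</text></ad>";

// Clean first, then parse the string instead of the file.
$doc = simplexml_load_string(strip_invalid_xml_chars($raw));
echo $doc->text, "\n";
```

For a file on disk, the same idea would be to read it with file_get_contents(), pass the result through the filter, and hand it to simplexml_load_string() instead of calling simplexml_load_file() directly. Note that this only removes characters; it does not encode stray ampersands.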
Maybe you could pass the input through Tidy to make it well-formed. That would be one simple pre-processing step before you feed the file to SimpleXML.
For example, tidy::repairFile looks promising.
Normally, all characters in an XML file are interpreted by the parser unless they are inside a CDATA section.
If that is not the case, your XML is invalid.