Expat-based XML parsing script does not work on Linux but works on Windows
I'm writing a set of tools in Python to extract data from XML files generated by a traffic simulation program. As the resulting files can be quite big, I use xml.parsers.expat to parse them.
The issue is that when I run my scripts at work on a Windows XP machine they work perfectly, but at home, on Ubuntu 10.10, on the very same file, I get the following error: ExpatError: not well-formed (invalid token): line 1, column 0
The file was originally encoded in utf-8 and the encoding declared in the tag was ascii, so I tried to change it to utf-8 (or UTF8 or utf8), without success. As the BOM was absent I tried to write one, still without success. I also tried to replace the Windows line breaks (CR/LF) with Unix ones (LF), again without success.
Also, the Python version at work is 2.7.1 and on my Ubuntu box it's 2.6.6, but I don't think my issue is related to that: I upgraded my work computer's Python from 2.6 to 2.7 a few weeks ago without trouble.
As I'm not an expert here, I'm running out of ideas. Any hints?
Edit:
After further investigation (I've got a headache now; I hate Unicode-related trouble), it looks like the issue was solved by properly setting the system environment variables LANG, LC_ALL and LANGUAGE to (in my case) "fr_FR.utf-8". I don't understand why they weren't set in the first place, nor why it works now...
Thanks for the help, guys!
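For anyone comparing two machines, here is a minimal sketch (Python 3, not part of my original scripts) that reports the three variables mentioned above together with the encoding Python derives from the locale. The helper name `locale_report` is made up for illustration.

```python
import locale
import os

def locale_report():
    """Collect the locale-related environment variables and Python's preferred encoding."""
    names = ("LANG", "LC_ALL", "LANGUAGE")
    # os.environ.get returns None for variables that are not set at all.
    report = {name: os.environ.get(name) for name in names}
    # Encoding Python would use by default for text files under the current locale.
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    return report

report = locale_report()
for key, value in report.items():
    print(key, "=", value)
```

Running it on both the working and the failing machine makes it easy to see whether the locale settings differ.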
Excerpts from the documentation:
xml.parsers.expat.XML_ERROR_INVALID_TOKEN
Raised when an input byte could not properly be assigned to a character; for example, a NUL byte (value 0) in a UTF-8 input stream.
ExpatError.lineno
Line number on which the error was detected. The first line is numbered 1.
ExpatError.offset
Character offset into the line where the error occurred. The first column is numbered 0.
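To make those attributes concrete, here is a small sketch (written for Python 3) that feeds expat a deliberately broken input and reads back the error message, line and offset. The input, a stray 0x01 control byte before the root element, is an invented example and not necessarily what is wrong with your file; `describe_parse_error` is a hypothetical helper name.

```python
import xml.parsers.expat as expat

def describe_parse_error(data):
    """Parse `data` (bytes) with expat; return (message, lineno, offset) on failure, else None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk of input
    except expat.ExpatError as err:
        # err.code is numeric; ErrorString turns it into the readable message.
        return expat.ErrorString(err.code), err.lineno, err.offset
    return None

# A control byte that is not a legal XML character, placed at the very start.
result = describe_parse_error(b"\x01<root/>")
print(result)
```

An error reported at line 1, offset 0 means the parser rejected the input before it got past the first byte.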
The above tends to indicate that you have a problem with the very first byte in your file.
Start with the original file, the one that worked on Windows. Edit your question to show the results of doing this:
which will show unambiguously what is in the first 200 bytes in your file.
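One way to produce such a dump is sketched below, written for Python 3 (on Python 2.x the same idea works with a `print` statement). The helper name `head_bytes` and the throwaway sample file are assumptions made so the snippet runs on its own; point it at your real file instead.

```python
import os
import tempfile

def head_bytes(path, count=200):
    """Return the first `count` raw bytes of the file at `path`."""
    with open(path, "rb") as handle:  # binary mode: no decoding, no newline translation
        return handle.read(count)

# Demonstration on a temporary file that starts with a UTF-8 BOM and uses CR LF.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbf<?xml version="1.0" encoding="ascii"?>\r\n<root/>')
first = head_bytes(tmp.name)
os.remove(tmp.name)
print(repr(first))
```

Printing the `repr` shows any BOM, NUL bytes or line-ending characters as explicit escapes rather than leaving them invisible.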
Also show us a cut-down version of your code that you have checked will work on Windows to get past the initial error, but reproduces the problem on Linux.
Some assertions, for what they are worth:
"The file was originally encoded in utf-8 and the encoding declared in the tag was ascii" ... If the encoding in the XML declaration is "ascii" but there are non-ASCII characters in the file, complying parsers should raise an exception. Are you sure of what you report?
The default encoding for XML documents is UTF-8. In other words, if the encoding is not mentioned in the XML declaration, or there is no XML declaration at all, the parser is required to decode using UTF-8. Putting a UTF-8 BOM at the start is more likely to hinder than help.
The XML standard requires parsers to accept CR as a valid byte in an XML document and then immediately pretend it didn't exist (except maybe in an element with xml:space="preserve"). Changing CR LF to LF is not a good idea.
And some questions: How many bytes in a "quite big" file? Have you considered using iterparse() from xml.etree.cElementTree or lxml?
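For reference, a minimal sketch of the iterparse() approach, written for Python 3, where the pure-Python and C implementations are both reached through xml.etree.ElementTree (the separate cElementTree module was a Python 2 era name). The element names `trips` and `trip` are invented for illustration and are not taken from your files.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Stream through `source` (a path or binary file object) and count elements named `tag`."""
    count = 0
    # "end" events fire once an element, including its children, has been fully read.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children and text to keep memory use down
    return count

sample = io.BytesIO(b"<trips><trip id='1'/><trip id='2'/><trip id='3'/></trips>")
total = count_elements(sample, "trip")
print(total)
```

Because elements are handled and cleared as they complete, memory use stays modest even when the file itself is large.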
I had the same problem. Instead of trying to parse the file directly, like this:
I parsed it indirectly, by first opening the XML document through an object, like this:
and it worked.
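The answer does not say which parser call was involved, so the following is only one possible reading of it, assuming xml.parsers.expat as in the question: open the document as a binary file object and hand that object to the parser's ParseFile method. It is written for Python 3; the helper name `element_names` and the sample document are made up.

```python
import os
import tempfile
import xml.parsers.expat as expat

def element_names(path):
    """Parse the XML file at `path` through a binary file object; return element names in order."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    with open(path, "rb") as handle:  # raw bytes, so expat does the decoding itself
        parser.ParseFile(handle)
    return names

# Throwaway UTF-8 document containing a non-ASCII element name.
xml_bytes = '<?xml version="1.0" encoding="utf-8"?><réseau><lien/></réseau>'.encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
names = element_names(tmp.name)
os.remove(tmp.name)
print(names)
```

Handing the parser raw bytes leaves encoding detection to expat, which keeps the system locale out of the picture.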