Expat-based XML parsing script does not work on Linux, but works on Windows

Posted on 2024-10-18 22:17:04

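I'm writing a set of tools in Python to extract data from XML files generated by a traffic simulation program. As the generated files can be quite big, I use xml.parsers.expat to parse them.

The problem is that when I run my script on my Windows XP machine at work it works just fine, but at home on Ubuntu 10.10, with the same file, I get the following error: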
ExpatError: not well-formed (invalid token): line 1, column 0

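The file was originally encoded in UTF-8 while the encoding declared in the tag was ascii, so I tried changing the declaration to utf-8 (or UTF8, or utf8), without success. As there was no BOM, I tried writing one, still without success. I also tried replacing Windows newlines (CR/LF) with Unix newlines (LF). No success either.

Also, the Python version at work is 2.7.1 and on my Ubuntu machine it is 2.6.6, but I don't think my problem is related to this: a few weeks ago I upgraded Python on my work computer from 2.6 to 2.7 without any problems.

As I'm not an expert here, I've run out of ideas. Any hints?

Edit: After further investigation (I now have a headache; I hate Unicode-related trouble), it looks like the problem was solved by correctly setting the system environment variables LANG, LC_ALL and LANGUAGE to (in my case) "fr_FR.utf-8". I don't understand why they weren't set in the first place, nor why it works now...

Thanks for your help!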

辞旧 2024-10-25 22:17:04

Excerpts from the documentation:

xml.parsers.expat.XML_ERROR_INVALID_TOKEN
Raised when an input byte could not properly be assigned to a character; for example, a NUL byte (value 0) in a UTF-8 input stream.

ExpatError.lineno
Line number on which the error was detected. The first line is numbered 1.

ExpatError.offset
Character offset into the line where the error occurred. The first column is numbered 0.

The above tends to indicate that you have a problem with the very first byte in your file.
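As a rough illustration of how those two attributes read, the sketch below feeds expat a made-up document whose first byte cannot be decoded and reports where the parser stopped. The byte 0xFF is an arbitrary choice (it is never valid in UTF-8), and the helper name `first_error_position` is hypothetical.

```python
from xml.parsers import expat


def first_error_position(data):
    """Return (lineno, offset) of the first expat error in data, or None if it parses."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        return err.lineno, err.offset
    return None


# Made-up input: a stray 0xFF byte in front of an otherwise valid document.
print(first_error_position(b"\xff<root/>"))
print(first_error_position(b"<root/>"))
```

A position of line 1, offset 0 means the parser gave up before consuming anything, which is why the first bytes of the file are the place to look.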

Start with the original file, the one that worked on Windows. Edit your question to show the results of doing this:

python -c "print repr(open('win_ok_file.xml', 'rb').read(200))"

which will show unambiguously what is in the first 200 bytes in your file.
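Once those bytes are on screen, a few checks can be automated. The following is only a sketch: `describe_head` and the sample bytes are made up for illustration, and the checks cover just the suspects raised in this thread (a BOM, NUL bytes, or anything other than markup at the start).

```python
UTF8_BOM = b"\xef\xbb\xbf"


def describe_head(head):
    """Summarise the first bytes of a file (head must be a bytes object)."""
    # Ignore a UTF-8 BOM when checking what the document itself starts with.
    body = head[len(UTF8_BOM):] if head.startswith(UTF8_BOM) else head
    return {
        "utf8_bom": head.startswith(UTF8_BOM),
        "utf16_bom": head[:2] in (b"\xff\xfe", b"\xfe\xff"),
        "nul_bytes": b"\x00" in head,
        "starts_with_markup": body.startswith(b"<"),
    }


# Made-up sample standing in for open('win_ok_file.xml', 'rb').read(200)
sample = b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?>\n<root/>'
print(describe_head(sample))
```

NUL bytes or a UTF-16 BOM would suggest the file is not UTF-8 at all, while leading whitespace or other stray bytes before the first `<` would explain an error at the very start of the input.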

Also show us a cut-down version of your code that you have checked will work on Windows to get past the initial error, but reproduces the problem on Linux.

Some assertions, for what they are worth:

"The file was originally encoded in utf-8 and the encoding declared in the tag was ascii" ... If the encoding in the XML declaration is "ascii" but there are non-ASCII characters in the file, complying parsers should raise an exception. Are you sure of what you report?

The default encoding for XML documents is UTF-8. In other words, if the encoding is not mentioned in the XML declaration, or there is no XML declaration at all, the parser is required to decode using UTF-8.

Putting a UTF-8 BOM at the start is more likely to hinder than help.

The XML standard requires parsers to accept CR as a valid byte in an XML document and then immediately pretend it didn't exist (except maybe in an element with xml:space="preserve"). Changing CR LF to LF is not a good idea.
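The UTF-8 default can be checked directly against expat. This is a minimal sketch with made-up documents and a hypothetical helper, `character_data`. It only shows how expat itself behaves: it decodes undeclared input as UTF-8 and tolerates a UTF-8 BOM, which says nothing about other tools that may touch the file.

```python
from xml.parsers import expat


def character_data(document_bytes):
    """Parse document_bytes with expat and return all character data as one string."""
    chunks = []
    parser = expat.ParserCreate()  # no encoding override: expat follows the document
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document_bytes, True)
    return "".join(chunks)


# "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
no_declaration = b"<r>\xc3\xa9</r>"
with_bom = b"\xef\xbb\xbf<r>\xc3\xa9</r>"
print(character_data(no_declaration) == u"\xe9")
print(character_data(with_bom) == u"\xe9")
```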

And some questions: How many bytes in a "quite big" file? Have you considered using iterparse() from xml.etree.cElementTree or lxml?
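For reference, an iterparse() loop over a large file might look roughly like the sketch below. The element names and the helper `count_elements` are made up, and the import uses xml.etree.ElementTree; on Python 2 the faster xml.etree.cElementTree offers the same call.

```python
import io
import xml.etree.ElementTree as ET  # xml.etree.cElementTree on Python 2


def count_elements(source, tag):
    """Count elements named tag in source, emptying each element once it is complete."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # drop children and text that have already been processed
    return count


# Made-up data standing in for a large simulation output file.
sample = io.BytesIO(b"<run><vehicle id='1'/><vehicle id='2'/><lane/></run>")
print(count_elements(sample, "vehicle"))
```

Calling clear() only empties each element; the emptied shells stay attached to the root, so very large files may need the root's children removed as well.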

刘备忘录 2024-10-25 22:17:04

I had the same problem. Instead of passing the file name straight to the parser, like this:

document = xmltodict.parse("myfile.xml") # Wrong: parses the literal string "myfile.xml", not the file's contents

I opened the XML document first and parsed what I read from it, like this:

import xmltodict

with open("myfile.xml", "r") as document_file: # Open the file in read-only mode
    original_doc = document_file.read() # Read the whole file into a string
document = xmltodict.parse(original_doc) # Parse the document string

and it worked.
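The same distinction exists in the standard library parser that xmltodict is built on: Parse() expects the document text, while ParseFile() expects an open file object. The sketch below uses a temporary file with made-up content, and the helper `parses_ok` is hypothetical.

```python
import os
import tempfile
from xml.parsers import expat


def parses_ok(feed):
    """Run feed(parser) on a fresh expat parser; return True if no ExpatError is raised."""
    parser = expat.ParserCreate()
    try:
        feed(parser)
    except expat.ExpatError:
        return False
    return True


# Write a small made-up document to a temporary file.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"<root><item/></root>")

# Passing the path: expat tries to parse the path text itself as XML.
print(parses_ok(lambda p: p.Parse(path, True)))

# Passing the contents: ParseFile reads from an open binary file object.
with open(path, "rb") as handle:
    print(parses_ok(lambda p: p.ParseFile(handle)))

os.remove(path)
```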
