Storing VT100 escape codes in an XML file
I'm writing a Python program that logs terminal interaction (similar to the script program), and I'd like to store the log in XML format.
The problem is that the terminal interaction includes VT100 escape codes. Python doesn't complain if I write the data to a file as UTF-8 encoded, e.g.:
    ...
    pid, fd = pty.fork()
    if pid == 0:
        os.execvp("bash", ("bash", "-l"))
    else:
        # Lots of TTY-related stuff here
        # see http://groups.google.com/group/comp.lang.python/msg/de40b36c6f0c53cc
        fout = codecs.open("session.xml", encoding="utf-8", mode="w")
        fout.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fout.write("<session>\n")
        ...
        r, w, e = select.select([0, fd], [], [], 1)
        for f in r:
            if f == fd:
                fout.write("<entry><![CDATA[")
                buf = os.read(fd, 1024)
                fout.write(buf)
                fout.write("]]></entry>\n")
            else:
                ....
        fout.write("</session>")
        fout.close()
This script "works" in the sense that it writes a file to disk, but the resulting file is not proper UTF-8, which causes XML parsers like etree to barf on the escape codes.
One way to deal with this is to filter out the escape codes first. But is it possible to do something like this where the escape codes are preserved and the resulting file can still be parsed by XML tools like etree?
Comments (3)
Your problem is not that the control codes aren't proper UTF-8 (they are); it's that ASCII ESC and friends are not proper XML characters, even inside a CDATA section.
The only valid XML characters in XML 1.0 which have values less than U+0020 are U+0009 (tab), U+000A (newline) and U+000D (carriage return). If you want to record things involving other codes such as escape (U+001B), then you will have to escape them in some way. There is no other option.
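The point above can be checked with a small sketch. It assumes Python 3 and the standard-library xml.etree.ElementTree parser; the helper names is_xml10_char and parses are made up for illustration and are not part of any library. The character ranges follow the Char production of XML 1.0.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch):
    """Return True if ch is allowed by the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)       # tab, newline, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(document):
    """Return True if ElementTree accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The same CDATA wrapper, once with a raw ESC and once with only legal characters.
escape_in_cdata = "<entry><![CDATA[\x1b[1mbold\x1b[0m]]></entry>"
plain_in_cdata = "<entry><![CDATA[bold\ttext]]></entry>"

print(is_xml10_char("\x1b"), parses(escape_in_cdata))
print(is_xml10_char("\t"), parses(plain_in_cdata))
```

The CDATA wrapper makes no difference here: the parser rejects the raw ESC character itself, not any markup around it.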
As Charles said, most control codes may not be included in an XML 1.0 file at all.
However, if you can live with requiring XML 1.1, you can use them there. They can't be included as raw characters, but they can be included as character references; for example, ESC (U+001B) can be written as &#27;.
Because you can't write character references in a CDATA section (they'd just be interpreted as ampersand-hash-...), you would have to lose the <![CDATA[ wrapper and manually escape &, < and > characters to their entity-reference equivalents. Note that you should do this anyway: CDATA sections do not absolve you of the responsibility for text escaping, because they will fail if the text inside includes the sequence ]]>. (Since you always have to do some escaping anyway, this makes CDATA sections pretty useless most of the time.)
XML 1.1 is more lenient about control codes, but not everything supports it, and you still can't include the NUL character (&#0;). In general it's not a good idea to include control characters in XML. You could use an ad-hoc encoding scheme to fit binary in; base-64 is popular, but not very human-readable. If it's only ever your own application that will be handling the files, alternatives might include using random characters from the Private Use Area as substitutes, or encoding them as elements (e.g. <esc color="1"/>).
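As a rough illustration of the Private Use Area idea, here is a minimal sketch in Python 3. The mapping (U+E000 plus the code point), the names PUA_BASE, encode_controls and decode_controls, and the sample string are all assumptions made for this example, not an established convention. It also assumes the bytes read from the pty have already been decoded to str and that the log itself never contains characters in U+E000..U+E01F. Carriage return is legal in XML 1.0, but parsers normalize CR LF to LF, so the sketch maps it as well to keep the round trip exact. ElementTree takes care of escaping &, < and > when serializing, so no CDATA wrapper is needed.

```python
import xml.etree.ElementTree as ET

# Hypothetical scheme: shift control characters into the Private Use Area.
PUA_BASE = 0xE000
# Tab and newline pass through unchanged. CR is deliberately not listed:
# it is legal, but XML parsers normalize "\r\n" to "\n" when reading.
PASS_THROUGH = {0x09, 0x0A}


def encode_controls(text):
    """Replace C0 control characters with Private Use Area stand-ins."""
    return "".join(
        chr(PUA_BASE + ord(ch))
        if ord(ch) < 0x20 and ord(ch) not in PASS_THROUGH
        else ch
        for ch in text
    )


def decode_controls(text):
    """Reverse encode_controls(); assumes no other PUA characters in text."""
    return "".join(
        chr(ord(ch) - PUA_BASE)
        if PUA_BASE <= ord(ch) < PUA_BASE + 0x20
        else ch
        for ch in text
    )


# Made-up terminal output containing escapes, markup characters and CR LF.
raw = "\x1b[1mbold\x1b[0m & <done> ]]>\r\n"

session = ET.Element("session")
ET.SubElement(session, "entry").text = encode_controls(raw)
xml_bytes = ET.tostring(session, encoding="utf-8")

# Any XML 1.0 parser can now read the log; decoding restores the original text.
parsed = ET.fromstring(xml_bytes)
restored = decode_controls(parsed.find("entry").text)
print(restored == raw)
```

The trade-off is that the file is only meaningful to tools that know the mapping; base-64 or dedicated elements avoid that at the cost of readability or extra parsing.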
Did you try putting your data inside a CDATA section? This should prevent the parser from trying to read the content of the tag.
http://en.wikipedia.org/wiki/CDATA