Python libxml2: parsing XML that contains Chinese characters
I ran into an encoding problem when using libxml2 in Python to parse Chinese characters:

    # coding=utf8
    import libxml2

    def output(data):
        doc = libxml2.parseMemory(data, len(data))
        ctxt = doc.xpathNewContext()
        res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
        print res_rslt[0]

    data = '''<r><e RoleID="3247" Name="中文"></e></r>'''
    output(data)

The output is

    Name="&#x4E2D;&#x6587;"

while I'm expecting

    Name="中文"

How can I get that?
Comments (2)
With lxml, things are easier and they work. It is a Pythonic binding for the libxml2 library and works wonderfully. And yes, XPath is also supported. The documentation is here. As for your program, have a look at this:
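A minimal sketch of the lxml approach, assuming Python 3 and an installed lxml. It is an illustration rather than the answerer's original snippet, and the helper name `name_attribute` is hypothetical. An attribute XPath in lxml returns the attribute values as strings, so no character references appear:

```python
# Sketch only: assumes Python 3 with lxml installed.
from lxml import etree


def name_attribute(data):
    """Return the value of the first /r/e/@Name attribute (hypothetical helper)."""
    root = etree.fromstring(data)
    # Selecting an attribute yields a list of string results, already decoded.
    result = root.xpath("/r/e/@Name")
    return str(result[0])


# UTF-8 bytes, as they would arrive from a file or socket.
data = '<r><e RoleID="3247" Name="中文"></e></r>'.encode("utf-8")
name = name_attribute(data)
print(name)
```

The function returns the attribute value only, not the `Name="..."` pair; building that string back is a matter of formatting.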
My answer to these sorts of things always seems to be "use Beautiful Soup". And I always get upvoted for it, too (which shows, I think, that others agree with me that it's good).
The thing is that libxml2 is converting those characters into the proper XML entities (numeric character references), which for XML is correct. Beautiful Soup has no such notion of needing to be correct, so it just gives you what you want.
(Note that in this case using either u'...' or '...' will work; I just put it as a unicode because it feels better that way. Whatever you do, Beautiful Soup gives you Unicode.)
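The distinction drawn here, between the serialized form of an attribute and its actual value, can be seen with the standard library alone. The sketch below assumes Python 3 and uses `xml.etree.ElementTree` as a stand-in, so it is neither libxml2 nor Beautiful Soup, and the exact form of the character references may differ between libraries:

```python
# Sketch only: standard-library stand-in, not libxml2 or Beautiful Soup.
import xml.etree.ElementTree as ET

data = '<r><e RoleID="3247" Name="中文"></e></r>'
root = ET.fromstring(data)

# Serializing to ASCII bytes (the default) escapes non-ASCII characters
# as numeric character references, which is still valid XML.
serialized = ET.tostring(root)

# Reading the attribute value gives the decoded text itself.
value = root.find("e").get("Name")

# Serializing to a Unicode string keeps the characters as they are.
as_text = ET.tostring(root, encoding="unicode")

print(serialized)
print(value)
print(as_text)
```

Both serializations describe the same document; the character references only appear when the node is written out in an encoding that cannot hold the characters directly.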