Parsing XML containing Chinese characters with Python libxml2

Posted on 2024-10-03 00:26:18

I ran into an encoding problem when using libxml2 in Python to parse Chinese characters:

# coding=utf8
import libxml2

def output(data):
  doc = libxml2.parseMemory(data, len(data))
  ctxt = doc.xpathNewContext()
  res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
  print res_rslt[0]

data =  '''<r><e RoleID="3247" Name="中文"></e></r>'''

output(data)

The output is

Name="&#x4E2D;&#x6587;"

while I'm expecting

Name="中文"

How can I get that?

故人如初 2024-10-10 00:26:18

With lxml, things are easier and they just work. It is a Pythonic binding for the libxml2 library and works wonderfully.

>>> from lxml import etree
>>> x = etree.fromstring('''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> name = x[0].get('Name')
>>> print name
中文

And yes, XPath is also supported. The documentation is here.
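
As a minimal sketch of the XPath support mentioned above, the same attribute lookup can be written with lxml's `xpath()` method. This version assumes Python 3 (unlike the Python 2 snippets in this thread) and that lxml is installed; the variable names are illustrative only.

```python
from lxml import etree

# Same sample document as in the question, passed to the parser as UTF-8 bytes.
data = '<r><e RoleID="3247" Name="中文"></e></r>'.encode("utf-8")
root = etree.fromstring(data)

# An attribute XPath returns a list of string values, not serialized nodes,
# so no character references appear in the result.
names = root.xpath("/r/e/@Name")
print(names[0])
```

Because the result is the attribute's value rather than a serialized node, there is no escaping step to work around.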

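As for your program, have a look at this version, which declares the encoding in the XML prolog, passes UTF-8 encoded bytes to parseDoc, and returns the node instead of printing it inside the function: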

# -*- coding: utf-8 -*-

import libxml2

def output(data):
  doc = libxml2.parseDoc(data)
  ctxt = doc.xpathNewContext()
  res_rslt = ctxt.xpathEval("/r/e/attribute::Name")
  return res_rslt[0]

data =  u'''<?xml version="1.0" encoding="UTF-8"?><r><e RoleID="3247" Name="中文"></e></r>'''.encode("UTF-8")

print output(data)

原野 2024-10-10 00:26:18

My answer to these sorts of things always seems to be "use Beautiful Soup". And I always get upvoted for it, too (which shows, I think, that others agree with me that it's good).

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(u'''<r><e RoleID="3247" Name="中文"></e></r>''')
>>> print soup.r.e['name']
中文

The thing is that libxml2 is serializing those characters as numeric character references (&#x4E2D;&#x6587;), which is perfectly valid XML. Beautiful Soup feels no such need to be correct, so it just gives you what you want.

(Note that in this case either u'...' or '...' will work; I just made it a unicode string because it feels better that way. Whatever you do, Beautiful Soup gives you Unicode.)
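
To illustrate the point about correctness, here is a small sketch that uses the standard library's `xml.etree.ElementTree` instead of libxml2. That substitution is an assumption made only so the example runs on Python 3 without extra packages. It suggests that the character-reference form and the literal form carry the same attribute value, and that which one you see depends on the output encoding chosen when serializing.

```python
import xml.etree.ElementTree as ET

# The escaped form printed by the question's program, and the literal form.
escaped = ET.fromstring('<r><e RoleID="3247" Name="&#x4E2D;&#x6587;"></e></r>')
literal = ET.fromstring('<r><e RoleID="3247" Name="中文"></e></r>')

# Both documents parse to the same attribute value.
escaped_name = escaped[0].get("Name")
literal_name = literal[0].get("Name")
same_value = escaped_name == literal_name

# Serializing to ASCII bytes forces character references;
# serializing to a Unicode string keeps the characters as they are.
as_ascii = ET.tostring(literal, encoding="us-ascii")
as_text = ET.tostring(literal, encoding="unicode")

print(same_value)
print(as_ascii)
print(as_text)
```

ElementTree writes decimal references where libxml2 writes hexadecimal ones, but both spell the same two characters.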
