lxml: converting an Element to an ElementTree
The following test code reads a file and uses lxml.html to generate the leaf nodes of the DOM tree for the page.
However, I'm also trying to figure out how to take the input from a string. Using lxml.html.fromstring(s) doesn't work, because it returns an Element rather than an ElementTree.
So I'm trying to figure out how to convert an Element to an ElementTree.
[my test code]
import subprocess
import lxml.html
from lxml import etree  # trying this to see if it is needed
                        # to convert from Element to ElementTree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()
# s contains HTML, not XML text

#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)

for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

nt = etree.ElementTree(doc1)  # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag
UPDATE 1 (parsing HTML instead of XML):
Added the changes suggested by Abbas and got the following errors:
doc1 = etree.fromstring(s)
File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220
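The error above is what the strict XML parser reports when it meets an HTML-only entity such as &nbsp;. As a minimal sketch (written for Python 3, with a made-up `snippet` string standing in for the contents of the test file), the same text can be handed to the XML parser and to the HTML parser to compare the two outcomes:

```python
from lxml import etree
import lxml.html

# Hypothetical stand-in for the HTML read from the test file.
snippet = "<html><body><p>a&nbsp;b</p></body></html>"

# etree.fromstring uses the XML parser, which does not know HTML entities.
try:
    etree.fromstring(snippet)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# lxml.html.fromstring uses the HTML parser and accepts the same text.
root = lxml.html.fromstring(snippet)
tree = root.getroottree()
paragraph_path = tree.getpath(root.find(".//p"))

print(xml_error)
print(root.tag, paragraph_path)
```

The variable names here are illustrative only; the point is that the input is HTML, so it needs one of the HTML parsing entry points.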
UPDATE 2:
Managed to get the test working, though I'm not exactly sure why. If someone with Python chops wants to provide an explanation, that would help future readers who stumble on this.
from cStringIO import StringIO
from lxml.html import parse

doc1 = parse(StringIO(s))
for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)
It appears that StringIO wraps the string in a file-like object, which is what parse() needs in order to process the test HTML from a string. Perhaps this is similar to what casting provides in other languages...
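A likely explanation, offered as a sketch rather than a definitive account: lxml.html.parse expects a filename, URL, or file-like object and returns an ElementTree, whereas lxml.html.fromstring takes the text itself and returns an Element. Wrapping the string in StringIO gives parse the file-like object it wants. The example below assumes Python 3 (io.StringIO instead of cStringIO) and uses a made-up `html_text` value in place of the file contents:

```python
import io
import lxml.html

# Hypothetical stand-in for the HTML string `s` from the question.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

# parse() reads from the file-like object and returns an ElementTree.
tree = lxml.html.parse(io.StringIO(html_text))

# Leaf nodes are the elements with no children; getpath() is a tree method.
leaf_paths = [tree.getpath(node) for node in tree.iter() if len(node) == 0]

for path in leaf_paths:
    print(path)
```

No conversion of types takes place here; StringIO only presents the in-memory text through the same read interface that an open file has.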
Comments (3)
To get the root tree from an _Element (generated with lxml.html.fromstring), you can use the getroottree method.

The etree.fromstring method parses an XML string and returns a root element. The etree.ElementTree class is a tree wrapper around an element and as such requires an element for instantiation. Therefore, passing the root element to the etree.ElementTree() constructor should give you what you want.

An _Element, such as the root returned by a parsing call, can be wrapped in an _ElementTree. Hope that's what you expect.
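The snippets that accompanied these comments are not reproduced here, so the following is only a sketch of the two approaches they describe, written for Python 3 with a made-up `html_text` string and a hypothetical helper named `leaf_paths`. It obtains an ElementTree from an Element both through getroottree() and through the etree.ElementTree() constructor, and asks each tree for the paths of its own leaf nodes:

```python
import lxml.html
from lxml import etree

# Hypothetical stand-in for the HTML string from the question.
html_text = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

# fromstring() returns the root Element, not an ElementTree.
root = lxml.html.fromstring(html_text)

# Approach 1: ask the element for the tree it belongs to.
tree_a = root.getroottree()

# Approach 2: wrap the element in a new ElementTree.
tree_b = etree.ElementTree(root)


def leaf_paths(tree):
    """Return the path of every childless element, using the tree that owns it."""
    return [tree.getpath(node) for node in tree.iter() if len(node) == 0]


paths_a = leaf_paths(tree_a)
paths_b = leaf_paths(tree_b)
print(paths_a)
print(paths_b)
```

Note that getpath() is called on the same tree that is being iterated. In the question's second loop the nodes come from nt but the path is requested from doc, which may be why that loop appeared not to work.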