lxml: converting an Element to an ElementTree

Posted on 2024-12-26 11:36:26

The following test code reads a file and, using lxml.html, prints the leaf nodes of the page's DOM/graph.

However, I'm also trying to figure out how to get the input from a "string". Using:

lxml.html.fromstring(s)

doesn't work, as this generates an Element as opposed to an ElementTree.

So, I'm trying to figure out how to convert an element to an ElementTree.

[my test code]

import subprocess

import lxml.html
from lxml import etree    # imported to see if it is needed
                          # to convert from element to elementtree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()

# s contains HTML, not XML text
#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')   # an ElementTree
doc1 = lxml.html.fromstring(s)          # an Element

# print the tag and path of every leaf node
for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

nt = etree.ElementTree(doc1)        # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

UPDATE 1:

(Parsing HTML instead of XML.)
Added the changes suggested by Abbas and got the following error:

    doc1 = etree.fromstring(s)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220

UPDATE 2:

Managed to get the test working. I'm not exactly sure why. If someone with py chops wants to provide an explanation, that would help future people who stumble on this.

from cStringIO import StringIO  # Python 2 only; Python 3 provides io.StringIO / io.BytesIO
from lxml.html import parse

doc1 = parse(StringIO(s))

for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)

It appears that StringIO provides a file-like object, which is what parse() needs in order to read the test HTML from a string. Perhaps this is similar to what casting provides in other languages...
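A likely explanation, sketched here for Python 3 (where `io.BytesIO` takes the place of `cStringIO`): `lxml.html.parse` expects a filename, URL, or file-like object and returns an `_ElementTree`, while `lxml.html.fromstring` takes the markup itself and returns the root element. Wrapping the string in a file-like object therefore lets `parse` read it. The sample markup below is made up for illustration.

```python
import io

import lxml.html
from lxml import etree

# Made-up sample input; in the question this would be the bytes read from the file.
html_bytes = b"<html><body><p>one</p><p>two</p></body></html>"

# parse() reads from a file-like object and returns a tree object.
tree = lxml.html.parse(io.BytesIO(html_bytes))

# fromstring() takes the markup directly and returns the root element.
root = lxml.html.fromstring(html_bytes)

print(type(tree).__name__, type(root).__name__)
print(tree.getroot().tag, root.tag)
```

So StringIO is less a cast than an adapter: it gives an in-memory string the read() interface that parse() is written against.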


Comments (3)

音栖息无 2025-01-02 11:36:26

To get the root tree from an _Element (generated with lxml.html.fromstring), you can use the getroottree method:

doc = lxml.html.fromstring(s)   # returns an _Element
tree = doc.getroottree()        # the _ElementTree that contains it
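As a minimal sketch of how this fits the leaf-node loop from the question (Python 3, with made-up sample markup), the tree returned by `getroottree()` can be used for both iteration and `getpath`:

```python
import lxml.html

# Made-up sample input standing in for the string s from the question.
html_text = "<html><body><div><p>one</p><p>two</p></div></body></html>"

root = lxml.html.fromstring(html_text)   # root element
tree = root.getroottree()                # tree that owns the root element

# Collect the path of every node that has no children.
leaf_paths = [tree.getpath(node) for node in tree.iter() if len(node) == 0]

for path in leaf_paths:
    print(path)
```

Calling `getpath` on the same tree that produced the nodes keeps the loop self-consistent.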

_蜘蛛 2025-01-02 11:36:26

The etree.fromstring method parses an XML string and returns a root element. The etree.ElementTree class is a tree wrapper around an element and as such requires an element for instantiation.

Therefore, passing the root element to the etree.ElementTree() constructor should give you what you want:

root = etree.fromstring(s)
nt = etree.ElementTree(root)
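Because the input in the question is HTML, the default XML parser rejects entities such as `&nbsp;`, which matches the error shown in UPDATE 1. A hedged sketch of the same wrapping approach with an HTML parser passed in (Python 3, made-up sample markup):

```python
from lxml import etree

# Made-up sample input containing an HTML-only entity.
html_text = "<html><body><p>a&nbsp;b</p></body></html>"

# The default XML parser does not know HTML entities such as &nbsp;.
try:
    etree.fromstring(html_text)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# Passing an HTML parser lets the same call accept HTML input.
root = etree.fromstring(html_text, etree.HTMLParser())
nt = etree.ElementTree(root)

# Use nt (not another tree) for getpath, since the nodes come from nt.
paths = [nt.getpath(node) for node in nt.iter() if len(node) == 0]
print(xml_error)
print(paths)
```

One possible reason the wrapped tree appeared not to work in the question is that its second loop calls `doc.getpath(node)` on nodes that belong to `nt`; lxml expects the element to be part of the tree whose `getpath` is called.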

月棠 2025-01-02 11:36:26

An _Element, such as the one returned by a call like:

tree = etree.HTML(result.read(), etree.HTMLParser())

can be turned into an _ElementTree like so:

tree = tree.getroottree()  # convert _Element to _ElementTree

Hope that's what you expect.
