Python lxml: wrapping elements

Posted 2024-11-08 02:22:53

I was wondering what the easiest way is to wrap an element with another element using lxml and Python. For example, if I have an HTML snippet:

<h1>The cool title</h1>
<p>Something Neat</p>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
<p>The end of the snippet</p>

And I want to wrap the table element with a section element like this:

<h1>The cool title</h1>
<p>Something Neat</p>
<section>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
</section>
<p>The end of the snippet</p>

Another thing I would like to do is scour the XML document for h1s with a certain attribute and then wrap each one, together with all of the elements up to the next h1 tag, in a section element. For example:

<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>

Converted to:

<section>
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
</section>
<section>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
</section>

Thanks for all the help,
Chris

东走西顾 2024-11-15 02:22:53

lxml's awesome for parsing well-formed XML, but it's not so good if you've got non-XHTML HTML. If that's the case, then go for BeautifulSoup as suggested by systemizer.
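If you do go the BeautifulSoup route, wrapping the table might look roughly like the sketch below. It assumes the bs4 package is installed and uses the standard-library "html.parser" backend; the SNIPPET name is just for illustration and holds the HTML from the question.

```python
from bs4 import BeautifulSoup

# The snippet from the question (a fragment, not a well-formed XML document).
SNIPPET = """<h1>The cool title</h1>
<p>Something Neat</p>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
<p>The end of the snippet</p>"""

soup = BeautifulSoup(SNIPPET, "html.parser")

for table in soup.find_all("table"):
    # wrap() places the table inside the new tag, at the table's old position.
    table.wrap(soup.new_tag("section"))

print(soup)
```

Since the parser here is lenient, the same loop should also cope with markup that an XML parser would reject.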

With lxml, this is a fairly easy way to insert a section around all tables in the document:

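import lxml.etree as ET  # alias matches the ET.* calls below

TEST="<html><h1>...</html>"

def insert_section(root):
    # findall() returns a list, so the tree can be modified while looping
    tables = root.findall(".//table")
    for table in tables:
        section = ET.Element("section")
        table.addprevious(section)
        # lxml moves tail text along with an element; keep it after the section
        section.tail, table.tail = table.tail, None
        section.insert(0, table)   # this moves the table

root = ET.fromstring(TEST)
insert_section(root)
print(ET.tostring(root).decode())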

You could do something similar to wrap the headings, but you would need to iterate through all the elements you want to wrap and move them to the section. element.index(child) and list slices might help here.
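Here is one way that could look, as a rough sketch rather than a definitive solution. It walks siblings with getnext() instead of index() and slices. The wrap_headings name, its class_name parameter, and the <body> wrapper around the sample are my own assumptions, and it only handles headings that are direct children of a single parent element.

```python
import lxml.etree as ET

def wrap_headings(parent, class_name="neat"):
    """Wrap each matching h1 and its following siblings (up to the next h1) in a section."""
    # Snapshot the matching headings first, because the tree is modified below.
    headings = [child for child in parent
                if child.tag == "h1" and child.get("class") == class_name]
    for h1 in headings:
        # Collect the h1 plus every following sibling up to the next h1.
        group = [h1]
        sibling = h1.getnext()
        while sibling is not None and sibling.tag != "h1":
            group.append(sibling)
            sibling = sibling.getnext()
        section = ET.Element("section")
        h1.addprevious(section)
        # Keep any text after the last wrapped element outside the section.
        section.tail = group[-1].tail
        group[-1].tail = None
        for element in group:
            section.append(element)  # append() moves the element in lxml

# Sample from the question, wrapped in a hypothetical <body> so it parses as XML.
SAMPLE = (
    "<body>"
    "<h1 class='neat'>Subject 1</h1>"
    "<p>Here is a bunch of boring text</p>"
    "<h2>Minor Heading</h2>"
    "<p>Here is some more</p>"
    "<h1 class='neat'>Subject 2</h1>"
    "<p>And Even More</p>"
    "</body>"
)

root = ET.fromstring(SAMPLE)
wrap_headings(root)
print(ET.tostring(root, pretty_print=True).decode())
```

Any h1 ends a group here, whether or not it has the class, which matches "until the next h1 tag" in the question; headings nested at different depths would need a different traversal.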