Python, lxml, and removing the outer tag when using lxml.html.tostring(el)

Posted on 2025-01-05 18:27:53

I am using the code below to get all of the HTML content of a section so that I can save it to a database:

el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)

The product description has a tag that looks like this:

<div id='productDescription'>

     <THE HTML CODE I WANT>

</div>

The code works great and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> tag and the closing </div> tag?

Comments (4)

若能看破又如何 2025-01-12 18:27:54

Here is a function that does what you want.

import lxml.etree

def strip_outer(xml):
    """
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML         http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd">
    ...   <mrow>
    ...     <msup>
    ...       <mi>x</mi>
    ...       <mn>2</mn>
    ...     </msup>
    ...     <mo> + </mo>
    ...     <mi>x</mi>
    ...   </mrow>
    ... </math>'''
    >>> so = strip_outer(xml)
    >>> so.splitlines()[0]=='<mrow>'
    True

    """
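    # Rename the default-namespace declaration so the elements are parsed
    # without a namespace and strip_tags can match 'math' by its plain name.
    xml = xml.replace('xmlns=', 'xmlns:x=')
    # strip_tags does not remove the element it is called on, so wrap the
    # input in a temporary <root> element first.
    xml = '<root>\n' + xml + '\n</root>'
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx, 'math')  # drop <math> and its attributes, keep its children
    uc = lxml.etree.tounicode(rx)
    uc = '\n'.join(uc.splitlines()[1:-1])  # remove the temporary <root> lines again
    return uc.strip()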
記柔刀 2025-01-12 18:27:54

Use a regular expression.

import re
from lxml.html import tostring

def strip_outer_tag(html_fragment):
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

# encoding='unicode' makes tostring return str, which the str pattern needs on Python 3
html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))
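
As a quick check, the pattern can be tried on a plain string without lxml. This is only a sketch: the sample markup and the `None` check are additions for illustration and are not part of the answer above. The pattern assumes the string begins with the opening tag and ends with the closing tag, so any text after the closing tag (for example an element tail) would make `search` return `None`.

```python
import re

# One opening tag at the start, one closing tag at the end, and a capture
# group for everything in between (DOTALL lets '.' match newlines too).
OUTER_TAG = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)

def strip_outer_tag(html_fragment):
    match = OUTER_TAG.search(html_fragment)
    if match is None:
        # Illustrative guard: the fragment is not wrapped in a single outer tag.
        raise ValueError("fragment is not wrapped in a single outer tag")
    return match.group(1)

# Hypothetical sample input modelled on the question's markup.
sample = "<div id='productDescription'>\n  the <b>html code</b> i want\n</div>"
inner = strip_outer_tag(sample)
```

The regular expression never parses the markup, so it is best reserved for strings that were just produced by `tostring` and are known to have a single wrapping element.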
相思故 2025-01-12 18:27:53

You could convert each child to string individually:

text = el.text or ''  # el.text is None when the element has no leading text
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())

Or, in an even more hackish way (note that this modifies el in place):

el.attrib.clear()
el.tag = '|||'
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]
余生一个溪 2025-01-12 18:27:53

If your productDescription div contains mixed text and element content, e.g.

<div id='productDescription'>
  the
  <b> html code </b>
  i want
</div>

you can get the content as a string by traversing it with xpath('node()'):

s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text nodes; use basestring on Python 2
        s += node
    else:
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')