Getting the title attribute with lxml in Python

Posted on 2024-11-19 04:09:06

I want to extract the oneliner texts from this website using Python. The messages in the HTML look like this:

<div class="olh_message"> 
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p> 
</div> 

My code looks like this so far:

import lxml.html
url = "http://www.scenemusic.net/demovibes/oneliner/"
xpath = "//div[@class='olh_message']/p"
tree = lxml.html.parse(url)
texts = tree.xpath(xpath)
texts = [text.text_content() for text in texts]
print(texts)

However, this only gives me foobarbaz. I would also like to get the title attribute of the img elements inside the paragraph, so in this example the result should be foobarbaz :necta:. It seems I need lxml's DOM parser to do this, but I have no idea how. Can anyone give me a hint?

Thanks in advance!

Comments (2)

尘曦 2024-11-26 04:09:06

Try this:

  import lxml.etree
  url = "http://www.scenemusic.net/demovibes/oneliner/"
  parser = lxml.etree.HTMLParser()
  tree = lxml.etree.parse(url, parser)
  # Select the title attribute of every img inside a message paragraph
  texts = tree.xpath("//div[@class='olh_message']/p/img/@title")
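
To see what this expression returns without fetching the live page, here is a minimal sketch that runs it against the HTML fragment from the question. The `SAMPLE` string and the `titles` name are illustrative assumptions, not part of the original answer. Note that the expression yields only the title attributes, not the surrounding text.

```python
import lxml.html

# Assumed sample: the fragment from the question, wrapped in a minimal document.
SAMPLE = """<html><body>
<div class="olh_message">
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
</body></html>"""

tree = lxml.html.fromstring(SAMPLE)

# The attribute axis (@title) returns string results, one per matching img.
titles = [str(t) for t in tree.xpath("//div[@class='olh_message']/p/img/@title")]
print(titles)
```

Because the paragraph text is not part of the result, this covers only half of what the question asks for.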

和我恋爱吧 2024-11-26 04:09:06

Use:

//div[@class='olh_message']/p/node()

This selects all child nodes (elements, text nodes, PIs and comment nodes) of any p element that is a child of any div element whose class attribute is 'olh_message'.

Verification using XSLT as the host of XPath:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select="//div[@class='olh_message']/p/node()"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<div class="olh_message">
    <p>foobarbaz 
        <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" />
    </p>
</div>

the wanted, correct result is produced (showing that the XPath expression selects exactly the wanted nodes):

foobarbaz 
        <img src="/static/emoticons/support-our-fruits.gif" title=":necta:"/>
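
On the Python side, the same `node()` step could be used to build the combined string the question asks for. The following is a rough sketch, not the answerer's own code: the `SAMPLE` string and the `message_text` helper are hypothetical names introduced here. It relies on lxml returning text nodes as string objects and element nodes as element objects from `xpath()`.

```python
import lxml.html

# Assumed sample: the fragment from the question, wrapped in a minimal document.
SAMPLE = """<html><body>
<div class="olh_message">
    <p>foobarbaz <img src="/static/emoticons/support-our-fruits.gif" title=":necta:" /></p>
</div>
</body></html>"""


def message_text(p):
    """Join a paragraph's child nodes, replacing each img with its title."""
    parts = []
    for node in p.xpath("node()"):
        if isinstance(node, str):
            # Text nodes come back as (subclasses of) str.
            parts.append(node)
        elif not isinstance(node.tag, str):
            # Comments and processing instructions have a non-string tag.
            continue
        elif node.tag == "img":
            parts.append(node.get("title", ""))
        else:
            # Any other child element contributes its text content.
            parts.append(node.text_content())
    return "".join(parts).strip()


tree = lxml.html.fromstring(SAMPLE)
texts = [message_text(p) for p in tree.xpath("//div[@class='olh_message']/p")]
print(texts)  # ['foobarbaz :necta:']
```

This sketch only looks at direct children of the p element, so an img nested inside another element (for example a link) would not have its title included.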