Using XPath to get the HTML tag whose text partially matches a string

Posted on 2024-12-24 17:00:39

The HTML is not known in advance, and it contains the string "PRICE" somewhere in its text. I need to match that string against the HTML text with a partial match in XPath, and get back the path of the HTML tag that contains it.

Note: I need to automate this logic for multiple sites, so I have to use a generic rule (locate "Price", then fetch the parent tag).

Here is an example:

html="""<div id = "price_id">
  <span id = "id1"></span>
  <div class="price_class">
   <bold>
   <strong>
   <label>PRICE:</label> 125 Rs.
   </bold>
   </strong>
   </br>
   </br>

</div>"""

I used lxml:

 from lxml.html import fromstring
 from lxml.html.clean import Cleaner

 cleaner = Cleaner(page_structure=False)
 cl = cleaner.clean_html(html)
 cleaned_html = fromstring(cl)

 # Iterates only over the direct children and tests for an exact match
 for element in cleaned_html:
     if element.text == 'PRICE':
         print("matched")

How would this be written as an XPath expression?

I just need to get the path of the div with that class, using an XPath expression.

The other problem is that once I locate the "PRICE:" string, I need to get its valid parent tag, which is the "div" with class name "price_class". To do that I have to skip or remove unwanted tags like font, bold, italic...

Could you please suggest how to get the valid parent tag of the located string?

Comments (2)

东风软 2024-12-31 17:00:39

You can use the ancestor axis:

import lxml.html

html = ...
doc = lxml.html.fromstring(html)

for element in doc.xpath('//label[contains(text(), "PRICE:")]/ancestor::div[@class="price_class"]'):
    print('Found %s: %s' % (element.tag, element.text_content().strip()))

Output:

Found div: PRICE: 125 Rs.

EDIT: A more general solution for the modified question:

doc.xpath('//*[contains(text(), "PRICE:")]/\
          ancestor::*[not(self::strong|self::bold|self::italic)][1]')

It searches for an element with the text "PRICE:" and then selects the nearest ancestor that is not strong, bold, or italic. You can add more tags to the exclude list.

Instead of an exclude list, you can search for the first acceptable ancestor (like div, ul, etc.):

doc.xpath('//*[contains(text(), "PRICE:")]/ancestor::*[self::div|self::ul][1]')
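
The exclude-list idea above can be wrapped in a small helper so the same rule is reused across sites. This is only a sketch under stated assumptions: `find_container`, `INLINE_TAGS`, and the simplified, well-formed `SAMPLE` markup are hypothetical names and data introduced for illustration, not part of the original answer. The search string is passed as an XPath variable (`$needle`), which lxml supports through keyword arguments to `xpath()`.

```python
import lxml.html

# Simplified, well-formed variant of the sample markup (an assumption for this sketch).
SAMPLE = """<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
  </div>
</div>"""

# Hypothetical list of purely presentational tags to skip; extend as needed.
INLINE_TAGS = ("b", "strong", "i", "em", "font", "bold", "italic")


def find_container(doc, needle, skip=INLINE_TAGS):
    """Return the nearest ancestor that is not in `skip` for the first
    element having a text node that contains `needle`, or None."""
    # Build "self::b or self::strong or ..." from the exclude list.
    skip_test = " or ".join("self::%s" % tag for tag in skip)
    query = (
        "//*[text()[contains(., $needle)]]"
        "/ancestor::*[not(%s)][1]" % skip_test
    )
    matches = doc.xpath(query, needle=needle)
    return matches[0] if matches else None


doc = lxml.html.fromstring(SAMPLE)
container = find_container(doc, "PRICE")
print(container.tag, container.get("class"))  # div price_class
```

Note that `text()[contains(., $needle)]` tests every text node of an element, whereas `contains(text(), ...)` in XPath 1.0 only looks at the first one; which behaviour is wanted depends on the pages being scraped.
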
场罚期间 2024-12-31 17:00:39

I just need to get the div class path using xpath expression.

Use:

//*[contains(text(), 'PRICE')]/ancestor::div[1]/@class
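
As a rough illustration, the expression can be run from Python with lxml (the library used in the question). The well-formed `SAMPLE` markup below is a simplified assumption, not the exact HTML from the question. If the tag path itself is wanted rather than the class value, lxml's `getroottree().getpath()` can produce one for the matched element.

```python
import lxml.html

# Simplified, well-formed variant of the sample markup (an assumption for this sketch).
SAMPLE = """<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
  </div>
</div>"""

doc = lxml.html.fromstring(SAMPLE)

# Selecting an attribute returns a list of strings.
classes = doc.xpath("//*[contains(text(), 'PRICE')]/ancestor::div[1]/@class")
print(classes)  # ['price_class']

# Selecting the element itself allows asking lxml for its absolute path.
div = doc.xpath("//*[contains(text(), 'PRICE')]/ancestor::div[1]")[0]
path = div.getroottree().getpath(div)
print(path)
```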

The other problem is that once I locate the "PRICE:" string, I need to
get its valid parent tag, which is the "div" with class name "price_class".
To do that I have to skip or remove unwanted tags like
font, bold, italic...

XPath is a query language for XML documents. As such it cannot modify the structure of an XML document. To do so, another language (that is hosting XPath) has to be used.

XSLT is the most appropriate language for transforming an XML document, as it was designed specifically for that purpose.
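
Since the question already uses Python with lxml, the hosting language can also do the removal directly, without XSLT. The following is a minimal sketch, assuming `lxml.etree.strip_tags` is acceptable for the job: it removes the named tags while keeping their text and children in place. The `SAMPLE` markup and the list of tags are illustrative assumptions.

```python
import lxml.html
from lxml import etree

# Simplified, well-formed variant of the sample markup (an assumption for this sketch).
SAMPLE = """<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
  </div>
</div>"""

doc = lxml.html.fromstring(SAMPLE)

# Drop presentational wrappers; their content is merged into the parent.
etree.strip_tags(doc, "b", "strong", "i", "font")

# After stripping, the label is a direct child of the div we want.
label = doc.xpath("//label[contains(text(), 'PRICE')]")[0]
parent = label.getparent()
print(parent.tag, parent.get("class"))  # div price_class
print(parent.text_content().strip())
```

With the wrappers gone, a plain `parent::*` step (or `getparent()`) is enough, at the cost of modifying the parsed tree.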
