Get the HTML tag for a partial string match using XPath
The HTML is not known in advance, but it contains the string "PRICE". That string has to be partially matched against the HTML text using XPath, and when it matches, I need the path of the HTML tag that contains it.
Note: I need to automate this logic for multiple sites, so I have to use a generic rule (locate "Price", then fetch the parent tag).
Here is an example:
html="""<div id = "price_id">
<span id = "id1"></span>
<div class="price_class">
<bold>
<strong>
<label>PRICE:</label> 125 Rs.
</bold>
</strong>
</br>
</br>
</div>"""
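I used lxml:
from lxml.html import fromstring
from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure=False)
cl = cleaner.clean_html(html)
cleaned_html = fromstring(cl)
# iter() walks every descendant; a plain loop only visits direct children
for element in cleaned_html.iter():
    # partial match, because the label text is "PRICE:" rather than "PRICE"
    if element.text and 'PRICE' in element.text:
        print("matched")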
How would this be written as an XPath expression? I just need to get the path of the div with the class using XPath.
The other problem is what to do once I have located the "PRICE:" string. I need its valid parent tag, which is the div with class name "price_class", so I have to skip or remove unwanted tags such as font, bold, italic, and so on.
Could you suggest how to get the valid parent tag of the located string?
Comments (2)
You can use the ancestor axis.
EDIT: A more general solution for the modified question:
It searches for an element with the text "PRICE:" and then selects the first ancestor, skipping strong, bold, and italic. You can add more tags to the exclude list.
Instead of an exclude list, you can search for the first good ancestor (like div, ul, etc.).
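The expressions from this answer did not survive in the text above, so the following is only a minimal sketch of the two ideas it describes, not the answerer's original code. It assumes lxml, a well-nested variant of the question's sample, and hypothetical helper names (`first_ancestor_excluding`, `first_ancestor_including`); the tag lists are illustrative and can be extended.

```python
from lxml import html as lxml_html

# Well-nested variant of the question's sample (an assumption for this sketch).
SAMPLE = """
<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
    <br/>
  </div>
</div>
"""

# Formatting tags to skip while walking up from the matched element.
SKIP_TAGS = ("b", "bold", "strong", "i", "italic", "em", "font")
# Structural tags accepted as a "good" ancestor.
GOOD_TAGS = ("div", "ul", "li", "td", "p")


def first_ancestor_excluding(root, needle, skip_tags=SKIP_TAGS):
    """Nearest ancestor of the matching element whose tag is not in skip_tags."""
    tests = " and ".join("not(self::%s)" % tag for tag in skip_tags)
    # ancestor:: is a reverse axis, so [1] picks the closest remaining ancestor.
    xpath = "//*[contains(text(), $needle)]/ancestor::*[%s][1]" % tests
    hits = root.xpath(xpath, needle=needle)
    return hits[0] if hits else None


def first_ancestor_including(root, needle, good_tags=GOOD_TAGS):
    """Nearest ancestor of the matching element whose tag is in good_tags."""
    tests = " or ".join("self::%s" % tag for tag in good_tags)
    xpath = "//*[contains(text(), $needle)]/ancestor::*[%s][1]" % tests
    hits = root.xpath(xpath, needle=needle)
    return hits[0] if hits else None


root = lxml_html.fromstring(SAMPLE)
match = first_ancestor_excluding(root, "PRICE")
print(match.tag, match.get("class"))
# getpath() returns an XPath-style location for the element that was found.
print(root.getroottree().getpath(match))
```

Both helpers return `None` when nothing matches. Note that `contains(text(), ...)` only inspects the first text node of each element and is case-sensitive, so a site that writes "Price" would need a different needle.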
Use:
XPath is a query language for XML documents. As such, it cannot modify the structure of an XML document. To do that, another language (one that hosts XPath) has to be used.
XSLT is the most appropriate language for transforming an XML document, as it was designed specifically for that purpose.
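As an illustration of the point that the host language, not XPath, does the modifying, here is a rough sketch that uses Python with lxml as the host instead of XSLT. It removes the formatting tags first and then runs a plain parent query. The helper name `parent_after_stripping`, the sample markup, and the tag list are assumptions made for this example.

```python
from lxml import etree, html as lxml_html

# Well-nested variant of the question's sample (an assumption for this sketch).
MARKUP = """
<div id="price_id">
  <span id="id1"></span>
  <div class="price_class">
    <b><strong><label>PRICE:</label> 125 Rs.</strong></b>
    <br/>
  </div>
</div>
"""

FORMATTING_TAGS = ("b", "bold", "strong", "i", "italic", "em", "font")


def parent_after_stripping(markup, needle, strip=FORMATTING_TAGS):
    """Drop formatting tags in the host language, then query the parent."""
    root = lxml_html.fromstring(markup)
    # strip_tags removes the listed elements but keeps their text and
    # children, merging them into the parent element.
    etree.strip_tags(root, *strip)
    hits = root.xpath("//*[contains(text(), $needle)]/parent::*", needle=needle)
    return hits[0] if hits else None


parent = parent_after_stripping(MARKUP, "PRICE")
print(parent.tag, parent.get("class"))
```

The trade-off is that the tree is changed before it is queried, so any path reported afterwards describes the stripped document rather than the original page.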