Find all tags with a specific attribute value
How can I iterate over all tags that have a specific attribute with a specific value? For instance, say we only need data1, data2, and so on.
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, it is not this"> ... </interesting tag>
<interesting attrib1="yes, this is what we want">
<group>
<line>
data
</line>
</group>
<group>
<line>
data1
<line>
</group>
<group>
<line>
data2
<line>
</group>
</interesting>
</body>
</html>
I tried BeautifulSoup but it can't parse the file. lxml's parser, though, seems to work:
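from io import StringIO
from lxml import etree

# get_sanitized_data and SITE are defined elsewhere in my script
broken_html = get_sanitized_data(SITE)
parser = etree.HTMLParser()  # lenient parser that recovers from invalid markup
tree = etree.parse(StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)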
I am not familiar with its API, and I could not figure out how to use either getiterator or xpath.
Here's one way, using lxml and the XPath
'descendant::*[@attrib1="yes, this is what we want"]'
The XPath tells lxml to look at all the descendants of the current node and return those with an attrib1 attribute equal to "yes, this is what we want".
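As a rough illustration of that XPath in use, here is a minimal sketch. It assumes Python 3 with lxml installed. SAMPLE_HTML is a trimmed, tidied stand-in for the markup in the question, and find_line_texts is a hypothetical helper name, not part of lxml. The attribute value is passed as an XPath variable ($value) so quotes inside it need no escaping.

```python
from io import StringIO

from lxml import etree

# Trimmed stand-in for the question's markup (an assumption, not the real page).
SAMPLE_HTML = """
<html><body>
<interesting attrib1="naah, it is not this"><group><line>skip</line></group></interesting>
<interesting attrib1="yes, this is what we want">
  <group><line>data</line></group>
  <group><line>data1</line></group>
  <group><line>data2</line></group>
</interesting>
</body></html>
"""


def find_line_texts(html, attr_value):
    """Return the stripped text of every <line> under elements whose
    attrib1 attribute equals attr_value (hypothetical helper)."""
    parser = etree.HTMLParser()  # lenient parser, keeps unknown tags
    root = etree.parse(StringIO(html), parser).getroot()
    # $value is an XPath variable, filled in from the keyword argument.
    matches = root.xpath('descendant::*[@attrib1=$value]', value=attr_value)
    texts = []
    for element in matches:
        # iter('line') walks all <line> descendants in document order.
        for line in element.iter('line'):
            text = (line.text or '').strip()
            if text:
                texts.append(text)
    return texts


line_texts = find_line_texts(SAMPLE_HTML, "yes, this is what we want")
print(line_texts)
```

Note that this returns every line under the matching element, including the first "data" entry. If only some of them are wanted, filter the resulting list afterwards.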