Getting elements and changing element text with Python and lxml
First things first: I know there are already many questions about Python and lxml on Stack Overflow, and I have read most of them, if not all. What I am looking for here is a more comprehensive answer.
I am doing some HTML conversion, and I need to parse the HTML programmatically and then make some content changes to href, img and such.
This is a simplified version of what I have right now:
from lxml.html import fromstring

with open(fileName, "r") as inFile:
    inputS = inFile.read()
myTree = fromstring(inputS)  # parse the HTML content into an element tree
breadCrumb = myTree.get_element_by_id("breadcrumb")  # a single element (raises KeyError if the id is missing)
breadCrumbContent = breadCrumb.text_content().strip()  # text content of the breadcrumb
h1 = myTree.xpath('//h1')  # another way: XPath returns a list of matching elements
h1Content = h1[0].text_content().strip()  # text content of the first h1
getTail = myTree.cssselect('table.results > tr > td > a + span + br')  # list of elements via a CSS selector (needs the cssselect package)
So basically that is what I know at the moment. Are there any other ways to get elements or attributes using lxml? I know these may not be the best ways to do it, but bear with me; I am new to this whole thing.
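For reference, here is a minimal sketch of a few other lookup styles that lxml offers besides the ones above. The sample markup and variable names are made up for illustration; only the lxml calls themselves (find, findall, iter, xpath, get) are real API.

```python
from lxml.html import fromstring

# Hypothetical sample markup, used only for illustration.
html = """
<html><body>
  <div id="breadcrumb">Home &gt; FAQ</div>
  <h1>Title</h1>
  <p><a href="a.html">A</a> <a href="b.html">B</a></p>
</body></html>
"""
tree = fromstring(html)

# find() returns the first match (or None); findall() returns a list.
first_link = tree.find(".//a")
all_links = tree.findall(".//a")

# iter() walks the whole subtree in document order, filtered by tag name.
link_texts = [a.text for a in tree.iter("a")]

# XPath can select attribute values directly, not only elements.
hrefs = tree.xpath("//a/@href")

# Attributes of a single element are read with get() (or the attrib mapping).
crumb = tree.get_element_by_id("breadcrumb")
crumb_id = crumb.get("id")

print(first_link.get("href"), len(all_links), link_texts, hrefs, crumb_id)
```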
Following is what I want to do. I have:
<img src="images/macmail10.gif" alt="" width="555" height="485" /><br />
<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">General FAQs page</a>
They can be nested inside other elements such as div, p, or anything else. What I want to do is look for those elements programmatically. For an image, I want to extract the src, do some manipulation on it, and set src to something else (for example, turn src="images/something.jpg" into src="something_images.jpg"). The same goes for href: I want to change it so that it points somewhere else.
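To make the goal concrete, a rough sketch of the attribute rewriting might look like the following. The helper flatten_image_path and the replacement host www.example.com are assumptions invented for this illustration; the renaming rule simply mirrors the images/something.jpg example.

```python
from lxml.html import fromstring, tostring

# The snippet from above, nested inside a div and a p.
html = (
    '<div><p><img src="images/macmail10.gif" alt="" width="555" height="485"><br>'
    '<a href="http://www.some_url.com/faq/general_faq.html" target="_blank">'
    'General FAQs page</a></p></div>'
)
tree = fromstring(html)


def flatten_image_path(src):
    """Hypothetical rule: 'images/name.ext' becomes 'name_images.ext'."""
    folder, _, filename = src.rpartition("/")
    if not folder:
        return src  # no folder part, leave the value unchanged
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        return f"{filename}_{folder}"  # file name without an extension
    return f"{stem}_{folder}{dot}{ext}"


# iter() finds the elements however deeply they are nested.
for img in tree.iter("img"):
    src = img.get("src")          # read the attribute
    if src:
        img.set("src", flatten_image_path(src))  # write it back

# Same idea for links; the new host is an assumed placeholder.
for a in tree.xpath(".//a[@href]"):
    a.set("href", a.get("href").replace("www.some_url.com", "www.example.com"))

print(tostring(tree, encoding="unicode"))
```

lxml.html elements also provide a rewrite_links() method that applies a function to every link in the tree, which may be more convenient when all links get the same treatment.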
Other than that, I also want to remove some elements from the tree to simplify it, for example:
<head>
<title>something goes here</title>
</head>
<div>
<p id="some_p"> Some content </p>
</div>
I want to remove the head node and the div. I can already get the p with id="some_p"; is there any way to grab its parent element? Is there also a way to remove those elements? (In this case: look for head, remove head, then look for id="some_p", get the parent, and delete it.)
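As a sketch of those steps, assuming the fragment sits inside a complete document (the extra "kept" paragraph is added only to show what survives):

```python
from lxml.html import document_fromstring, tostring

html = """<html>
<head><title>something goes here</title></head>
<body>
<div><p id="some_p"> Some content </p></div>
<p>kept</p>
</body></html>"""
tree = document_fromstring(html)

# Look for head and remove it through its parent.
head = tree.find("head")
if head is not None:
    head.getparent().remove(head)

# Look for the p by id, then climb to its parent with getparent().
p = tree.get_element_by_id("some_p")
parent_div = p.getparent()

# drop_tree() (lxml.html only) removes the element and its children
# but keeps any tail text that followed it; remove() would discard that text.
parent_div.drop_tree()

print(tostring(tree, encoding="unicode"))
```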
Thank you!
==================================================
UPDATE: I have already found a solution to this and finished coding it using lxml.etree. I will post the answer as soon as Stack Overflow allows me to. I truly hope that the answer to this question will help other people who have to deal with HTML parsing!
Comments (1)
lxml and ElementTree are quite similar. In fact, the ElementTree portion of the lxml documentation site just points to ElementTree's documentation. You might try working through the ElementTree tutorials and examples at the bottom of the overview page. Since ElementTree is part of the Python distribution, it tends to be widely documented (and easily Googled). Once you grok that, extend it with some of the lxml magic not originally found in ElementTree if you need to. For example, lxml maintains parent relationships for every element and ElementTree does not. You can add parent relationships to ElementTree, but it is not an easy example to start with.
That's how I learned it.
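For what it is worth, one common way to approximate parent relationships in the standard-library ElementTree is to build a child-to-parent dictionary up front. This is only a sketch with made-up sample XML and names; note that ElementTree parses well-formed XML, not arbitrary HTML, and the map is not updated when the tree changes.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample, for illustration only.
xml = '<root><div><p id="some_p">Some content</p></div></root>'
root = ET.fromstring(xml)

# Map every child element to its parent in one pass over the tree.
parent_of = {child: parent for parent in root.iter() for child in parent}

p = root.find(".//p[@id='some_p']")
div = parent_of[p]             # the parent of p
parent_of[div].remove(div)     # remove the div (and p) from its own parent

print(ET.tostring(root, encoding="unicode"))
```

With lxml the same lookup is just p.getparent(), which is the kind of convenience the answer refers to.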