Once I have identified the start and end of a section of an HTML document with lxml, how do I get everything in between?

Posted on 2024-09-14 07:16:15

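I am working with some HTML files and trying to figure out a way to consistently get at some text that exists in the documents. I know that the section I want begins with some bolded words, and I know that it ends with other bolded words.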

bolded_items = atree.cssselect('b')

myKeys = [item for item in bolded_items if item.text and 'KEY' in item.text]

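So myKeys is a list whose members are elements from atree, specifically the bold elements whose text contains the word 'KEY'.

I now want to identify all of the parts of the tree between any two elements in myKeys, and I want to be able to manipulate them in various ways. I have been playing around with getparent, getchildren, getnext and all of the other methods that looked likely after running dir(myKeys[0]), but I am not making progress.

Any suggestions would be appreciated.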

ㄟ。诗瑗 2024-09-21 07:16:15

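I would suggest using SAX for this task.

Basic documentation is available at http://lxml.de/sax.html#having-sax-events-from-an-elementtree-or-element

Your handler should consume events and do nothing until it receives the desired bold item, then write events to a new buffer/tree/whatever until it receives the terminating bold item.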
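A minimal sketch of such a handler follows. It assumes the markers are <b> elements whose text contains 'KEY', as in the question, and that only the text between the first two markers is wanted. The names BetweenBoldKeys and text_between_first_two_markers are made up for this illustration; lxml.sax.saxify is the lxml function that replays a tree as SAX events.

```python
from xml.sax.handler import ContentHandler


class BetweenBoldKeys(ContentHandler):
    """Buffer character data found between the first two bold markers."""

    def __init__(self, keyword="KEY"):
        super().__init__()
        self.keyword = keyword
        self.in_bold = False     # currently inside a <b> element
        self.bold_text = []      # text collected from the current <b>
        self.markers_seen = 0    # matching bold markers closed so far
        self.buffer = []         # text captured between marker 1 and marker 2

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace_uri, local_name) tuple.
        if name[1] == "b":
            self.in_bold = True
            self.bold_text = []

    def characters(self, content):
        if self.in_bold:
            self.bold_text.append(content)
        elif self.markers_seen == 1:
            self.buffer.append(content)

    def endElementNS(self, name, qname):
        if name[1] != "b":
            return
        self.in_bold = False
        text = "".join(self.bold_text)
        if self.keyword in text:
            self.markers_seen += 1
        elif self.markers_seen == 1:
            # An ordinary bold run inside the section: keep its text.
            self.buffer.append(text)


def text_between_first_two_markers(html_text, keyword="KEY"):
    # Imported here so the handler class itself only needs the standard library.
    import lxml.html
    import lxml.sax

    handler = BetweenBoldKeys(keyword)
    lxml.sax.saxify(lxml.html.fromstring(html_text), handler)
    # Collapse runs of whitespace in the captured text.
    return " ".join("".join(handler.buffer).split())
```

Because the handler only reacts to events, the same class could also write to a new tree instead of a text buffer; that part is left out here to keep the sketch short.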

清醇 2024-09-21 07:16:15

In the spirit of SO, I have figured out what I think is the best answer and am going to post it myself.

from lxml import html

# Parse the file and collect the bold elements whose text contains 'KEY'.
with open(r'c:\temp\testlxml.htm') as f:
    aTree = html.fromstring(f.read())
bolds = aTree.cssselect('b')
theBoldItems = [item for item in bolds if item.text and 'KEY' in item.text]
theTitles = [item.text for item in theBoldItems]

# Every element of the tree, in document order.
theFullList = list(aTree.iter())

# Positions of the two bold elements that delimit the section.
first = theFullList.index(theBoldItems[0])
second = theFullList.index(theBoldItems[1])

# Collect text and tail from the first marker up to (not including) the second.
theText = []
for item in theFullList[first:second]:
    if item.text:
        theText.append(item.text)
    if item.tail:
        theText.append(item.tail)

aString = ' '.join(theText)

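A little bit of explanation.

My goal is to apply some logic to the bolded parts of the documents, because the bolded runs that contain the word KEY mark the different sections of the document. theTitles is a list of the text of the bold elements that contain the word 'KEY'. Depending on my needs I might want all of the text between any two items in theTitles; I can write whatever tests and logic I need to select items from theTitles.

theBoldItems is a list of the actual elements, so for any i, theTitles[i] == theBoldItems[i].text.

Next I build theFullList, which holds every HTML element in the tree. Because lxml iterates the tree in document order, I know I want to capture everything from theBoldItems[i] up to theBoldItems[i+1]. The nice thing is that Python makes that test this easy.

I can now get the text of all of those elements, and while I still need to clean it up some, I have successfully pulled out all of the text between any two items I might want.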
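One thing to watch with the element-by-element loop: each element's tail is appended right after its own text, before the text of its children, so chunks can come out of reading order when inline tags are nested. The following is a sketch of an alternative, not the method used above. It walks start and end events so that text and tails are emitted in document order, and it skips the first marker's own text. The helper names _events and text_between are invented for this example, and the code relies only on .tag, .text, .tail and child iteration, which lxml elements provide.

```python
def _events(el):
    """Yield ("start", el) and ("end", el) pairs in document order."""
    yield "start", el
    for child in el:
        yield from _events(child)
    yield "end", el


def text_between(root, start, end):
    """Return the text chunks after `start` closes and before `end` opens."""
    chunks = []
    capturing = False
    for event, el in _events(root):
        if event == "start":
            if el is end:
                break
            # Comments and processing instructions have a non-string tag;
            # their content is skipped, but their tail is still kept below.
            if capturing and isinstance(el.tag, str) and el.text:
                chunks.append(el.text)
        else:
            if el is start:
                capturing = True
            if capturing and el.tail:
                chunks.append(el.tail)
    return chunks
```

With the names from the answer, ' '.join(text_between(aTree, theBoldItems[0], theBoldItems[1])) would give the text of the first section. The recursion depth equals the nesting depth of the document, which is assumed to be modest for ordinary HTML.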
