Avoiding a second loop when testing for elements in lxml

Posted on 2025-01-06 02:39:00

I am processing some tables with lxml. The original source files are Excel files saved in MHTML format. I need to find the rows that contain header ('th') elements. I want to use the header elements, but I also need the rows they came from so that I process everything in order.

So far I have been finding all of the th elements and then calling e.getparent() on each one to get its row (since a th is a child of a row). But I end up pulling the th elements twice: once to find them and get the rows, and again to take them out of the rows to parse the data I am looking for. This can't be the best way to do it, so I am wondering whether I am missing something.

Here is my code:

from lxml import html

path = 'c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls'
# Decode as UTF-8, replacing any undecodable bytes
with open(path, encoding='utf-8', errors='replace') as f:
    theString = f.read()
theTree = html.fromstring(theString)
tables = [e for e in theTree.iter() if e.tag == 'table']
for table in tables:
    # First pass: find every th in the table
    headerCells = [e for e in table.iter() if e.tag == 'th']
    headerRows = []
    for headerCell in headerCells:
        # Collect each parent row once, in document order
        if headerCell.getparent().tag == 'tr':
            if headerCell.getparent() not in headerRows:
                headerRows.append(headerCell.getparent())
    for headerRow in headerRows:
        # Second pass: pull the th elements out of each row again
        newHeaderCells = [e for e in headerRow.iter() if e.tag == 'th']
        # Now I will extract some data and attributes from the th elements

Comments (2)

对不⑦ 2025-01-13 02:39:00

Iterate over all tr tags, and move on to the next one whenever you find no th inside.

EDIT: This is how:

from lxml import html

path = 'c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls'
with open(path, encoding='utf-8', errors='replace') as f:
    theString = f.read()
theTree = html.fromstring(theString)
for table in theTree.iter('table'):
    # findall('tr') matches only rows that are direct children of the table
    for row in table.findall('tr'):
        headerCells = row.findall('th')
        if headerCells:
            # extract data from row and headerCells
            pass
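The row-first idea can be tried without the original workbook. The following is a minimal sketch, assuming a small made-up table (SAMPLE) in place of the real file; header_rows is a hypothetical helper name, not part of lxml. It uses iter('tr') rather than findall('tr') so that rows wrapped in thead or tbody are also visited.

```python
from lxml import html

# Hypothetical sample markup standing in for the exported workbook.
SAMPLE = (
    '<table>'
    '<tr><th>Year</th><th>Revenue</th></tr>'
    '<tr><td>2010</td><td>100</td></tr>'
    '<tr><th>Total</th><td>100</td></tr>'
    '</table>'
)


def header_rows(tree):
    """Yield (row, header_cells) for each <tr> that has a direct <th> child."""
    for row in tree.iter('tr'):
        cells = row.findall('th')  # direct children only
        if cells:  # rows without any th are skipped
            yield row, cells


tree = html.fromstring(SAMPLE)
# One list of header texts per header row, in document order.
result = [[cell.text_content() for cell in cells]
          for _row, cells in header_rows(tree)]
print(result)
```

Because iter('tr') descends into the whole subtree, rows of any nested table would be visited as well; if that matters, the direct-children findall('tr') from the answer is the stricter choice.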

白昼 2025-01-13 02:39:00

To avoid doing it twice, you could use a dictionary keyed by row element and accumulate all the header cells from a given row into an associated list, which can be done in a single pass through the table's elements. To keep the rows in the order in which they were seen, you can use an OrderedDict from the standard library's collections module. This would allow something along these lines to be written:

from collections import OrderedDict
from lxml import html

f = 'c:\\secexcel\\1314054-R20110331-C20101231-F60-SEQ132.xls'
with open(f, encoding='utf-8', errors='replace') as fh:
    theString = fh.read()
theTree = html.fromstring(theString)
tables = [e for e in theTree.iter() if e.tag == 'table']
for table in tables:
    # Map each row to its th cells, in the order the rows are first seen
    headerRowDict = OrderedDict()
    for e in table.iter():
        if e.tag == 'th':
            headerRowDict.setdefault(e.getparent(), []).append(e)
    for headerRow, headerCells in headerRowDict.items():
        for headerCell in headerCells:
            # extract data and attributes from the <th> element from the row...
            pass
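As a self-contained illustration of the single-pass grouping, here is a sketch under the assumption of a small invented table (SAMPLE); group_headers_by_row is a hypothetical helper. It assumes that lxml hands back the same element object for a row each time getparent() is called while that object is still referenced, which is what makes the row usable as a dictionary key.

```python
from collections import OrderedDict

from lxml import html

# Hypothetical sample markup; the id attributes exist only for the demo.
SAMPLE = (
    '<table>'
    '<tr><th id="h1">Year</th><th id="h2">Revenue</th></tr>'
    '<tr><td>2010</td><td>100</td></tr>'
    '<tr><th id="h3">Total</th><td>100</td></tr>'
    '</table>'
)


def group_headers_by_row(table):
    """Return an OrderedDict mapping each <tr> to the list of its <th> cells."""
    rows = OrderedDict()
    for cell in table.iter('th'):  # single pass over the th elements
        parent = cell.getparent()
        if parent is not None and parent.tag == 'tr':
            rows.setdefault(parent, []).append(cell)
    return rows


table = html.fromstring(SAMPLE)
grouped = group_headers_by_row(table)
# One list of th ids per header row, in the order the rows were first seen.
ids = [[cell.get('id') for cell in cells] for cells in grouped.values()]
print(ids)
```

On Python 3.7 and later a plain dict preserves insertion order as well, so OrderedDict is mainly needed on older interpreters.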